Open Vimos opened 7 years ago
Is it possible to share the ORC file? I can try to take a look at it.
I'm sorry, I am not permitted to share the data, but I can offer more debug info from Java.
/usr/lib/jvm/java-8-oracle/bin/java -agentlib:jdwp=transport=dt_socket,address=127.0.0.1:32805,suspend=y,server=n -Dfile.encoding=UTF-8 -classpath /usr/lib/jvm/java-8-oracle/jre/lib/charsets.jar:/usr/lib/jvm/java-8-oracle/jre/lib/deploy.jar:/usr/lib/jvm/java-8-oracle/jre/lib/ext/cldrdata.jar:/usr/lib/jvm/java-8-oracle/jre/lib/ext/dnsns.jar:/usr/lib/jvm/java-8-oracle/jre/lib/ext/jaccess.jar:/usr/lib/jvm/java-8-oracle/jre/lib/ext/jfxrt.jar:/usr/lib/jvm/java-8-oracle/jre/lib/ext/localedata.jar:/usr/lib/jvm/java-8-oracle/jre/lib/ext/nashorn.jar:/usr/lib/jvm/java-8-oracle/jre/lib/ext/sunec.jar:/usr/lib/jvm/java-8-oracle/jre/lib/ext/sunjce_provider.jar:/usr/lib/jvm/java-8-oracle/jre/lib/ext/sunpkcs11.jar:/usr/lib/jvm/java-8-oracle/jre/lib/ext/zipfs.jar:/usr/lib/jvm/java-8-oracle/jre/lib/javaws.jar:/usr/lib/jvm/java-8-oracle/jre/lib/jce.jar:/usr/lib/jvm/java-8-oracle/jre/lib/jfr.jar:/usr/lib/jvm/java-8-oracle/jre/lib/jfxswt.jar:/usr/lib/jvm/java-8-oracle/jre/lib/jsse.jar:/usr/lib/jvm/java-8-oracle/jre/lib/management-agent.jar:/usr/lib/jvm/java-8-oracle/jre/lib/plugin.jar:/usr/lib/jvm/java-8-oracle/jre/lib/resources.jar:/usr/lib/jvm/java-8-oracle/jre/lib/rt.jar:/home/vimos/Public/github/ml/python-orc/java-gateway/target/classes:/data/home/vimos/.m2/repository/net/sf/py4j/py4j/0.10.2.1/py4j-0.10.2.1.jar:/data/home/vimos/.m2/repository/org/apache/orc/orc-core/1.1.1/orc-core-1.1.1.jar:/data/home/vimos/.m2/repository/com/google/protobuf/protobuf-java/2.5.0/protobuf-java-2.5.0.jar:/data/home/vimos/.m2/repository/commons-cli/commons-cli/1.2/commons-cli-1.2.jar:/data/home/vimos/.m2/repository/commons-lang/commons-lang/2.6/commons-lang-2.6.jar:/data/home/vimos/.m2/repository/org/apache/hadoop/hadoop-common/2.6.0/hadoop-common-2.6.0.jar:/data/home/vimos/.m2/repository/org/apache/hadoop/hadoop-annotations/2.6.0/hadoop-annotations-2.6.0.jar:/usr/lib/jvm/java-8-oracle/lib/tools.jar:/data/home/vimos/.m2/repository/org/apache/commons/commons-math3/3.1.1/commons-math3-3.1.1.jar:/data/home
/vimos/.m2/repository/xmlenc/xmlenc/0.52/xmlenc-0.52.jar:/data/home/vimos/.m2/repository/commons-httpclient/commons-httpclient/3.1/commons-httpclient-3.1.jar:/data/home/vimos/.m2/repository/commons-codec/commons-codec/1.4/commons-codec-1.4.jar:/data/home/vimos/.m2/repository/commons-io/commons-io/2.4/commons-io-2.4.jar:/data/home/vimos/.m2/repository/commons-net/commons-net/3.1/commons-net-3.1.jar:/data/home/vimos/.m2/repository/commons-collections/commons-collections/3.2.1/commons-collections-3.2.1.jar:/data/home/vimos/.m2/repository/com/sun/jersey/jersey-core/1.9/jersey-core-1.9.jar:/data/home/vimos/.m2/repository/com/sun/jersey/jersey-json/1.9/jersey-json-1.9.jar:/data/home/vimos/.m2/repository/com/sun/xml/bind/jaxb-impl/2.2.3-1/jaxb-impl-2.2.3-1.jar:/data/home/vimos/.m2/repository/javax/xml/bind/jaxb-api/2.2.2/jaxb-api-2.2.2.jar:/data/home/vimos/.m2/repository/javax/xml/stream/stax-api/1.0-2/stax-api-1.0-2.jar:/data/home/vimos/.m2/repository/javax/activation/activation/1.1/activation-1.1.jar:/data/home/vimos/.m2/repository/org/codehaus/jackson/jackson-jaxrs/1.8.3/jackson-jaxrs-1.8.3.jar:/data/home/vimos/.m2/repository/org/codehaus/jackson/jackson-xc/1.8.3/jackson-xc-1.8.3.jar:/data/home/vimos/.m2/repository/com/sun/jersey/jersey-server/1.9/jersey-server-1.9.jar:/data/home/vimos/.m2/repository/asm/asm/3.1/asm-3.1.jar:/data/home/vimos/.m2/repository/tomcat/jasper-compiler/5.5.23/jasper-compiler-5.5.23.jar:/data/home/vimos/.m2/repository/tomcat/jasper-runtime/5.5.23/jasper-runtime-5.5.23.jar:/data/home/vimos/.m2/repository/commons-el/commons-el/1.0/commons-el-1.0.jar:/data/home/vimos/.m2/repository/commons-logging/commons-logging/1.1.3/commons-logging-1.1.3.jar:/data/home/vimos/.m2/repository/log4j/log4j/1.2.17/log4j-1.2.17.jar:/data/home/vimos/.m2/repository/net/java/dev/jets3t/jets3t/0.9.0/jets3t-0.9.0.jar:/data/home/vimos/.m2/repository/org/apache/httpcomponents/httpclient/4.1.2/httpclient-4.1.2.jar:/data/home/vimos/.m2/repository/org/apache/httpcomponents/httpc
ore/4.1.2/httpcore-4.1.2.jar:/data/home/vimos/.m2/repository/com/jamesmurty/utils/java-xmlbuilder/0.4/java-xmlbuilder-0.4.jar:/data/home/vimos/.m2/repository/commons-configuration/commons-configuration/1.6/commons-configuration-1.6.jar:/data/home/vimos/.m2/repository/commons-digester/commons-digester/1.8/commons-digester-1.8.jar:/data/home/vimos/.m2/repository/commons-beanutils/commons-beanutils/1.7.0/commons-beanutils-1.7.0.jar:/data/home/vimos/.m2/repository/commons-beanutils/commons-beanutils-core/1.8.0/commons-beanutils-core-1.8.0.jar:/data/home/vimos/.m2/repository/org/slf4j/slf4j-log4j12/1.7.5/slf4j-log4j12-1.7.5.jar:/data/home/vimos/.m2/repository/org/codehaus/jackson/jackson-core-asl/1.9.13/jackson-core-asl-1.9.13.jar:/data/home/vimos/.m2/repository/org/codehaus/jackson/jackson-mapper-asl/1.9.13/jackson-mapper-asl-1.9.13.jar:/data/home/vimos/.m2/repository/com/google/code/gson/gson/2.2.4/gson-2.2.4.jar:/data/home/vimos/.m2/repository/org/apache/hadoop/hadoop-auth/2.6.0/hadoop-auth-2.6.0.jar:/data/home/vimos/.m2/repository/org/apache/directory/server/apacheds-kerberos-codec/2.0.0-M15/apacheds-kerberos-codec-2.0.0-M15.jar:/data/home/vimos/.m2/repository/org/apache/directory/server/apacheds-i18n/2.0.0-M15/apacheds-i18n-2.0.0-M15.jar:/data/home/vimos/.m2/repository/org/apache/directory/api/api-asn1-api/1.0.0-M20/api-asn1-api-1.0.0-M20.jar:/data/home/vimos/.m2/repository/org/apache/directory/api/api-util/1.0.0-M20/api-util-1.0.0-M20.jar:/data/home/vimos/.m2/repository/org/apache/curator/curator-framework/2.6.0/curator-framework-2.6.0.jar:/data/home/vimos/.m2/repository/com/jcraft/jsch/0.1.42/jsch-0.1.42.jar:/data/home/vimos/.m2/repository/org/apache/curator/curator-client/2.6.0/curator-client-2.6.0.jar:/data/home/vimos/.m2/repository/org/apache/curator/curator-recipes/2.6.0/curator-recipes-2.6.0.jar:/data/home/vimos/.m2/repository/org/htrace/htrace-core/3.0.4/htrace-core-3.0.4.jar:/data/home/vimos/.m2/repository/org/apache/zookeeper/zookeeper/3.4.6/zookeeper-3.4.
6.jar:/data/home/vimos/.m2/repository/org/apache/commons/commons-compress/1.4.1/commons-compress-1.4.1.jar:/data/home/vimos/.m2/repository/org/tukaani/xz/1.0/xz-1.0.jar:/data/home/vimos/.m2/repository/org/apache/hadoop/hadoop-hdfs/2.6.0/hadoop-hdfs-2.6.0.jar:/data/home/vimos/.m2/repository/commons-daemon/commons-daemon/1.0.13/commons-daemon-1.0.13.jar:/data/home/vimos/.m2/repository/io/netty/netty/3.6.2.Final/netty-3.6.2.Final.jar:/data/home/vimos/.m2/repository/xerces/xercesImpl/2.9.1/xercesImpl-2.9.1.jar:/data/home/vimos/.m2/repository/xml-apis/xml-apis/1.3.04/xml-apis-1.3.04.jar:/data/home/vimos/.m2/repository/org/apache/hive/hive-storage-api/2.1.0-pre-orc/hive-storage-api-2.1.0-pre-orc.jar:/data/home/vimos/.m2/repository/org/codehaus/jettison/jettison/1.1/jettison-1.1.jar:/data/home/vimos/.m2/repository/stax/stax-api/1.0.1/stax-api-1.0.1.jar:/data/home/vimos/.m2/repository/org/iq80/snappy/snappy/0.2/snappy-0.2.jar:/data/home/vimos/.m2/repository/org/slf4j/slf4j-api/1.7.5/slf4j-api-1.7.5.jar:/data/home/vimos/.m2/repository/com/google/guava/guava/14.0.1/guava-14.0.1.jar:/opt/jetbrains/idea-IU-171.4249.39/lib/idea_rt.jar com.pythonorc.SimplifiedOrcReader
Connected to the target VM, address: '127.0.0.1:32805', transport: 'socket'
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
[log_id, mhotelid, city_id, city_name, city_name_en, province_id, province_name, province_name_en, m_city_id, m_province_id, log_type, uid, source, os_version, card_number, user_name, user_ip, appid, user_agent, trace_id, latitude, longitude, carrier, userinfo_channel, level, model, brand, orderid, proxyid, caller_attr_channel, economic_hotel, fast_filter_keywords, mhotel_ids, return_has_xianfu_hotel, return_has_yufu_hotel, hotel_brand_id, only_limitime_sale, facility_ids, theme_ids, star_rates, district_id, district_type, price_pair, payment_methods, nearby, poi_id, region_id, check_in, check_out, id, executetime, keywords, setkeywords, setbrandid, setstarrates, inner_search_type, hotel_group_id, sorting_method, setfilterattr, mrankflag, mranktype, setnearby, setfastfilter_attr, setpoi_id, sethotel_group_id, response_mhotelids, setprice_pair, star_ratessize, facility_idssize, setdistrict_type, settheme_ids, setdistrict_id, settrace_id, pageindex, pagesize, recreqattrtype, ifun, crawled_flag, geo_type, activity_flag]
80
1149130
Exception in thread "main" java.lang.IllegalArgumentException: Buffer size too small. size = 262144 needed = 3863789
at org.apache.orc.impl.InStream$CompressedStream.readHeader(InStream.java:217)
at org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:262)
at java.io.InputStream.read(InputStream.java:101)
at com.google.protobuf.CodedInputStream.refillBuffer(CodedInputStream.java:737)
at com.google.protobuf.CodedInputStream.isAtEnd(CodedInputStream.java:701)
at com.google.protobuf.CodedInputStream.readTag(CodedInputStream.java:99)
at org.apache.orc.OrcProto$StripeFooter.<init>(OrcProto.java:10679)
at org.apache.orc.OrcProto$StripeFooter.<init>(OrcProto.java:10643)
at org.apache.orc.OrcProto$StripeFooter$1.parsePartialFrom(OrcProto.java:10748)
at org.apache.orc.OrcProto$StripeFooter$1.parsePartialFrom(OrcProto.java:10743)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:89)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:95)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
at org.apache.orc.OrcProto$StripeFooter.parseFrom(OrcProto.java:10976)
at org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.readStripeFooter(RecordReaderUtils.java:165)
at org.apache.orc.impl.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:236)
at org.apache.orc.impl.RecordReaderImpl.beginReadStripe(RecordReaderImpl.java:849)
at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:820)
at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:977)
at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1012)
at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:212)
at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:579)
at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:566)
at com.pythonorc.SimplifiedOrcReader.iterator(SimplifiedOrcReader.java:70)
at com.pythonorc.SimplifiedOrcReader.main(SimplifiedOrcReader.java:285)
Disconnected from the target VM, address: '127.0.0.1:32805', transport: 'socket'
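For context, here is a minimal sketch of the check that seems to raise this error. As far as I understand the ORC format, every compressed chunk starts with a 3-byte little-endian header whose lowest bit marks an uncompressed ("original") chunk and whose remaining 23 bits give the chunk length. The function names below are hypothetical and are not the ORC Java API; the error text mirrors the one in the trace.

```python
def decode_chunk_header(header):
    """Decode a 3-byte ORC compression chunk header.

    Returns (chunk_length, is_original). Layout assumed from the ORC
    format description: little-endian (length * 2 + is_original).
    """
    if len(header) != 3:
        raise ValueError("expected exactly 3 header bytes")
    b0, b1, b2 = bytearray(header)
    value = b0 | (b1 << 8) | (b2 << 16)
    is_original = bool(value & 1)   # lowest bit: chunk stored uncompressed
    chunk_length = value >> 1       # remaining 23 bits: chunk length
    return chunk_length, is_original


def check_chunk(header, buffer_size):
    """Hypothetical stand-in for the reader's sanity check."""
    chunk_length, is_original = decode_chunk_header(header)
    if chunk_length > buffer_size:
        raise ValueError(
            "Buffer size too small. size = %d needed = %d"
            % (buffer_size, chunk_length)
        )
    return chunk_length, is_original


# Three bytes that decode to the "needed" value from the trace above.
sample_header = bytes(bytearray([218, 233, 117]))
print(decode_chunk_header(sample_header))
```

If the three bytes at that position are not a real chunk header (for example because the reader started at a wrong offset), the decoded length would be meaningless, so the number in the message should be read with that in mind.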
That gives me a hint; let me try to work on it.
I don't have any sample that reproduces this error. But if I understand correctly, the bufferSize is stored inside the footer of the ORC file. Maybe, for some reason, the bufferSize recorded there is incorrect.
Can you check out the branch add-fetch-filemetainfo, build again, then fetch reader.fileMetaInfo and paste the output back to me? A sample output would be
{u'metadataSize': u'250', u'compressionType': u'ZLIB', u'writerVersion': u'1', u'versionLists': u'[0, 12]', u'bufferSize': u'10000'}
This info will help me to debug further into the problem.
I used the orc tools and got this
➜ src git:(master) ./orc-metadata ../../../../../python-orc/dt=2017-05-14_os=android_part=000004_0
{ "name": "../../../../../python-orc/dt=2017-05-14_os=android_part=000004_0",
"type": "struct<log_id:string,mhotelid:string,city_id:string,city_name:string,city_name_en:string,province_id:string,province_name:string,province_name_en:string,m_city_id:string,m_province_id:string,log_type:string,uid:string,source:string,os_version:string,card_number:string,user_name:string,user_ip:string,appid:string,user_agent:string,trace_id:string,latitude:string,longitude:string,carrier:string,userinfo_channel:string,level:string,model:string,brand:string,orderid:string,proxyid:string,caller_attr_channel:string,economic_hotel:string,fast_filter_keywords:string,mhotel_ids:string,return_has_xianfu_hotel:string,return_has_yufu_hotel:string,hotel_brand_id:string,only_limitime_sale:string,facility_ids:string,theme_ids:string,star_rates:string,district_id:string,district_type:string,price_pair:string,payment_methods:string,nearby:string,poi_id:string,region_id:string,check_in:string,check_out:string,id:string,executetime:string,keywords:string,setkeywords:string,setbrandid:string,setstarrates:string,inner_search_type:string,hotel_group_id:string,sorting_method:string,setfilterattr:string,mrankflag:string,mranktype:string,setnearby:string,setfastfilter_attr:string,setpoi_id:string,sethotel_group_id:string,response_mhotelids:array<string>,setprice_pair:string,star_ratessize:string,facility_idssize:string,setdistrict_type:string,settheme_ids:string,setdistrict_id:string,settrace_id:string,pageindex:string,pagesize:string,recreqattrtype:string,ifun:string,crawled_flag:string,geo_type:string,activity_flag:string>",
"rows": 1149130,
"stripe count": 3,
"format": "0.12", "writer version": "original",
"compression": "zlib", "compression block": 262144,
"file length": 116043721,
"content": 116041038, "stripe stats": 3599, "footer": 2549, "postscript": 23,
"row index stride": 10000,
"user metadata": {
},
"stripes": [
{ "stripe": 0, "rows": 575000,
"offset": 3, "length": 57272824,
"index": 72291, "data": 57198680, "footer": 1853
},
{ "stripe": 1, "rows": 510000,
"offset": 57272827, "length": 51277701,
"index": 64624, "data": 51211228, "footer": 1849
},
{ "stripe": 2, "rows": 64130,
"offset": 108550528, "length": 7490510,
"index": 12819, "data": 7475943, "footer": 1748
}
]
}
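A quick way to sanity-check the stripe layout in output like the above is sketched below. It only uses the numbers printed by orc-metadata. The helper name is hypothetical, and the contiguity check assumes stripes are written back to back, which holds for these numbers but is not guaranteed in general because writers may pad between stripes.

```python
def check_stripes(meta):
    """Return a list of inconsistencies found in orc-metadata style output."""
    problems = []
    expected_offset = None
    for s in meta["stripes"]:
        # index + data + footer should add up to the stripe length.
        parts = s["index"] + s["data"] + s["footer"]
        if parts != s["length"]:
            problems.append("stripe %d: sections sum to %d, length is %d"
                            % (s["stripe"], parts, s["length"]))
        # Assumption: no padding, so each stripe starts where the last ended.
        if expected_offset is not None and s["offset"] != expected_offset:
            problems.append("stripe %d: offset %d, expected %d"
                            % (s["stripe"], s["offset"], expected_offset))
        expected_offset = s["offset"] + s["length"]
    total_rows = sum(s["rows"] for s in meta["stripes"])
    if total_rows != meta["rows"]:
        problems.append("stripe rows sum to %d, file reports %d"
                        % (total_rows, meta["rows"]))
    return problems


# Values copied from the orc-metadata output above.
meta = {
    "rows": 1149130,
    "stripes": [
        {"stripe": 0, "rows": 575000, "offset": 3, "length": 57272824,
         "index": 72291, "data": 57198680, "footer": 1853},
        {"stripe": 1, "rows": 510000, "offset": 57272827, "length": 51277701,
         "index": 64624, "data": 51211228, "footer": 1849},
        {"stripe": 2, "rows": 64130, "offset": 108550528, "length": 7490510,
         "index": 12819, "data": 7475943, "footer": 1748},
    ],
}
print(check_stripes(meta))
```

An empty list only says the stripe directory is self-consistent; it does not prove the bytes at those offsets are intact.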
Using the new branch, I got this.
In [1]: from orcreader import OrcReader
...: reader = OrcReader('dt=2017-05-14_os=android_part=000004_0')
...: reader.open()
...:
In [2]: reader.fileMetaInfo
Out[2]: {u'metadataSize': u'3599', u'compressionType': u'ZLIB', u'writerVersion': u'0', u'versionLists': u'[0, 12]', u'bufferSize': u'262144'}
Yeah. By default the reader uses the blockSize from the file metadata, which here is "compression block": 262144.
One possible option is to let the caller override the blockSize manually. I will work on this later today.
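As a rough sketch of how an override value might be picked: parse the size and needed numbers out of the exception message and take the larger of the declared bufferSize and the needed size. Both function names are hypothetical and are not part of python-orc or the ORC library; the dictionary shape follows the reader.fileMetaInfo output shown earlier, where values are strings.

```python
import re

# Matches the message text seen in the Java stack trace.
_BUFFER_ERROR = re.compile(
    r"Buffer size too small\. size = (\d+) needed = (\d+)"
)


def parse_buffer_error(message):
    """Return (size, needed) from the error message, or None if no match."""
    match = _BUFFER_ERROR.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def suggest_buffer_size(file_meta_info, message):
    """Pick a buffer size at least as large as the chunk the reader wanted."""
    declared = int(file_meta_info["bufferSize"])
    parsed = parse_buffer_error(message)
    if parsed is None:
        return declared
    _, needed = parsed
    return max(declared, needed)


meta_info = {"bufferSize": "262144", "compressionType": "ZLIB"}
error = ("java.lang.IllegalArgumentException: Buffer size too small. "
         "size = 262144 needed = 3863789")
print(suggest_buffer_size(meta_info, error))
```

This only helps if the needed value comes from a genuine chunk header. If the reader is decoding bytes at a wrong position, a larger buffer would likely just move the failure somewhere else.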
Hi, I am trying to read an ORC file.
I have successfully retrieved the schema, like this
But when I try to read rows, it reports the following error
Any suggestions on how to debug this error?