nqbao / python-orc

Python ORC reader

Py4JJavaError: An error occurred while calling o2.iterator. #1

Open Vimos opened 7 years ago

Vimos commented 7 years ago

Hi, I am trying to read an orc file.

In [1]: from orcreader import OrcReader
   ...: reader = OrcReader('dt=2017-05-14_os=android_part=000004_0')
   ...: reader.open()
   ...: 

I can successfully get the schema, like this:

In [3]: reader.schema()
Out[3]: 
OrderedDict([(u'log_id', u'string'),
             (u'city_id', u'string'),
             (u'city_name', u'string'),
             (u'city_name_en', u'string'),
             (u'province_id', u'string'),
             (u'province_name', u'string'),
 ...  
             (u'activity_flag', u'string')])

But when I try to read rows, it reports the following error:

In [2]: for row in reader:
   ...:     print row
   ...:     
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-2-df0cbab3b6b5> in <module>()
----> 1 for row in reader:
      2     print row
      3 

/usr/local/lib/python2.7/dist-packages/python_orc-0.0.1-py2.7.egg/orcreader/reader.pyc in __iter__(self)
     79 
     80     def __iter__(self):
---> 81         return OrcRecordIterator(self.reader.iterator())
     82 
     83     def __enter__(self):

/usr/local/lib/python2.7/dist-packages/py4j-0.10.4-py2.7.egg/py4j/java_gateway.pyc in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135         for temp_arg in temp_args:

/usr/local/lib/python2.7/dist-packages/py4j-0.10.4-py2.7.egg/py4j/protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
    317                 raise Py4JJavaError(
    318                     "An error occurred while calling {0}{1}{2}.\n".
--> 319                     format(target_id, ".", name), value)
    320             else:
    321                 raise Py4JError(

Py4JJavaError: An error occurred while calling o2.iterator.
: java.lang.RuntimeException: Unable to init iterator
    at com.pythonorc.SimplifiedOrcReader.iterator(SimplifiedOrcReader.java:72)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)

Any suggestions on how to debug this error?

nqbao commented 7 years ago

Is it possible to share the ORC file? I can try to take a look at it.

Vimos commented 7 years ago

I am sorry, I am not permitted to send you the data. I can offer more debug info from Java, though:

/usr/lib/jvm/java-8-oracle/bin/java -agentlib:jdwp=transport=dt_socket,address=127.0.0.1:32805,suspend=y,server=n -Dfile.encoding=UTF-8 -classpath /usr/lib/jvm/java-8-oracle/jre/lib/charsets.jar:/usr/lib/jvm/java-8-oracle/jre/lib/deploy.jar:/usr/lib/jvm/java-8-oracle/jre/lib/ext/cldrdata.jar:/usr/lib/jvm/java-8-oracle/jre/lib/ext/dnsns.jar:/usr/lib/jvm/java-8-oracle/jre/lib/ext/jaccess.jar:/usr/lib/jvm/java-8-oracle/jre/lib/ext/jfxrt.jar:/usr/lib/jvm/java-8-oracle/jre/lib/ext/localedata.jar:/usr/lib/jvm/java-8-oracle/jre/lib/ext/nashorn.jar:/usr/lib/jvm/java-8-oracle/jre/lib/ext/sunec.jar:/usr/lib/jvm/java-8-oracle/jre/lib/ext/sunjce_provider.jar:/usr/lib/jvm/java-8-oracle/jre/lib/ext/sunpkcs11.jar:/usr/lib/jvm/java-8-oracle/jre/lib/ext/zipfs.jar:/usr/lib/jvm/java-8-oracle/jre/lib/javaws.jar:/usr/lib/jvm/java-8-oracle/jre/lib/jce.jar:/usr/lib/jvm/java-8-oracle/jre/lib/jfr.jar:/usr/lib/jvm/java-8-oracle/jre/lib/jfxswt.jar:/usr/lib/jvm/java-8-oracle/jre/lib/jsse.jar:/usr/lib/jvm/java-8-oracle/jre/lib/management-agent.jar:/usr/lib/jvm/java-8-oracle/jre/lib/plugin.jar:/usr/lib/jvm/java-8-oracle/jre/lib/resources.jar:/usr/lib/jvm/java-8-oracle/jre/lib/rt.jar:/home/vimos/Public/github/ml/python-orc/java-gateway/target/classes:/data/home/vimos/.m2/repository/net/sf/py4j/py4j/0.10.2.1/py4j-0.10.2.1.jar:/data/home/vimos/.m2/repository/org/apache/orc/orc-core/1.1.1/orc-core-1.1.1.jar:/data/home/vimos/.m2/repository/com/google/protobuf/protobuf-java/2.5.0/protobuf-java-2.5.0.jar:/data/home/vimos/.m2/repository/commons-cli/commons-cli/1.2/commons-cli-1.2.jar:/data/home/vimos/.m2/repository/commons-lang/commons-lang/2.6/commons-lang-2.6.jar:/data/home/vimos/.m2/repository/org/apache/hadoop/hadoop-common/2.6.0/hadoop-common-2.6.0.jar:/data/home/vimos/.m2/repository/org/apache/hadoop/hadoop-annotations/2.6.0/hadoop-annotations-2.6.0.jar:/usr/lib/jvm/java-8-oracle/lib/tools.jar:/data/home/vimos/.m2/repository/org/apache/commons/commons-math3/3.1.1/commons-math3-3.1.1.jar:/data/home
/vimos/.m2/repository/xmlenc/xmlenc/0.52/xmlenc-0.52.jar:/data/home/vimos/.m2/repository/commons-httpclient/commons-httpclient/3.1/commons-httpclient-3.1.jar:/data/home/vimos/.m2/repository/commons-codec/commons-codec/1.4/commons-codec-1.4.jar:/data/home/vimos/.m2/repository/commons-io/commons-io/2.4/commons-io-2.4.jar:/data/home/vimos/.m2/repository/commons-net/commons-net/3.1/commons-net-3.1.jar:/data/home/vimos/.m2/repository/commons-collections/commons-collections/3.2.1/commons-collections-3.2.1.jar:/data/home/vimos/.m2/repository/com/sun/jersey/jersey-core/1.9/jersey-core-1.9.jar:/data/home/vimos/.m2/repository/com/sun/jersey/jersey-json/1.9/jersey-json-1.9.jar:/data/home/vimos/.m2/repository/com/sun/xml/bind/jaxb-impl/2.2.3-1/jaxb-impl-2.2.3-1.jar:/data/home/vimos/.m2/repository/javax/xml/bind/jaxb-api/2.2.2/jaxb-api-2.2.2.jar:/data/home/vimos/.m2/repository/javax/xml/stream/stax-api/1.0-2/stax-api-1.0-2.jar:/data/home/vimos/.m2/repository/javax/activation/activation/1.1/activation-1.1.jar:/data/home/vimos/.m2/repository/org/codehaus/jackson/jackson-jaxrs/1.8.3/jackson-jaxrs-1.8.3.jar:/data/home/vimos/.m2/repository/org/codehaus/jackson/jackson-xc/1.8.3/jackson-xc-1.8.3.jar:/data/home/vimos/.m2/repository/com/sun/jersey/jersey-server/1.9/jersey-server-1.9.jar:/data/home/vimos/.m2/repository/asm/asm/3.1/asm-3.1.jar:/data/home/vimos/.m2/repository/tomcat/jasper-compiler/5.5.23/jasper-compiler-5.5.23.jar:/data/home/vimos/.m2/repository/tomcat/jasper-runtime/5.5.23/jasper-runtime-5.5.23.jar:/data/home/vimos/.m2/repository/commons-el/commons-el/1.0/commons-el-1.0.jar:/data/home/vimos/.m2/repository/commons-logging/commons-logging/1.1.3/commons-logging-1.1.3.jar:/data/home/vimos/.m2/repository/log4j/log4j/1.2.17/log4j-1.2.17.jar:/data/home/vimos/.m2/repository/net/java/dev/jets3t/jets3t/0.9.0/jets3t-0.9.0.jar:/data/home/vimos/.m2/repository/org/apache/httpcomponents/httpclient/4.1.2/httpclient-4.1.2.jar:/data/home/vimos/.m2/repository/org/apache/httpcomponents/httpc
ore/4.1.2/httpcore-4.1.2.jar:/data/home/vimos/.m2/repository/com/jamesmurty/utils/java-xmlbuilder/0.4/java-xmlbuilder-0.4.jar:/data/home/vimos/.m2/repository/commons-configuration/commons-configuration/1.6/commons-configuration-1.6.jar:/data/home/vimos/.m2/repository/commons-digester/commons-digester/1.8/commons-digester-1.8.jar:/data/home/vimos/.m2/repository/commons-beanutils/commons-beanutils/1.7.0/commons-beanutils-1.7.0.jar:/data/home/vimos/.m2/repository/commons-beanutils/commons-beanutils-core/1.8.0/commons-beanutils-core-1.8.0.jar:/data/home/vimos/.m2/repository/org/slf4j/slf4j-log4j12/1.7.5/slf4j-log4j12-1.7.5.jar:/data/home/vimos/.m2/repository/org/codehaus/jackson/jackson-core-asl/1.9.13/jackson-core-asl-1.9.13.jar:/data/home/vimos/.m2/repository/org/codehaus/jackson/jackson-mapper-asl/1.9.13/jackson-mapper-asl-1.9.13.jar:/data/home/vimos/.m2/repository/com/google/code/gson/gson/2.2.4/gson-2.2.4.jar:/data/home/vimos/.m2/repository/org/apache/hadoop/hadoop-auth/2.6.0/hadoop-auth-2.6.0.jar:/data/home/vimos/.m2/repository/org/apache/directory/server/apacheds-kerberos-codec/2.0.0-M15/apacheds-kerberos-codec-2.0.0-M15.jar:/data/home/vimos/.m2/repository/org/apache/directory/server/apacheds-i18n/2.0.0-M15/apacheds-i18n-2.0.0-M15.jar:/data/home/vimos/.m2/repository/org/apache/directory/api/api-asn1-api/1.0.0-M20/api-asn1-api-1.0.0-M20.jar:/data/home/vimos/.m2/repository/org/apache/directory/api/api-util/1.0.0-M20/api-util-1.0.0-M20.jar:/data/home/vimos/.m2/repository/org/apache/curator/curator-framework/2.6.0/curator-framework-2.6.0.jar:/data/home/vimos/.m2/repository/com/jcraft/jsch/0.1.42/jsch-0.1.42.jar:/data/home/vimos/.m2/repository/org/apache/curator/curator-client/2.6.0/curator-client-2.6.0.jar:/data/home/vimos/.m2/repository/org/apache/curator/curator-recipes/2.6.0/curator-recipes-2.6.0.jar:/data/home/vimos/.m2/repository/org/htrace/htrace-core/3.0.4/htrace-core-3.0.4.jar:/data/home/vimos/.m2/repository/org/apache/zookeeper/zookeeper/3.4.6/zookeeper-3.4.
6.jar:/data/home/vimos/.m2/repository/org/apache/commons/commons-compress/1.4.1/commons-compress-1.4.1.jar:/data/home/vimos/.m2/repository/org/tukaani/xz/1.0/xz-1.0.jar:/data/home/vimos/.m2/repository/org/apache/hadoop/hadoop-hdfs/2.6.0/hadoop-hdfs-2.6.0.jar:/data/home/vimos/.m2/repository/commons-daemon/commons-daemon/1.0.13/commons-daemon-1.0.13.jar:/data/home/vimos/.m2/repository/io/netty/netty/3.6.2.Final/netty-3.6.2.Final.jar:/data/home/vimos/.m2/repository/xerces/xercesImpl/2.9.1/xercesImpl-2.9.1.jar:/data/home/vimos/.m2/repository/xml-apis/xml-apis/1.3.04/xml-apis-1.3.04.jar:/data/home/vimos/.m2/repository/org/apache/hive/hive-storage-api/2.1.0-pre-orc/hive-storage-api-2.1.0-pre-orc.jar:/data/home/vimos/.m2/repository/org/codehaus/jettison/jettison/1.1/jettison-1.1.jar:/data/home/vimos/.m2/repository/stax/stax-api/1.0.1/stax-api-1.0.1.jar:/data/home/vimos/.m2/repository/org/iq80/snappy/snappy/0.2/snappy-0.2.jar:/data/home/vimos/.m2/repository/org/slf4j/slf4j-api/1.7.5/slf4j-api-1.7.5.jar:/data/home/vimos/.m2/repository/com/google/guava/guava/14.0.1/guava-14.0.1.jar:/opt/jetbrains/idea-IU-171.4249.39/lib/idea_rt.jar com.pythonorc.SimplifiedOrcReader
Connected to the target VM, address: '127.0.0.1:32805', transport: 'socket'
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
[log_id, mhotelid, city_id, city_name, city_name_en, province_id, province_name, province_name_en, m_city_id, m_province_id, log_type, uid, source, os_version, card_number, user_name, user_ip, appid, user_agent, trace_id, latitude, longitude, carrier, userinfo_channel, level, model, brand, orderid, proxyid, caller_attr_channel, economic_hotel, fast_filter_keywords, mhotel_ids, return_has_xianfu_hotel, return_has_yufu_hotel, hotel_brand_id, only_limitime_sale, facility_ids, theme_ids, star_rates, district_id, district_type, price_pair, payment_methods, nearby, poi_id, region_id, check_in, check_out, id, executetime, keywords, setkeywords, setbrandid, setstarrates, inner_search_type, hotel_group_id, sorting_method, setfilterattr, mrankflag, mranktype, setnearby, setfastfilter_attr, setpoi_id, sethotel_group_id, response_mhotelids, setprice_pair, star_ratessize, facility_idssize, setdistrict_type, settheme_ids, setdistrict_id, settrace_id, pageindex, pagesize, recreqattrtype, ifun, crawled_flag, geo_type, activity_flag]
80
1149130
Exception in thread "main" java.lang.IllegalArgumentException: Buffer size too small. size = 262144 needed = 3863789
    at org.apache.orc.impl.InStream$CompressedStream.readHeader(InStream.java:217)
    at org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:262)
    at java.io.InputStream.read(InputStream.java:101)
    at com.google.protobuf.CodedInputStream.refillBuffer(CodedInputStream.java:737)
    at com.google.protobuf.CodedInputStream.isAtEnd(CodedInputStream.java:701)
    at com.google.protobuf.CodedInputStream.readTag(CodedInputStream.java:99)
    at org.apache.orc.OrcProto$StripeFooter.<init>(OrcProto.java:10679)
    at org.apache.orc.OrcProto$StripeFooter.<init>(OrcProto.java:10643)
    at org.apache.orc.OrcProto$StripeFooter$1.parsePartialFrom(OrcProto.java:10748)
    at org.apache.orc.OrcProto$StripeFooter$1.parsePartialFrom(OrcProto.java:10743)
    at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:89)
    at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:95)
    at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
    at org.apache.orc.OrcProto$StripeFooter.parseFrom(OrcProto.java:10976)
    at org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.readStripeFooter(RecordReaderUtils.java:165)
    at org.apache.orc.impl.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:236)
    at org.apache.orc.impl.RecordReaderImpl.beginReadStripe(RecordReaderImpl.java:849)
    at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:820)
    at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:977)
    at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1012)
    at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:212)
    at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:579)
    at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:566)
    at com.pythonorc.SimplifiedOrcReader.iterator(SimplifiedOrcReader.java:70)
    at com.pythonorc.SimplifiedOrcReader.main(SimplifiedOrcReader.java:285)
Disconnected from the target VM, address: '127.0.0.1:32805', transport: 'socket'

nqbao commented 7 years ago

That gives me a hint. Let me try to work on it.

nqbao commented 7 years ago

I don't have a sample that reproduces this error. But if I understand correctly, the bufferSize is stored in the footer of the ORC file. Maybe, for some reason, the bufferSize recorded there is incorrect.
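
For context, here is a minimal sketch of where the `needed` value in that message comes from. It assumes the chunk header layout described in the ORC specification: each compressed chunk starts with a 3-byte little-endian header holding `(length << 1) | isOriginal`, and the reader rejects any chunk whose length exceeds the buffer size. The function names are hypothetical and are not part of python-orc.

```python
def decode_chunk_header(header: bytes):
    """Decode a 3-byte ORC compression chunk header.

    Returns (chunk_length, is_original). The low bit flags an
    uncompressed ("original") chunk; the remaining bits are the length.
    """
    if len(header) != 3:
        raise ValueError("ORC chunk header must be exactly 3 bytes")
    value = int.from_bytes(header, "little")
    return value >> 1, bool(value & 1)


def check_chunk_fits(header: bytes, buffer_size: int) -> int:
    """Mimic the reader's sanity check (hypothetical helper)."""
    chunk_length, _ = decode_chunk_header(header)
    if chunk_length > buffer_size:
        raise ValueError(
            "Buffer size too small. size = %d needed = %d"
            % (buffer_size, chunk_length)
        )
    return chunk_length


# Header bytes that decode to the length reported in the Java trace.
header = (3863789 << 1).to_bytes(3, "little")
print(decode_chunk_header(header))  # (3863789, False)
```

Calling `check_chunk_fits(header, 262144)` raises with the same wording as the Java exception, which suggests the reader decoded a chunk header that claims more bytes than the recorded buffer size allows.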

Could you check out the branch add-fetch-filemetainfo, build again, then fetch reader.fileMetaInfo and paste the output back here? A sample output would be:

{u'metadataSize': u'250', u'compressionType': u'ZLIB', u'writerVersion': u'1', u'versionLists': u'[0, 12]', u'bufferSize': u'10000'}

This info will help me to debug further into the problem.

Vimos commented 7 years ago

I used the ORC tools and got this:

➜  src git:(master) ./orc-metadata ../../../../../python-orc/dt=2017-05-14_os=android_part=000004_0
{ "name": "../../../../../python-orc/dt=2017-05-14_os=android_part=000004_0",
  "type": "struct<log_id:string,mhotelid:string,city_id:string,city_name:string,city_name_en:string,province_id:string,province_name:string,province_name_en:string,m_city_id:string,m_province_id:string,log_type:string,uid:string,source:string,os_version:string,card_number:string,user_name:string,user_ip:string,appid:string,user_agent:string,trace_id:string,latitude:string,longitude:string,carrier:string,userinfo_channel:string,level:string,model:string,brand:string,orderid:string,proxyid:string,caller_attr_channel:string,economic_hotel:string,fast_filter_keywords:string,mhotel_ids:string,return_has_xianfu_hotel:string,return_has_yufu_hotel:string,hotel_brand_id:string,only_limitime_sale:string,facility_ids:string,theme_ids:string,star_rates:string,district_id:string,district_type:string,price_pair:string,payment_methods:string,nearby:string,poi_id:string,region_id:string,check_in:string,check_out:string,id:string,executetime:string,keywords:string,setkeywords:string,setbrandid:string,setstarrates:string,inner_search_type:string,hotel_group_id:string,sorting_method:string,setfilterattr:string,mrankflag:string,mranktype:string,setnearby:string,setfastfilter_attr:string,setpoi_id:string,sethotel_group_id:string,response_mhotelids:array<string>,setprice_pair:string,star_ratessize:string,facility_idssize:string,setdistrict_type:string,settheme_ids:string,setdistrict_id:string,settrace_id:string,pageindex:string,pagesize:string,recreqattrtype:string,ifun:string,crawled_flag:string,geo_type:string,activity_flag:string>",
  "rows": 1149130,
  "stripe count": 3,
  "format": "0.12", "writer version": "original",
  "compression": "zlib", "compression block": 262144,
  "file length": 116043721,
  "content": 116041038, "stripe stats": 3599, "footer": 2549, "postscript": 23,
  "row index stride": 10000,
  "user metadata": {
  },
  "stripes": [
    { "stripe": 0, "rows": 575000,
      "offset": 3, "length": 57272824,
      "index": 72291, "data": 57198680, "footer": 1853
    },
    { "stripe": 1, "rows": 510000,
      "offset": 57272827, "length": 51277701,
      "index": 64624, "data": 51211228, "footer": 1849
    },
    { "stripe": 2, "rows": 64130,
      "offset": 108550528, "length": 7490510,
      "index": 12819, "data": 7475943, "footer": 1748
    }
  ]
}

Vimos commented 7 years ago

Using the new branch, I got this.

In [1]: from orcreader import OrcReader
   ...: reader = OrcReader('dt=2017-05-14_os=android_part=000004_0')
   ...: reader.open()
   ...: 

In [2]: reader.fileMetaInfo
Out[2]: {u'metadataSize': u'3599', u'compressionType': u'ZLIB', u'writerVersion': u'0', u'versionLists': u'[0, 12]', u'bufferSize': u'262144'}

nqbao commented 7 years ago

Yeah. By default, the reader uses the blockSize from the metadata, which is "compression block": 262144.

One possible option is to manually override the blockSize. I will work on this later today.
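
A rough sketch of what such an override could look like, written in Python only to illustrate the idea. It is not the python-orc implementation: `iter_chunks`, `read_stream`, and the `override` parameter are hypothetical names, and it assumes ZLIB chunks are raw deflate data preceded by the 3-byte header from the ORC specification.

```python
import zlib

HEADER_SIZE = 3  # bytes in each ORC compression chunk header


def iter_chunks(stream: bytes, buffer_size: int):
    """Yield decompressed chunks from a ZLIB-compressed ORC stream."""
    pos = 0
    while pos < len(stream):
        value = int.from_bytes(stream[pos:pos + HEADER_SIZE], "little")
        pos += HEADER_SIZE
        length, is_original = value >> 1, bool(value & 1)
        if length > buffer_size:
            raise ValueError(
                "Buffer size too small. size = %d needed = %d"
                % (buffer_size, length)
            )
        body = stream[pos:pos + length]
        pos += length
        # "Original" chunks are stored as-is; others are raw deflate.
        yield body if is_original else zlib.decompress(body, -15)


def read_stream(stream: bytes, file_buffer_size: int, override=None) -> bytes:
    """Read a stream, preferring a caller-supplied buffer size.

    `override` is a hypothetical knob: when given, it replaces the
    buffer size recorded in the file metadata.
    """
    buffer_size = file_buffer_size if override is None else override
    return b"".join(iter_chunks(stream, buffer_size))
```

With a design like this, a file whose metadata reports 262144 could still be read by passing a larger value, for example `read_stream(data, 262144, override=4 * 1024 * 1024)`, at the cost of trusting the chunk headers over the recorded size.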