jcrobak / parquet-python

python implementation of the parquet columnar file format.
Apache License 2.0

parquet file has null value cause traceback #52

Closed jinlianch closed 7 years ago

jinlianch commented 7 years ago

When I try to read data from a parquet file that contains null values for some keys, I get the error below.

Traceback (most recent call last):
  File "tt.py", line 12, in <module>
    for r in parquet.DictReader(fo):
  File "/usr/local/lib/python2.7/site-packages/parquet/__init__.py", line 420, in DictReader
    for row in reader(fo, columns):
  File "/usr/local/lib/python2.7/site-packages/parquet/__init__.py", line 467, in reader
    dict_items)
  File "/usr/local/lib/python2.7/site-packages/parquet/__init__.py", line 380, in read_data_page
    dict_values_io_obj, bit_width, len(dict_values_bytes))
  File "/usr/local/lib/python2.7/site-packages/parquet/encoding.py", line 227, in read_rle_bit_packed_hybrid
    res += read_bitpacked(io_obj, header, width, debug_logging)
  File "/usr/local/lib/python2.7/site-packages/parquet/encoding.py", line 146, in read_bitpacked
    b = raw_bytes[current_byte]
IndexError: list index out of range
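For readers unfamiliar with bit-packed decoding, here is a minimal sketch of how this kind of IndexError can arise in general. It is not the code in parquet/encoding.py, and it does not diagnose the cause of this issue. The function name `unpack_bits` and its parameters are hypothetical. It assumes values are packed least-significant bit first, as in Parquet's bit-packed runs, and shows that asking for more values than the buffer holds fails when indexing the byte buffer.

```python
def unpack_bits(raw_bytes, width, count):
    """Unpack `count` values of `width` bits each, least-significant bit first.

    Hypothetical helper for illustration only; not parquet-python's API.
    """
    values = []
    for i in range(count):
        value = 0
        for bit in range(width):
            pos = i * width + bit          # absolute bit position in the stream
            byte = raw_bytes[pos // 8]     # raises IndexError if the buffer is too short
            value |= ((byte >> (pos % 8)) & 1) << bit
        values.append(value)
    return values


# The values 0..7 packed with a bit width of 3 occupy exactly three bytes.
packed = [0x88, 0xC6, 0xFA]
print(unpack_bits(packed, 3, 8))

# Dropping the last byte while still asking for 8 values overruns the buffer.
try:
    unpack_bits(packed[:2], 3, 8)
except IndexError as exc:
    print("buffer too short:", exc)
```

In this sketch the overrun comes from a mismatch between the number of values requested and the bytes available. Whether that is what happened with the file in this issue is not established here.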

jcrobak commented 7 years ago

Hi @jinlianch, this should work and is a bug. Are you able to share a file with which I can reproduce the issue?

Thanks!

jinlianch commented 7 years ago

part-00000-aac1e753-02f7-447e-bbda-d80626611b39.snappy.parquet.zip

Test code:

import json
import parquet

# Open in binary mode: parquet files are not text.
with open('part-00000-aac1e753-02f7-447e-bbda-d80626611b39.snappy.parquet', 'rb') as fo:
    for r in parquet.DictReader(fo):
        print(json.dumps(r))

jinlianch commented 7 years ago

I tested another file and it works. I don't know whether something is wrong with this file, but it reads fine in Spark.

jcrobak commented 7 years ago

I'm not able to reproduce this with the latest release. Please let me know if you're still able to reproduce.
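To confirm which release is installed before retesting, something like the following may help. It is a sketch that assumes Python 3.8 or later (for `importlib.metadata`) and assumes the package is installed under the distribution name `parquet`; the helper `installed_version` is a hypothetical name, not part of the library.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string for a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# "parquet" is the assumed distribution name; adjust if it was installed differently.
print(installed_version("parquet"))
```

If this prints an older version than the latest release, upgrading and rerunning the test script would show whether the traceback still occurs.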