jcrobak / parquet-python

Python implementation of the parquet columnar file format.
Apache License 2.0

error: bad char in struct format in read_plain_int96 #49

Closed mdemoss closed 7 years ago

mdemoss commented 7 years ago

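"<qi" * count produces a format string like <qi<qi<qi, with the byte-order character repeated before every item.

The struct docs indicate that only the first character of the format string can be used to indicate the byte order, size and alignment, so the repeated < is rejected as a bad char.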
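To make the format-string problem concrete, here is a minimal sketch using only the standard struct module. The names count, bad_format and good_format are illustrative and do not come from parquet-python; the point is only that the byte-order prefix may appear once, at the start.

```python
import struct

count = 3  # illustrative value, not taken from the failing file

# Repeating the whole string puts "<" in the middle of the format,
# which struct rejects with "bad char in struct format".
bad_format = "<qi" * count  # "<qi<qi<qi"
try:
    struct.calcsize(bad_format)
    bad_format_accepted = True
except struct.error:
    bad_format_accepted = False

# Writing the byte-order prefix once and repeating only the item codes is valid.
good_format = "<" + "qi" * count  # "<qiqiqi"
good_size = struct.calcsize(good_format)  # 12 bytes per (q, i) pair, no padding with "<"
```

With the "<" prefix struct uses standard sizes and no alignment, so each q/i pair occupies exactly 12 bytes.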

I've tested potential fixes for this, but I suspect the results are incorrect because the values aren't what I expected. Calling an old version of read_plain_int96 multiple times didn't produce the expected values either.

Does anybody have a good test case for this or know what the format ought to be?

Traceback (most recent call last):
  File "/usr/lib64/python2.7/pdb.py", line 1314, in main
    pdb._runscript(mainpyfile)
  File "/usr/lib64/python2.7/pdb.py", line 1233, in _runscript
    self.run(statement)
  File "/usr/lib64/python2.7/bdb.py", line 400, in run
    exec cmd in globals, locals
  File "<string>", line 1, in <module>
  File "transformParquet.py", line 1, in <module>
    import parquet
  File "/home/ec2-user/poll-pull-transform-parquet/pptp/local/lib/python2.7/site-packages/parquet/__init__.py", line 379, in DictReader
    for row in reader(fo, columns):
  File "/home/ec2-user/poll-pull-transform-parquet/pptp/local/lib/python2.7/site-packages/parquet/__init__.py", line 433, in reader
    dict_items = read_dictionary_page(fo, ph, cmd)
  File "/home/ec2-user/poll-pull-transform-parquet/pptp/local/lib/python2.7/site-packages/parquet/__init__.py", line 359, in read_dictionary_page
    page_header.dictionary_page_header.num_values)
  File "/home/ec2-user/poll-pull-transform-parquet/pptp/local/lib/python2.7/site-packages/parquet/encoding.py", line 88, in read_plain
    return conv(fo, count)
  File "/home/ec2-user/poll-pull-transform-parquet/pptp/local/lib/python2.7/site-packages/parquet/encoding.py", line 46, in read_plain_int96
    items = struct.unpack("<qi" * count, fo.read(12) * count)
error: bad char in struct format

jcrobak commented 7 years ago

Thanks for the report. Looks like there was more than one bug in that parsing code. Putting together a fix now.
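For readers following along, one plausible reading of "more than one bug" (an assumption, not confirmed in this thread) is that the failing line both repeats the byte-order character and computes fo.read(12) * count, which reads 12 bytes once and repeats them instead of reading 12 * count bytes. The sketch below is a hypothetical rewrite, not the code merged in #50; the function name read_plain_int96_sketch is invented here, and it returns raw (low 64 bits, high 32 bits) pairs because the thread does not say how the two parts should be combined.

```python
import io
import struct


def read_plain_int96_sketch(fo, count):
    """Hypothetical helper: read `count` little-endian 12-byte values from `fo`.

    Returns a list of (low64, high32) tuples; combining them into a single
    value is left open because the thread does not specify it.
    """
    fmt = "<" + "qi" * count      # byte-order prefix once, item codes repeated
    data = fo.read(12 * count)    # read all values, not the first one repeated
    items = struct.unpack(fmt, data)
    # Pair up consecutive (q, i) results.
    return list(zip(items[0::2], items[1::2]))


# Small self-made test input: two values with distinct, recognizable parts.
sample = struct.pack("<qiqi", 1, 2, 3, 4)
pairs = read_plain_int96_sketch(io.BytesIO(sample), 2)
```

Packing known values and reading them back, as in the last two lines, is also a cheap way to build the kind of test case asked for earlier in the thread.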

jcrobak commented 7 years ago

Should be fixed by #50. Please reopen if you're still seeing issues!