noirello / pyorc

Python module for Apache ORC file format
Apache License 2.0

Segmentation fault caused on reading files [negative scenario] #24

Closed: pyjauser closed this issue 3 years ago

pyjauser commented 3 years ago

Hi,

I tried a few different Python ORC packages and found pyorc to be the best. I was testing pyorc against both the positive and the negative scenario files from the Apache ORC project. For the negative scenario files I expected an appropriate exception message, but pyorc causes a segmentation fault instead. Could you please look into this?

import pyorc

with open("./missing_blob_stream_in_string_dict.orc", "rb") as data:
    reader = pyorc.Reader(data)
    for row in reader:
        print(row)

Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)
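
For readers unfamiliar with the exit code above: shells and IDEs conventionally report a process killed by signal N as exit code 128 + N, so 139 corresponds to signal 11. A minimal sketch of that decoding follows; the helper name `describe_exit_code` is hypothetical and not part of pyorc.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Describe a shell-style exit code (hypothetical helper).

    By convention, codes above 128 mean the process was killed
    by signal number (code - 128).
    """
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited with status {code}"


print(describe_exit_code(139))
```

The crash happens in native code, so no Python exception is raised and a `try`/`except` around the read loop cannot catch it.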

The files below cause a segmentation fault when read by pyorc. Some details taken from the Apache ORC project that may help:

missing_blob_stream_in_string_dict.orc => ORC-591 => [C++] Check missing blob stream for StringDictionaryColumnRe…

missing_length_stream_in_string_dict.orc => ORC-590 => [C++] added check for missing LENGTH stream in StringDiction…

negative_dict_entry_lengths.orc => ORC-589 => [C++] add checks about negative dictionary entry lengths

stripe_footer_bad_column_encodings.orc => ORC-580 => [C++] Verify ColumnEncodings in StripeFooter (#463)
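
Because a segmentation fault takes down the whole interpreter, one way to check all four files in a single run is to read each one in a child process and inspect its return code. The sketch below assumes a POSIX system (where `subprocess` reports death by signal as a negative return code) and that the files sit in the current directory; `CHILD`, `classify` and `probe` are hypothetical names, and only `pyorc.Reader` comes from the snippet above.

```python
import subprocess
import sys

# Child program: read every row of the ORC file passed as argv[1].
CHILD = """
import sys
import pyorc

with open(sys.argv[1], "rb") as data:
    for _ in pyorc.Reader(data):
        pass
"""


def classify(returncode: int) -> str:
    """Map a subprocess return code to a short verdict."""
    if returncode == 0:
        return "read ok"
    if returncode < 0:
        # POSIX only: a negative value means the child was killed by that signal.
        return f"crashed (signal {-returncode})"
    # An uncaught Python exception makes the interpreter exit with a non-zero status.
    return "raised an exception"


def probe(path: str, child: str = CHILD) -> str:
    """Run the child program on one file and classify the outcome."""
    result = subprocess.run(
        [sys.executable, "-c", child, path],
        capture_output=True,
        text=True,
    )
    return classify(result.returncode)


if __name__ == "__main__":
    files = [
        "missing_blob_stream_in_string_dict.orc",
        "missing_length_stream_in_string_dict.orc",
        "negative_dict_entry_lengths.orc",
        "stripe_footer_bad_column_encodings.orc",
    ]
    for name in files:
        print(name, "=>", probe(name))
```

With this layout a crash in one file does not stop the check of the remaining ones, and the same script can be rerun after rebuilding pyorc against a newer ORC C++ Core to see whether the crashes have turned into exceptions.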

noirello commented 3 years ago

Hi,

It looks like all of these fixes will be in the upcoming 1.7.0 version of the ORC C++ Core. (Even ORC-589: the ticket says it's fixed in 1.6.3, but it's not patched in the latest 1.6.5.)

pyjauser commented 3 years ago

Thank you for looking into this. After building pyorc with Apache's latest orc-master branch, the expected exceptions were raised for the files mentioned.