Closed (galipremsagar closed 4 years ago)
Got a minimal repro (AFAICT):
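import cudf
# assert_eq: cudf's test comparison helper (import path depends on the cudf version)
num_rows = 513
# all-null int32 column; only the first and last rows get values
df = cudf.DataFrame({'a':[None]*num_rows}, dtype='int32')
df['a'][0] = 1024*1024*1024  # 2^30
df['a'][num_rows-1] = 1
# round-trip through ORC
df.to_orc('temp.orc')
new_df = cudf.read_orc('temp.orc')
assert_eq(df, new_df)  # fails: the values read back differ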
If the first element is smaller than 2^30, no repro. If there are no more than 512 elements, no repro. Moving the 2^30 value to a different index does not change the outcome (the output differs, but is still incorrect). The value at index 512 does not affect the output at all.
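For anyone who wants to re-check these boundaries, here is a rough sketch that parametrizes the repro. The helper names (`build_column`, `CASES`, `round_trip_matches`) are made up for illustration, the index 100 in the "moved" case is an arbitrary choice, and the "reported" column just restates the observations above. The cudf part needs a GPU and is only run on request.

```python
def build_column(num_rows, big_value, big_index=0):
    """Values for column 'a': nulls everywhere except big_value at
    big_index and 1 in the last row (same shape as the repro)."""
    values = [None] * num_rows
    values[big_index] = big_value
    values[num_rows - 1] = 1
    return values


# (label, num_rows, big_value, big_index, outcome reported in this thread)
CASES = [
    ("original repro", 513, 2**30, 0, "corrupt"),
    ("first value below 2^30", 513, 2**30 - 1, 0, "ok"),
    ("no more than 512 rows", 512, 2**30, 0, "ok"),
    ("2^30 at another index (100 is arbitrary)", 513, 2**30, 100, "corrupt"),
]


def round_trip_matches(values, path="temp.orc"):
    """Write the column with cudf, read it back, compare as Python lists.
    Imported lazily because cudf requires a GPU environment."""
    import cudf

    df = cudf.DataFrame({"a": values}, dtype="int32")
    df.to_orc(path)
    new_df = cudf.read_orc(path)
    return df["a"].to_arrow().to_pylist() == new_df["a"].to_arrow().to_pylist()


def run_all():
    for label, num_rows, big_value, big_index, reported in CASES:
        ok = round_trip_matches(build_column(num_rows, big_value, big_index))
        print(f"{label}: round trip {'ok' if ok else 'corrupt'} (reported: {reported})")
```

Calling `run_all()` on a machine with cudf installed prints one line per case; on a fixed build every case should come back "ok".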
Interestingly, I'm getting data corruption even when reading the minimal repro file with pyarrow. It's different from the cudf output, but still not correct. Have to dig into how the ints are encoded; this looks like a writer bug at this point.
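One piece of arithmetic that may be relevant when looking at the encoding, offered as an observation and not a diagnosis: ORC's integer run-length encoding stores signed values zigzag-encoded, and 2^30 is exactly the point where the zigzag value of a positive int32 first needs 32 bits. The sketch below only demonstrates that arithmetic. The function names are illustrative, and it says nothing about what the cudf writer actually does or about the 512-element threshold.

```python
def zigzag(n: int) -> int:
    """Zigzag map for signed 64-bit ints: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ..."""
    return (n << 1) ^ (n >> 63)


def bits_needed(n: int) -> int:
    """Minimum number of bits to store a non-negative integer."""
    return max(1, n.bit_length())


for value in (2**30 - 1, 2**30):
    encoded = zigzag(value)
    print(value, "->", encoded, "needs", bits_needed(encoded), "bits")
# 2**30 - 1 -> 2147483646, needs 31 bits
# 2**30     -> 2147483648, needs 32 bits
```

If the writer computes a bit width for the encoded values, the 31-to-32-bit transition would be a natural place to look first, though that is an assumption until someone confirms it in the writer code.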
Describe the bug There seems to be data corruption while reading an int column with cudf.read_orc.
Steps/Code to reproduce bug parquet file (compressed to zip for GitHub attachment reasons): temp-int32.zip
Expected behavior The values should be preserved correctly.
Additional context Surfaced while running fuzz tests: #6001