Open devavret opened 4 years ago
Based on the offline discussion with @rgsl888prabhu, this is potentially a Pyarrow issue, as Pyarrow does not handle the way our writer splits boolean data streams within stripes. Keeping the priority for now.
For reference: https://issues.apache.org/jira/browse/ARROW-10635
The assumption in the cudf ORC writer: https://github.com/rapidsai/cudf/blob/01b8b5c5d0735b5a1c1df4e967fc929b337a9926/cpp/src/io/orc/orc.cpp#L210
We got confirmation that the issue also reproduces with the Spark reader, so we are treating this as a cuIO bug (not a Pyarrow bug).
Code to reproduce:
import cudf
import pandas as pd
df = cudf.read_parquet("bool_pq.parquet")
df.to_orc("broken_bool.orc")
pdf = pd.read_orc("broken_bool.orc")
gdf = cudf.read_orc("broken_bool.orc")
# Compare the pandas and cudf ORC reads; any rows shown are mismatches
pdf.dropna()[gdf.dropna().to_pandas()['col_bool'] != pdf.dropna()['col_bool']]
# Compare the cudf Parquet and cudf ORC reads; any rows shown are mismatches
gdf.dropna()[df.dropna().to_pandas()['col_bool'] != gdf.dropna().to_pandas()['col_bool']]
Root cause: ORC encodes bools as bits, and null values are omitted from the data stream. Row groups have 10k elements, so when there are no nulls they fill 1250 bytes completely. When nulls are present, the last byte of a row group might be incomplete. Other readers (pyarrow, Spark) expect all bits in the encoded column to be valid (with the exception of the last byte in the stripe).
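To make the byte arithmetic concrete, here is a minimal pure-Python sketch of the bit packing described above. It is not the cudf implementation: `pack_bools` and `ROW_GROUP` are hypothetical names, nulls are modeled as `None`, bits are packed most-significant first, and the byte run-length encoding that ORC applies afterwards is left out.

```python
ROW_GROUP = 10_000  # row group size mentioned above


def pack_bools(values):
    """Pack non-null booleans into bytes, MSB first; nulls (None) are skipped."""
    valid = [v for v in values if v is not None]
    out = bytearray((len(valid) + 7) // 8)
    for i, v in enumerate(valid):
        if v:
            out[i // 8] |= 0x80 >> (i % 8)
    return bytes(out), len(valid)


# No nulls: 10,000 bits fill exactly 1250 bytes.
full_bytes, full_count = pack_bools([True] * ROW_GROUP)

# Three nulls: 9,997 bits still need 1250 bytes, but the last byte
# carries only 5 valid bits followed by 3 padding bits.
part_bytes, part_count = pack_bools([True] * (ROW_GROUP - 3) + [None] * 3)

print(len(full_bytes), full_count % 8)
print(len(part_bytes), part_count % 8, bin(part_bytes[-1]))
```

The three padding bits at the end of the second row group are what the other readers interpret as real values when the next row group starts on a fresh byte.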
Thus, we need to encode bool values from the next row group into the incomplete byte and set the next row group's starting offset to the correct bit within the data encoded as part of the current row group. This offsets the encoding of the following row groups, and the effect ripples over the entire stripe. Significant changes are needed to the current implementation to be able to support this.
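The ripple effect can be illustrated with a prefix sum over the number of valid (non-null) values per row group. The sketch below is an assumption-laden illustration rather than the writer's code: `row_group_bit_offsets` is a hypothetical helper, the valid counts are made-up example data, and `itertools.accumulate` stands in for a GPU scan.

```python
from itertools import accumulate


def row_group_bit_offsets(valid_counts):
    """Return (byte, bit) start positions of each row group in a stripe's bit stream."""
    # Exclusive prefix sum: each row group starts where the previous ones end.
    starts = [0] + list(accumulate(valid_counts))[:-1]
    return [(s // 8, s % 8) for s in starts]


# Hypothetical stripe: valid count = 10,000 rows minus the null count per row group.
valid_counts = [9997, 10000, 9990, 10000]
offsets = row_group_bit_offsets(valid_counts)
for group, (byte, bit) in enumerate(offsets):
    print(f"row group {group}: byte {byte}, bit {bit}")
```

Note that the third row group starts at bit 5 even though the second one contains no nulls: a single incomplete byte early in the stripe shifts every row group after it.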
Suggested approach:
… thrust::inclusive_scan on the null counts.
… inclusive_scan?).
When writing a large dataframe with a bool column using the cuIO ORC writer, the result of reading the file back using pyarrow does not match the input dataframe. However, when the file is read back with cudf's ORC reader, it matches.
Note that this doesn't occur when there are no nulls in the input.