rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0
8.44k stars 903 forks source link

[BUG] Bools written by cuIO ORC writer don't match when read by pyarrow/pyorc #6763

Open devavret opened 4 years ago

devavret commented 4 years ago

When writing a large dataframe with bool column using cuIO ORC writer, the result of reading the file back using pyarrow does not match the input dataframe. However when reading back from cudf's ORC reader it matches.

import pandas as pd
import numpy as np
import pyarrow as pa
import pyorc
import cudf

np.random.seed(0)
from cudf._lib.null_mask import bitmask_allocation_size_bytes

def random_bitmask(size):
    sz = bitmask_allocation_size_bytes(size)
    data = np.random.randint(0, 255, dtype="u1", size=sz)
    return data.view("i1")

size = 6000000
arr = np.random.randint(low=0, high=2, size=size).astype(np.bool)
s = cudf.Series.from_masked_array(arr, random_bitmask(size))
gdf = cudf.DataFrame({"col_bool": s})

# write with cuIO
fname = "brokenbool.orc"
gdf.to_orc(fname)

# read with pyarrow
pdf = pa.orc.ORCFile(fname).read().to_pandas()

# the sum doesn't match
print(gdf.col_bool.sum(), pdf.col_bool.sum())

# read with pyorc
file = open(fname, 'rb')
data = pyorc.Reader(file).read()
pdf = pd.DataFrame(data, columns=["col_bool"])

# sum matches pyarrow but not original df
print(gdf.col_bool.sum(), pdf.col_bool.sum())

# reading with cuIO gives the correct result
print(gdf.col_bool.sum(), cudf.read_orc(fname).col_bool.sum())

Note that this doesn't occur when there are no nulls in the input.

vuule commented 3 years ago

Based on the offline discussion with @rgsl888prabhu , this is potentially a Pyarrow issue, as they don't handle the way our writer splits boolean data streams into stripes. Keeping the priority for now.

rgsl888prabhu commented 3 years ago

For the reference, https://issues.apache.org/jira/browse/ARROW-10635

And assumption in cudf ORC writer https://github.com/rapidsai/cudf/blob/01b8b5c5d0735b5a1c1df4e967fc929b337a9926/cpp/src/io/orc/orc.cpp#L210

vuule commented 3 years ago

We got confirmation that the issue also repros with Spark reader, so treating this as a cuIO bug (not Pyarrow bug).

rgsl888prabhu commented 3 years ago

bool_pq.zip

code to reproduce

import cudf
import pandas as pd
df = cudf.read_parquet("bool_pq.parquet")
df.to_orc("broken_bool.orc")
pdf = pd.read_orc("broken_bool.orc")
gdf = cudf.read_orc("broken_bool.orc")
# test pandas and cudf orc read
pdf.dropna()[gdf.dropna().to_pandas()['col_bool'] != pdf.dropna()['col_bool']]
# Compare parquet and orc cudf read
gdf.dropna()[df.dropna().to_pandas()['col_bool'] != gdf.dropna().to_pandas()['col_bool']]
vuule commented 3 years ago

Root cause: ORC encodes bools as bits, where null values are omitted from the data stream. Row groups have 10k elements, so when there are no nulls they fill 1250 bytes completely. When nulls are present, the last byte might be incomplete. Other readers (pyarrow, Spark) expect all bits in encoded column to be valid (with the exception of last byte in the stripe).

Thus, we need to encode bool values from the next row group into the incomplete byte and set the next row group starting offset to the correct bit within the data encoded as part of the current row group. This offsets the encoding of the next row groups and the effect ripples over the entire stripe. Significant changes are needed to the current implementation to be able to support this.

vuule commented 3 years ago

Suggested approach:

  1. Gather row ranges that need to be encoded together (new step/code):
    1. Gather null counts per row group.
    2. Apply thrust::inclusive_scan on the null counts.
    3. Modulo 8 each element to get the number of bits that this row group borrow from the next one (as part of inclusive_scan?).
    4. Find the row number of the last row each row group encoder needs to encode (might be larger than the number of borrowed bits, as nulls are not encoded in the data stream) - thread per row group.
  2. Encode kernel changes:
    1. Pass the information about bool row ranges to the encode kernel.
    2. Encode the bool column rows in the range, instead of rows in the row group.
    3. Save the location of the last batch of encoded (Byte RLE) values .
  3. Row group offset computation: use information about last encoded batch and the number of borrowed bits.
github-actions[bot] commented 3 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] commented 3 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.