aucahuasi closed this issue 4 years ago
Here are the results of the script:
CUDF results (tail 10): 'select l_shipdate from lineitem order by l_orderkey ASC'
300748 1997-06-18 19:00:00
300749 1997-04-27 19:00:00
300750 1997-05-18 19:00:00
300751 1997-06-06 19:00:00
300752 1997-04-28 19:00:00
300753 1997-06-07 19:00:00
300754 1997-06-28 19:00:00
300755 1998-06-19 19:00:00
300756 1993-09-02 19:00:00
300757 1993-07-12 19:00:00
Name: l_shipdate, dtype: datetime64[ns]
-----------------------------
pySpark results (tail 10): 'select l_shipdate from lineitem order by l_orderkey ASC'
300748 1995-08-29 19:00:00
300749 1995-07-06 19:00:00
300750 1995-08-14 19:00:00
300751 1995-06-10 19:00:00
300752 1997-05-22 19:00:00
300753 1997-05-31 19:00:00
300754 1997-07-22 19:00:00
300755 1997-05-07 19:00:00
300756 1998-05-09 19:00:00
300757 1998-04-12 19:00:00
Name: l_shipdate, dtype: datetime64[ns]
I can reproduce this by comparing against pyarrow: the first 300000 of the 300758 rows match, but the last rowgroup, which starts at row 300000, begins with all-zero values (which correspond to the 2015-01-01 date in ORC). If I set the use_index parameter to False (the default is True), all the data matches, so the problem is definitely index-related in the ORC reader (presumably something to do with the last index entry).
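For anyone who wants to repeat that check, here is a minimal sketch of the comparison. It is not the script used above: the helper names `first_mismatch` and `compare_orc_readers` are made up for illustration, and it assumes that `cudf` and `pyarrow` are installed and that both readers return the column in the same timezone and resolution.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel used to detect sequences of different length


def first_mismatch(left, right):
    """Return the index of the first differing position, or None if equal.

    A length difference counts as a mismatch at the first missing position.
    Note that NaT/NaN values compare unequal to themselves and are reported.
    """
    pairs = zip_longest(left, right, fillvalue=_MISSING)
    for i, (a, b) in enumerate(pairs):
        if a is _MISSING or b is _MISSING or a != b:
            return i
    return None


def compare_orc_readers(path, column="l_shipdate", use_index=True):
    """Hypothetical helper: read one column with cudf and pyarrow and
    return the first row where they disagree (None if they agree)."""
    # Imported lazily so first_mismatch can be used without a GPU.
    import cudf
    import pyarrow.orc as orc

    gpu = cudf.read_orc(path, columns=[column], use_index=use_index)
    cpu = orc.ORCFile(path).read(columns=[column]).to_pandas()
    return first_mismatch(gpu[column].to_pandas().tolist(),
                          cpu[column].tolist())
```

Calling `compare_orc_readers("lineitem_1_0.orc")` once with `use_index=True` and once with `use_index=False` should show whether the first disagreement lands on a rowgroup boundary and whether it disappears when the index is ignored.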
Describe the bug
It seems cudf.read_orc does not read the timestamp values correctly from a standard TPC-H file.
Steps/Code to reproduce bug
Download this file, https://github.com/aucahuasi/tpch-orc-files/blob/master/lineitem_1_0.orc, and install pyspark with:
conda install --yes -c conda-forge openjdk=8.0 maven pyspark=2.4.3 pytest
Code:
Expected behavior
The cudf ORC reader should return the same results as pySpark.