rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] ORC reader incorrect timestamp values (vs pySpark) #4047

Closed aucahuasi closed 4 years ago

aucahuasi commented 4 years ago

Describe the bug: cudf.read_orc does not seem to read the timestamp values correctly from a standard TPCH file.

Steps/Code to reproduce bug

Download this file https://github.com/aucahuasi/tpch-orc-files/blob/master/lineitem_1_0.orc and install pyspark with

conda install --yes -c conda-forge openjdk=8.0 maven pyspark=2.4.3 pytest

Code:

def cudf_orc_reader(tpch_lineitem_orc_file_path):
    import cudf

    df = cudf.read_orc(tpch_lineitem_orc_file_path)
    # sort_values returns a new DataFrame, so keep the result
    df = df.sort_values(by="l_orderkey", ascending=True)
    print(df["l_shipdate"].tail(10))

def pyspark_orc_reader(tpch_lineitem_orc_file_path):
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('OrcFileConverter').getOrCreate()
    orc_df = spark.read.orc(tpch_lineitem_orc_file_path)
    orc_df.createOrReplaceTempView("lineitem")
    q='select l_shipdate from lineitem order by l_orderkey ASC'
    query_result = spark.sql(q)
    df = query_result.toPandas()
    print(df["l_shipdate"].tail(10))

def main():
    tpch_lineitem_orc_file_path = '/home/percy/Blazing/TestingData/100Part2/tpch/dir/lineitem/orc/lineitem_1_0.orc'

    print("CUDF results (tail 10): 'select l_shipdate from lineitem order by l_orderkey ASC'")
    cudf_orc_reader(tpch_lineitem_orc_file_path)

    print("\n-----------------------------\n")

    print("pySpark results (tail 10): 'select l_shipdate from lineitem order by l_orderkey ASC'")
    pyspark_orc_reader(tpch_lineitem_orc_file_path)

if __name__ == "__main__" :
    main()

Expected behavior: The cudf ORC reader should return the same results as pySpark.

Environment overview (please complete the following information)

Environment details Please run and paste the output of the cudf/print_env.sh script here, to gather any other relevant environment details


     ***git***
     commit 68320dd40e1efef11de06bf20d8f419cee115346 (HEAD -> branch-0.13, origin/branch-0.13)
     Merge: 6bed559 702f720
     Author: Jake Hemstad 
     Date:   Mon Feb 3 11:18:31 2020 -0600

     Merge pull request #3880 from karthikeyann/enh-reduction_adapting_aggregation

     [REVIEW] new aggregation infrastructure in column reduction
     ***git submodules***
     b165e1fb11eeea64ccf95053e40f2424312599cc thirdparty/cub (v1.7.1)
     bcd545071c7a5ddb28cb6576afc6399eb1286c43 thirdparty/jitify (heads/cudf)
     cdcda484d0c7db114ea29c3b33429de5756ecfd8 thirdparty/libcudacxx (0.8.1-99-gcdcda48)
     a97a7380c76346c22bb67b93695bed19592afad2 thirdparty/libcudacxx/libcxx (heads/rapidsai-interop)

     ***OS Information***
     DISTRIB_ID=Ubuntu
     DISTRIB_RELEASE=16.04
     DISTRIB_CODENAME=xenial
     DISTRIB_DESCRIPTION="Ubuntu 16.04.4 LTS"
     NAME="Ubuntu"
     VERSION="16.04.4 LTS (Xenial Xerus)"
     ID=ubuntu
     ID_LIKE=debian
     PRETTY_NAME="Ubuntu 16.04.4 LTS"
     VERSION_ID="16.04"
     HOME_URL="http://www.ubuntu.com/"
     SUPPORT_URL="http://help.ubuntu.com/"
     BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
     VERSION_CODENAME=xenial
     UBUNTU_CODENAME=xenial
     Linux pctabz 4.15.0-74-generic #83~16.04.1-Ubuntu SMP Wed Dec 18 04:56:23 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

     ***GPU Information***
     Mon Feb  3 15:47:22 2020
     +-----------------------------------------------------------------------------+
     | NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
     |-------------------------------+----------------------+----------------------+
     | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
     | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
     |===============================+======================+======================|
     |   0  GeForce GTX 105...  Off  | 00000000:01:00.0 Off |                  N/A |
     | N/A   45C    P0    N/A /  N/A |      0MiB /  4042MiB |      0%      Default |
     +-------------------------------+----------------------+----------------------+

     +-----------------------------------------------------------------------------+
     | Processes:                                                       GPU Memory |
     |  GPU       PID   Type   Process name                             Usage      |
     |=============================================================================|
     |  No running processes found                                                 |
     +-----------------------------------------------------------------------------+

     ***CPU***
     Architecture:          x86_64
     CPU op-mode(s):        32-bit, 64-bit
     Byte Order:            Little Endian
     CPU(s):                8
     On-line CPU(s) list:   0-7
     Thread(s) per core:    2
     Core(s) per socket:    4
     Socket(s):             1
     NUMA node(s):          1
     Vendor ID:             GenuineIntel
     CPU family:            6
     Model:                 158
     Model name:            Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
     Stepping:              9
     CPU MHz:               3473.664
     CPU max MHz:           3800,0000
     CPU min MHz:           800,0000
     BogoMIPS:              5616.00
     Virtualization:        VT-x
     L1d cache:             32K
     L1i cache:             32K
     L2 cache:              256K
     L3 cache:              6144K
     NUMA node0 CPU(s):     0-7
     Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d

     ***CMake***
     /home/percy/Applications/anaconda/conda/envs/new-conda-env-0-13/bin/cmake
     cmake version 3.16.3

     CMake suite maintained and supported by Kitware (kitware.com/cmake).

     ***g++***
     /usr/bin/g++
     g++ (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
     Copyright (C) 2015 Free Software Foundation, Inc.
     This is free software; see the source for copying conditions.  There is NO
     warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

     ***nvcc***

     ***Python***
     /home/percy/Applications/anaconda/conda/envs/new-conda-env-0-13/bin/python
     Python 3.7.6

     ***Environment Variables***
     PATH                            : /home/percy/Applications/anaconda/conda/envs/new-conda-env-0-13/bin:/home/percy/Applications/gcloud/google-cloud-sdk/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin:/home/percy/Applications/docker-compose/current:/home/percy/Applications/kubectl:/home/percy/Applications/minikube:/home/percy/Applications/ctop:/home/percy/Applications/anaconda/conda/bin
     LD_LIBRARY_PATH                 : :/home/percy/Applications/anaconda/conda/envs/new-conda-env-0-13/lib
     NUMBAPRO_NVVM                   :
     NUMBAPRO_LIBDEVICE              :
     CONDA_PREFIX                    : /home/percy/Applications/anaconda/conda/envs/new-conda-env-0-13
     PYTHON_PATH                     :

     ***conda packages***
     /home/percy/Applications/anaconda/conda/bin/conda
     # packages in environment at /home/percy/Applications/anaconda/conda/envs/new-conda-env-0-13:
     #
     # Name                    Version                   Build  Channel
     _libgcc_mutex             0.1                        main
     arrow-cpp                 0.15.0           py37h090bef1_2    conda-forge
     attrs                     19.3.0                     py_0    conda-forge
     blazingsql                0.6                       dev_0    
     bokeh                     1.4.0                    py37_0    conda-forge
     boost-cpp                 1.70.0               h8e57a91_2    conda-forge
     brotli                    1.0.7             he1b5a44_1000    conda-forge
     bsql-engine               0.6                      pypi_0    pypi
     bsql-rapids-thirdparty    0.12.0a                       0    blazingsql-nightly
     bsql-toolchain            0.12.0a                       0    blazingsql-nightly
     bsql-toolchain-aws-cpp    0.12.0a                       0    blazingsql-nightly
     bsql-toolchain-gcp-cpp    0.12.0a                       0    blazingsql-nightly
     bzip2                     1.0.8                h516909a_2    conda-forge
     c-ares                    1.15.0            h516909a_1001    conda-forge
     ca-certificates           2019.11.28           hecc5488_0    conda-forge
     certifi                   2019.11.28               py37_0    conda-forge
     cffi                      1.13.2           py37h8022711_0    conda-forge
     chardet                   3.0.4                 py37_1003    conda-forge
     click                     7.0                        py_0    conda-forge
     cloudpickle               1.2.2                      py_1    conda-forge
     cmake                     3.16.3               h28c56e5_0    conda-forge
     cppzmq                    4.4.1                hc9558a2_0    conda-forge
     cryptography              2.8              py37h72c5cf5_1    conda-forge
     cudatoolkit               10.0.130                      0    nvidia
     cudf                      0.13.0a200203          py37_962    rapidsai-nightly
     cudnn                     7.6.0                cuda10.0_0    nvidia
     cupy                      7.1.1            py37he57b8b9_1    conda-forge
     curl                      7.65.3               hf8cf82a_0    conda-forge
     cyrus-sasl                2.1.27               he38ecfd_0    conda-forge
     cython                    0.29.14          py37he1b5a44_0    conda-forge
     cytoolz                   0.10.1           py37h516909a_0    conda-forge
     dask                      2.10.1                     py_0    conda-forge
     dask-core                 2.10.1                     py_0    conda-forge
     dask-cuda                 0.13.0a200203           py37_37    rapidsai-nightly
     dask-cudf                 0.13.0a200203          py37_962    rapidsai-nightly
     distributed               2.10.0                     py_0    conda-forge
     dlpack                    0.2                  he1b5a44_1    conda-forge
     double-conversion         3.1.5                he1b5a44_2    conda-forge
     et-xmlfile                1.0.1                    pypi_0    pypi
     expat                     2.2.9                he1b5a44_2    conda-forge
     fastavro                  0.22.9           py37h516909a_0    conda-forge
     fastrlock                 0.4             py37he1b5a44_1000    conda-forge
     freetype                  2.10.0               he983fc9_1    conda-forge
     fsspec                    0.6.2                      py_0    conda-forge
     future                    0.18.2                   py37_0    conda-forge
     gflags                    2.2.2             he1b5a44_1002    conda-forge
     gitdb2                    2.0.6                    pypi_0    pypi
     gitpython                 3.0.5                    pypi_0    pypi
     glog                      0.4.0                he1b5a44_1    conda-forge
     gmock                     1.10.0                        1    conda-forge
     grpc-cpp                  1.23.0               h18db393_0    conda-forge
     gtest                     1.10.0               hc9558a2_1    conda-forge
     heapdict                  1.0.1                      py_0    conda-forge
     icu                       64.2                 he1b5a44_1    conda-forge
     idna                      2.8                   py37_1000    conda-forge
     importlib_metadata        1.5.0                    py37_0    conda-forge
     inflect                   4.0.0                    py37_1    conda-forge
     jaraco.itertools          5.0.0                      py_0    conda-forge
     jdcal                     1.4.1                    pypi_0    pypi
     jinja2                    2.11.1                     py_0    conda-forge
     jpeg                      9c                h14c3975_1001    conda-forge
     jpype1                    0.7              py37h9de70de_0    conda-forge
     krb5                      1.16.4               h173b8e3_0
     ld_impl_linux-64          2.33.1               h53a641e_8    conda-forge
     libblas                   3.8.0               14_openblas    conda-forge
     libcblas                  3.8.0               14_openblas    conda-forge
     libcudf                   0.13.0a200203      cuda10.0_962    rapidsai-nightly
     libcurl                   7.65.3               hda55be3_0    conda-forge
     libedit                   3.1.20181209         hc058e9b_0
     libevent                  2.1.10               h72c5cf5_0    conda-forge
     libffi                    3.2.1                hd88cf55_4
     libgcc-ng                 9.1.0                hdf63c60_0
     libgfortran-ng            7.3.0                hdf63c60_5    conda-forge
     liblapack                 3.8.0               14_openblas    conda-forge
     libllvm8                  8.0.1                hc9558a2_0    conda-forge
     libntlm                   1.4               h14c3975_1002    conda-forge
     libnvstrings              0.13.0a200203      cuda10.0_962    rapidsai-nightly
     libopenblas               0.3.7                h5ec1e0e_6    conda-forge
     libpng                    1.6.37               hed695b0_0    conda-forge
     libprotobuf               3.8.0                h8b12597_0    conda-forge
     librmm                    0.13.0a200203      cuda10.0_104    rapidsai-nightly
     libsodium                 1.0.17               h516909a_0    conda-forge
     libssh2                   1.8.2                h22169c7_2    conda-forge
     libstdcxx-ng              9.1.0                hdf63c60_0
     libtiff                   4.1.0                hfc65ed5_0    conda-forge
     libuv                     1.34.0               h516909a_0    conda-forge
     llvmlite                  0.31.0           py37h8b12597_0    conda-forge
     locket                    0.2.0                      py_2    conda-forge
     lz4-c                     1.8.3             he1b5a44_1001    conda-forge
     markupsafe                1.1.1            py37h516909a_0    conda-forge
     maven                     3.6.0                         0    conda-forge
     more-itertools            8.2.0                      py_0    conda-forge
     msgpack-python            0.6.2            py37hc9558a2_0    conda-forge
     nccl                      2.4.6.1              cuda10.0_0    nvidia
     ncurses                   6.1                  he6710b0_1
     netifaces                 0.10.9          py37h516909a_1000    conda-forge
     numba                     0.48.0           py37hb3f55d8_0    conda-forge
     numpy                     1.17.5           py37h95a1406_0    conda-forge
     nvstrings                 0.13.0a200203          py37_962    rapidsai-nightly
     olefile                   0.46                       py_0    conda-forge
     openjdk                   8.0.192           h516909a_1004    conda-forge
     openpyxl                  3.0.3                    pypi_0    pypi
     openssl                   1.1.1d               h516909a_0    conda-forge
     packaging                 20.1                       py_0    conda-forge
     pandas                    0.25.3           py37hb3f55d8_0    conda-forge
     parquet-cpp               1.5.1                         2    conda-forge
     partd                     1.1.0                      py_0    conda-forge
     pillow                    5.3.0           py37h00a061d_1000    conda-forge
     pip                       20.0.2                   py37_1
     pluggy                    0.13.0                   py37_0    conda-forge
     psutil                    5.6.7            py37h516909a_0    conda-forge
     py                        1.8.1                      py_0    conda-forge
     py4j                      0.10.7                     py_1    conda-forge
     pyarrow                   0.15.0           py37h8b68381_1    conda-forge
     pycparser                 2.19                     py37_1    conda-forge
     pydrill                   0.3.4                    pypi_0    pypi
     pyhive                    0.6.1                    py37_0
     pymysql                   0.9.3                    pypi_0    pypi
     pynvml                    8.0.4                      py_0    conda-forge
     pyopenssl                 19.1.0                   py37_0    conda-forge
     pyparsing                 2.4.6                      py_0    conda-forge
     pysocks                   1.7.1                    py37_0    conda-forge
     pyspark                   2.4.3                      py_0    conda-forge
     pytest                    5.3.5                    py37_0    conda-forge
     python                    3.7.6                h0371630_2
     python-dateutil           2.8.1                      py_0    conda-forge
     pytz                      2019.3                     py_0    conda-forge
     pyyaml                    5.3              py37h516909a_0    conda-forge
     rapidjson                 1.1.0             he1b5a44_1002    conda-forge
     re2                       2020.01.01           he1b5a44_0    conda-forge
     readline                  7.0                  h7b6447c_5
     requests                  2.22.0                   py37_1    conda-forge
     rhash                     1.3.6             h14c3975_1001    conda-forge
     rmm                       0.13.0a200203          py37_104    rapidsai-nightly
     sasl                      0.2.1           py37he1b5a44_1001    conda-forge
     setuptools                45.1.0                   py37_0
     six                       1.14.0                   py37_0    conda-forge
     smmap2                    2.0.5                    pypi_0    pypi
     snappy                    1.1.7             he1b5a44_1003    conda-forge
     sortedcontainers          2.1.0                      py_0    conda-forge
     sqlalchemy                1.3.13           py37h516909a_0    conda-forge
     sqlite                    3.30.1               h7b6447c_0
     tblib                     1.6.0                      py_0    conda-forge
     thrift                    0.11.0          py37he1b5a44_1001    conda-forge
     thrift-cpp                0.12.0            hf3afdfd_1004    conda-forge
     thrift_sasl               0.3.0           py37h516909a_1001    conda-forge
     tk                        8.6.8                hbc83047_0
     toolz                     0.10.0                     py_0    conda-forge
     tornado                   6.0.3            py37h516909a_0    conda-forge
     uriparser                 0.9.3                he1b5a44_1    conda-forge
     urllib3                   1.25.7                   py37_0    conda-forge
     wcwidth                   0.1.8                      py_0    conda-forge
     wheel                     0.34.1                   py37_0
     xz                        5.2.4                h14c3975_4
     yaml                      0.2.2                h516909a_1    conda-forge
     zeromq                    4.3.2                he1b5a44_2    conda-forge
     zict                      1.0.0                      py_0    conda-forge
     zipp                      2.1.0                      py_0    conda-forge
     zlib                      1.2.11               h7b6447c_3
     zstd                      1.4.3                h3b9ef0a_0    conda-forge


aucahuasi commented 4 years ago

Here are the results of the script:

CUDF results (tail 10): 'select l_shipdate from lineitem order by l_orderkey ASC'
300748   1997-06-18 19:00:00
300749   1997-04-27 19:00:00
300750   1997-05-18 19:00:00
300751   1997-06-06 19:00:00
300752   1997-04-28 19:00:00
300753   1997-06-07 19:00:00
300754   1997-06-28 19:00:00
300755   1998-06-19 19:00:00
300756   1993-09-02 19:00:00
300757   1993-07-12 19:00:00
Name: l_shipdate, dtype: datetime64[ns]

-----------------------------

pySpark results (tail 10): 'select l_shipdate from lineitem order by l_orderkey ASC'
300748   1995-08-29 19:00:00
300749   1995-07-06 19:00:00
300750   1995-08-14 19:00:00
300751   1995-06-10 19:00:00
300752   1997-05-22 19:00:00
300753   1997-05-31 19:00:00
300754   1997-07-22 19:00:00
300755   1997-05-07 19:00:00
300756   1998-05-09 19:00:00
300757   1998-04-12 19:00:00
Name: l_shipdate, dtype: datetime64[ns]

OlivierNV commented 4 years ago

I can reproduce this by comparing against pyarrow: the output matches for the first 300000 of 300758 rows, but the last rowgroup, which starts at row 300000, begins with all-zero values (which correspond to the date 2015-01-01 in ORC). If I set the use_index parameter to False (the default is True), all the data matches, so it's definitely index-related in the ORC reader (presumably something to do with the last index entry).
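The check described above can be sketched in plain Python. This is only an illustration, not the reader's actual logic: `first_mismatch`, `locate_rowgroup`, and `decode_orc_seconds` are hypothetical helpers, the data is a synthetic stand-in for the `l_shipdate` column, and the stride of 10,000 rows is the ORC default, assumed here because it is consistent with a rowgroup starting at row 300000. It also assumes rowgroup boundaries fall on multiples of the stride counted from the start of the file, and it ignores time zones when decoding.

```python
from datetime import datetime, timedelta
from itertools import zip_longest

# ORC stores timestamp seconds relative to this base, so a raw value of 0
# decodes to 2015-01-01 (time zone handling is ignored in this sketch).
ORC_TIMESTAMP_BASE = datetime(2015, 1, 1)

# Assumption: the file uses the ORC default row-index stride of 10,000 rows.
ASSUMED_STRIDE = 10_000

_MISSING = object()  # marks a row that exists in only one of the inputs


def decode_orc_seconds(raw_seconds):
    """Convert raw ORC timestamp seconds into a naive datetime."""
    return ORC_TIMESTAMP_BASE + timedelta(seconds=raw_seconds)


def first_mismatch(expected, actual):
    """Return the index of the first differing row, or None if all rows match."""
    pairs = zip_longest(expected, actual, fillvalue=_MISSING)
    for row, (exp, act) in enumerate(pairs):
        if exp != act:
            return row
    return None


def locate_rowgroup(row, stride=ASSUMED_STRIDE):
    """Return (rowgroup number, first row of that rowgroup) for a row index."""
    group = row // stride
    return group, group * stride


if __name__ == "__main__":
    # Synthetic stand-in: 300758 rows where everything from row 300000
    # onward was read back as zero, mirroring the symptom in the comment.
    total_rows, good_rows = 300_758, 300_000
    expected = list(range(1, total_rows + 1))
    actual = expected[:good_rows] + [0] * (total_rows - good_rows)

    row = first_mismatch(expected, actual)
    print("first mismatching row:", row)
    print("rowgroup, rowgroup start:", locate_rowgroup(row))
    print("raw zero decodes to:", decode_orc_seconds(0))
```

To run this against the real file, the two lists would come from the readers themselves, for example the `l_shipdate` column as read by pyarrow and by `cudf.read_orc`, each converted to a Python list; repeating the comparison with `use_index=False` should then report no mismatch if the diagnosis above holds.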