rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] Unable to write `timedelta64[s]` type correctly with parquet writer #13409

Closed (galipremsagar closed this issue 2 months ago)

galipremsagar commented 1 year ago

Describe the bug

The parquet writer writes a `timedelta64[s]` column as a `timedelta64[ms]` column. This happens only for the `timedelta64[s]` dtype, and it causes both the cudf and pyarrow parquet readers to pick up the wrong column type.

Steps/Code to reproduce bug

In [1]: import cudf

In [3]: df = cudf.DataFrame({"seconds": cudf.Series([1234, 3456, 32442], dtype='timedelta64[s]')})

In [4]: df
Out[4]: 
          seconds
0 0 days 00:20:34
1 0 days 00:57:36
2 0 days 09:00:42

In [5]: df.dtypes
Out[5]: 
seconds    timedelta64[s]
dtype: object

In [6]: df.to_parquet("a")

In [7]: cudf.read_parquet("a")
Out[7]: 
          seconds
0 0 days 00:20:34
1 0 days 00:57:36
2 0 days 09:00:42

In [8]: cudf.read_parquet("a").dtypes
Out[8]: 
seconds    timedelta64[ms]               # Should be timedelta64[s]
dtype: object

In [9]: import pyarrow as pa

In [10]: pa.parquet.read_table("a")
Out[10]: 
pyarrow.Table
seconds: time32[ms] not null           # Should be time32[s]
----
seconds: [[00:20:34.000,00:57:36.000,09:00:42.000]]

# If we now try to write & read using pyarrow the dtype stays intact:

In [11]: pa_table = df.to_arrow()

In [12]: pa_table
Out[12]: 
pyarrow.Table
seconds: duration[s]
----
seconds: [[1234,3456,32442]]

In [13]: pa.parquet.write_table(pa_table, "a")

In [15]: pa.parquet.read_table("a")
Out[15]: 
pyarrow.Table
seconds: duration[s]
----
seconds: [[1234,3456,32442]]

In [17]: import pandas as pd

In [18]: pd.read_parquet("a")
Out[18]: 
          seconds
0 0 days 00:20:34
1 0 days 00:57:36
2 0 days 09:00:42

In [19]: pd.read_parquet("a").dtypes
Out[19]: 
seconds    timedelta64[ns]
dtype: object

In [21]: pa.parquet.read_metadata("a").schema
Out[21]: 
<pyarrow._parquet.ParquetSchema object at 0x7fc8e622b0c0>
required group field_id=-1 schema {
  optional int64 field_id=-1 seconds;
}

Expected behavior

All other timedelta resolutions (ns, ms, us) are written correctly; the problem is seen only with s. Once the writer writes `timedelta64[s]` correctly, this type should round-trip as well.
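The round-trip expectation can be checked per resolution with a small helper. This is only a sketch: `cudf_roundtrip_dtype` and `find_mismatches` are hypothetical names introduced here, the column name and values are copied from the reproducer above, and the cudf calls are limited to the ones already used in this report.

```python
import os
import tempfile

# Resolutions discussed in this report.
UNITS = ("s", "ms", "us", "ns")


def cudf_roundtrip_dtype(unit):
    """Write a timedelta64[unit] column with cudf and return the dtype read back.

    Hypothetical helper; cudf is imported lazily because it needs a GPU environment.
    """
    import cudf

    df = cudf.DataFrame(
        {"seconds": cudf.Series([1234, 3456, 32442], dtype=f"timedelta64[{unit}]")}
    )
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "roundtrip.parquet")
        df.to_parquet(path)
        return str(cudf.read_parquet(path).dtypes["seconds"])


def find_mismatches(roundtrip_dtype, units=UNITS):
    """Map each unit whose dtype changed in the round trip to the dtype that came back."""
    mismatches = {}
    for unit in units:
        expected = f"timedelta64[{unit}]"
        actual = roundtrip_dtype(unit)
        if actual != expected:
            mismatches[unit] = actual
    return mismatches


# On a machine with cudf installed:
# print(find_mismatches(cudf_roundtrip_dtype))
```

Going by the behavior described above, the call should flag only the `s` resolution while the bug is present and return an empty dict once the writer is fixed. Passing the round-trip function as an argument keeps the comparison logic usable with other readers and writers, such as the pyarrow path shown earlier.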

Environment details

     ***git***
     commit 9b1496df64b9ae9bd7b44a30cfaa42a2f7e2db3f (HEAD -> branch-23.06)
     Author: Ashwin Srinath <3190405+shwina@users.noreply.github.com>
     Date:   Mon May 22 13:52:36 2023 -0400

     Fix groupby head/tail for empty dataframe (#13398)

     Closes #13397

     Authors:
     - Ashwin Srinath (https://github.com/shwina)

     Approvers:
     - GALI PREM SAGAR (https://github.com/galipremsagar)
     - Bradley Dice (https://github.com/bdice)

     URL: https://github.com/rapidsai/cudf/pull/13398
     ***git submodules***

     ***OS Information***
     DISTRIB_ID=Ubuntu
     DISTRIB_RELEASE=18.04
     DISTRIB_CODENAME=bionic
     DISTRIB_DESCRIPTION="Ubuntu 18.04.4 LTS"
     NAME="Ubuntu"
     VERSION="18.04.4 LTS (Bionic Beaver)"
     ID=ubuntu
     ID_LIKE=debian
     PRETTY_NAME="Ubuntu 18.04.4 LTS"
     VERSION_ID="18.04"
     HOME_URL="https://www.ubuntu.com/"
     SUPPORT_URL="https://help.ubuntu.com/"
     BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
     PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
     VERSION_CODENAME=bionic
     UBUNTU_CODENAME=bionic
     Linux dt07 4.15.0-76-generic #86-Ubuntu SMP Fri Jan 17 17:24:28 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

     ***GPU Information***
     Mon May 22 13:53:56 2023
     +---------------------------------------------------------------------------------------+
     | NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
     |-----------------------------------------+----------------------+----------------------+
     | GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
     | Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
     |                                         |                      |               MIG M. |
     |=========================================+======================+======================|
     |   0  Tesla T4                        On | 00000000:3B:00.0 Off |                    0 |
     | N/A   45C    P8               10W /  70W|      2MiB / 15360MiB |      0%      Default |
     |                                         |                      |                  N/A |
     +-----------------------------------------+----------------------+----------------------+
     |   1  Tesla T4                        On | 00000000:5E:00.0 Off |                    0 |
     | N/A   34C    P8                9W /  70W|      2MiB / 15360MiB |      0%      Default |
     |                                         |                      |                  N/A |
     +-----------------------------------------+----------------------+----------------------+
     |   2  Tesla T4                        On | 00000000:AF:00.0 Off |                    0 |
     | N/A   29C    P8               10W /  70W|      2MiB / 15360MiB |      0%      Default |
     |                                         |                      |                  N/A |
     +-----------------------------------------+----------------------+----------------------+
     |   3  Tesla T4                        On | 00000000:D8:00.0 Off |                    0 |
     | N/A   29C    P8               10W /  70W|      2MiB / 15360MiB |      0%      Default |
     |                                         |                      |                  N/A |
     +-----------------------------------------+----------------------+----------------------+

     +---------------------------------------------------------------------------------------+
     | Processes:                                                                            |
     |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
     |        ID   ID                                                             Usage      |
     |=======================================================================================|
     |  No running processes found                                                           |
     +---------------------------------------------------------------------------------------+

     ***CPU***
     Architecture:        x86_64
     CPU op-mode(s):      32-bit, 64-bit
     Byte Order:          Little Endian
     CPU(s):              64
     On-line CPU(s) list: 0-63
     Thread(s) per core:  2
     Core(s) per socket:  16
     Socket(s):           2
     NUMA node(s):        2
     Vendor ID:           GenuineIntel
     CPU family:          6
     Model:               85
     Model name:          Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz
     Stepping:            4
     CPU MHz:             1412.660
     BogoMIPS:            4200.00
     Virtualization:      VT-x
     L1d cache:           32K
     L1i cache:           32K
     L2 cache:            1024K
     L3 cache:            22528K
     NUMA node0 CPU(s):   0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62
     NUMA node1 CPU(s):   1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63
     Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear flush_l1d arch_capabilities

     ***CMake***
     /nvme/0/pgali/envs/cudfdev/bin/cmake
     cmake version 3.26.4

     CMake suite maintained and supported by Kitware (kitware.com/cmake).

     ***g++***
     /nvme/0/pgali/envs/cudfdev/bin/g++
     g++ (conda-forge gcc 11.3.0-19) 11.3.0
     Copyright (C) 2021 Free Software Foundation, Inc.
     This is free software; see the source for copying conditions.  There is NO
     warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

     ***nvcc***
     /nvme/0/pgali/envs/cudfdev/bin/nvcc
     nvcc: NVIDIA (R) Cuda compiler driver
     Copyright (c) 2005-2022 NVIDIA Corporation
     Built on Wed_Sep_21_10:33:58_PDT_2022
     Cuda compilation tools, release 11.8, V11.8.89
     Build cuda_11.8.r11.8/compiler.31833905_0

     ***Python***
     /nvme/0/pgali/envs/cudfdev/bin/python
     Python 3.10.11

     ***Environment Variables***
     PATH                            : /nvme/0/pgali/envs/cudfdev/bin:/nvme/0/pgali/envs/cudfdev/bin:/nvme/0/pgali/.cargo/bin:/home/nfs/pgali/.vscode-server/bin/b3e4e68a0bc097f0ae7907b217c1119af9e03435/bin/remote-cli:/nvme/0/pgali/.cargo/bin:/nvme/0/pgali/anaconda3/bin:/nvme/0/pgali/anaconda3/condabin:/nvme/0/pgali/.cargo/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/cuda/bin
     LD_LIBRARY_PATH                 : /usr/local/cuda/lib64::/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
     NUMBAPRO_NVVM                   :
     NUMBAPRO_LIBDEVICE              :
     CONDA_PREFIX                    : /nvme/0/pgali/envs/cudfdev
     PYTHON_PATH                     :

     ***conda packages***
     /nvme/0/pgali/anaconda3/bin/conda
     # packages in environment at /nvme/0/pgali/envs/cudfdev:
     #
     # Name                    Version                   Build  Channel
     _libgcc_mutex             0.1                 conda_forge    conda-forge
     _openmp_mutex             4.5                       2_gnu    conda-forge
     _sysroot_linux-64_curr_repodata_hack 3                   h69a702a_13    conda-forge
     accessible-pygments       0.0.4              pyhd8ed1ab_0    conda-forge
     aiobotocore               2.5.0              pyhd8ed1ab_0    conda-forge
     aiohttp                   3.8.4           py310h1fa729e_0    conda-forge
     aioitertools              0.11.0             pyhd8ed1ab_0    conda-forge
     aiosignal                 1.3.1              pyhd8ed1ab_0    conda-forge
     alabaster                 0.7.13             pyhd8ed1ab_0    conda-forge
     anyio                     3.6.2              pyhd8ed1ab_0    conda-forge
     argon2-cffi               21.3.0             pyhd8ed1ab_0    conda-forge
     argon2-cffi-bindings      21.2.0          py310h5764c6d_3    conda-forge
     arrow-cpp                 11.0.0          ha770c72_20_cpu    conda-forge
     asttokens                 2.2.1              pyhd8ed1ab_0    conda-forge
     async-timeout             4.0.2              pyhd8ed1ab_0    conda-forge
     attrs                     23.1.0             pyh71513ae_1    conda-forge
     aws-c-auth                0.6.27               he072965_1    conda-forge
     aws-c-cal                 0.5.26               hf677bf3_1    conda-forge
     aws-c-common              0.8.19               hd590300_0    conda-forge
     aws-c-compression         0.2.16               hbad4bc6_7    conda-forge
     aws-c-event-stream        0.2.20               hb4b372c_7    conda-forge
     aws-c-http                0.7.7                h2632f9a_4    conda-forge
     aws-c-io                  0.13.21              h9fef7b8_5    conda-forge
     aws-c-mqtt                0.8.11               h2282364_1    conda-forge
     aws-c-s3                  0.3.0                hcb5a9b2_2    conda-forge
     aws-c-sdkutils            0.1.9                hbad4bc6_2    conda-forge
     aws-checksums             0.1.14               hbad4bc6_7    conda-forge
     aws-crt-cpp               0.20.1               he0fdcb3_3    conda-forge
     aws-sam-translator        1.55.0             pyhd8ed1ab_0    conda-forge
     aws-sdk-cpp               1.10.57             hb0b1f3a_12    conda-forge
     aws-xray-sdk              2.12.0             pyhd8ed1ab_0    conda-forge
     babel                     2.12.1             pyhd8ed1ab_1    conda-forge
     backcall                  0.2.0              pyh9f0ad1d_0    conda-forge
     backports                 1.0                pyhd8ed1ab_3    conda-forge
     backports.functools_lru_cache 1.6.4              pyhd8ed1ab_0    conda-forge
     backports.zoneinfo        0.2.1           py310hff52083_7    conda-forge
     bcrypt                    3.2.2           py310h5764c6d_1    conda-forge
     beautifulsoup4            4.12.2             pyha770c72_0    conda-forge
     binutils                  2.39                 hdd6e379_1    conda-forge
     binutils_impl_linux-64    2.39                 he00db2b_1    conda-forge
     binutils_linux-64         2.39                h5fc0e48_13    conda-forge
     blas                      1.0                         mkl    conda-forge
     bleach                    6.0.0              pyhd8ed1ab_0    conda-forge
     blinker                   1.6.2              pyhd8ed1ab_0    conda-forge
     bokeh                     2.4.3              pyhd8ed1ab_3    conda-forge
     boto3                     1.26.76            pyhd8ed1ab_0    conda-forge
     botocore                  1.29.76            pyhd8ed1ab_0    conda-forge
     brotlipy                  0.7.0           py310h5764c6d_1005    conda-forge
     bzip2                     1.0.8                h7f98852_4    conda-forge
     c-ares                    1.19.0               hd590300_0    conda-forge
     c-compiler                1.5.2                h0b41bf4_0    conda-forge
     ca-certificates           2023.5.7             hbcca054_0    conda-forge
     cachetools                5.3.0              pyhd8ed1ab_0    conda-forge
     certifi                   2023.5.7           pyhd8ed1ab_0    conda-forge
     cffi                      1.15.1          py310h255011f_3    conda-forge
     cfgv                      3.3.1              pyhd8ed1ab_0    conda-forge
     cfn-lint                  0.75.1             pyhd8ed1ab_0    conda-forge
     charset-normalizer        2.1.1              pyhd8ed1ab_0    conda-forge
     click                     8.1.3           unix_pyhd8ed1ab_2    conda-forge
     cloudpickle               2.2.1              pyhd8ed1ab_0    conda-forge
     cmake                     3.26.4               hcfe8598_0    conda-forge
     colorama                  0.4.6              pyhd8ed1ab_0    conda-forge
     comm                      0.1.3              pyhd8ed1ab_0    conda-forge
     commonmark                0.9.1                      py_0    conda-forge
     coverage                  7.2.5           py310h2372a71_0    conda-forge
     cryptography              40.0.2          py310h34c0648_0    conda-forge
     cubinlinker               0.2.2           py310hf09951c_0    rapidsai
     cuda-python               11.8.1          py310h01a121a_2    conda-forge
     cuda-sanitizer-api        11.8.86                       0    nvidia
     cudatoolkit               11.8.0              h37601d7_11    conda-forge
     cudf                      23.6.0                   pypi_0    pypi
     cupy                      12.0.0          py310h9216885_1    conda-forge
     cxx-compiler              1.5.2                hf52228f_0    conda-forge
     cyrus-sasl                2.1.27               h9033bb2_6    conda-forge
     cython                    0.29.34         py310heca2aa9_0    conda-forge
     cytoolz                   0.12.0          py310h5764c6d_1    conda-forge
     dask                      2023.3.2           pyhd8ed1ab_0    conda-forge
     dask-core                 2023.3.2           pyhd8ed1ab_0    conda-forge
     dask-cuda                 23.06.00a       py310_230522_gcf6e9fb_24    rapidsai-nightly
     dask-cudf                 23.6.0                   pypi_0    pypi
     dataclasses               0.8                pyhc8e2a94_3    conda-forge
     datasets                  2.12.0             pyhd8ed1ab_0    conda-forge
     debugpy                   1.6.7           py310heca2aa9_0    conda-forge
     decopatch                 1.4.10             pyhd8ed1ab_0    conda-forge
     decorator                 5.1.1              pyhd8ed1ab_0    conda-forge
     defusedxml                0.7.1              pyhd8ed1ab_0    conda-forge
     dill                      0.3.6              pyhd8ed1ab_1    conda-forge
     distlib                   0.3.6              pyhd8ed1ab_0    conda-forge
     distributed               2023.3.2.1         pyhd8ed1ab_0    conda-forge
     distro                    1.8.0              pyhd8ed1ab_0    conda-forge
     dlpack                    0.5                  h9c3ff4c_0    conda-forge
     docker-py                 6.1.0              pyhd8ed1ab_0    conda-forge
     docutils                  0.19            py310hff52083_1    conda-forge
     doxygen                   1.8.20               had0d8f1_0    conda-forge
     ecdsa                     0.18.0             pyhd8ed1ab_1    conda-forge
     entrypoints               0.4                pyhd8ed1ab_0    conda-forge
     exceptiongroup            1.1.1              pyhd8ed1ab_0    conda-forge
     execnet                   1.9.0              pyhd8ed1ab_0    conda-forge
     executing                 1.2.0              pyhd8ed1ab_0    conda-forge
     expat                     2.5.0                hcb278e6_1    conda-forge
     fastavro                  1.7.4           py310h2372a71_0    conda-forge
     fastrlock                 0.8             py310hd8f1fbe_3    conda-forge
     filelock                  3.12.0             pyhd8ed1ab_0    conda-forge
     flask                     2.3.2              pyhd8ed1ab_0    conda-forge
     flask_cors                3.0.10             pyhd3deb0d_0    conda-forge
     flit-core                 3.9.0              pyhd8ed1ab_0    conda-forge
     fmt                       9.1.0                h924138e_0    conda-forge
     freetype                  2.12.1               hca18f0e_1    conda-forge
     frozenlist                1.3.3           py310h5764c6d_0    conda-forge
     fsspec                    2023.5.0           pyh1a96a4e_0    conda-forge
     future                    0.18.3             pyhd8ed1ab_0    conda-forge
     gcc                       11.3.0              h02d0930_13    conda-forge
     gcc_impl_linux-64         11.3.0              hab1b70f_19    conda-forge
     gcc_linux-64              11.3.0              he6f903b_13    conda-forge
     gflags                    2.2.2             he1b5a44_1004    conda-forge
     glog                      0.6.0                h6f12383_0    conda-forge
     gmock                     1.13.0               ha770c72_1    conda-forge
     gmp                       6.2.1                h58526e2_0    conda-forge
     gmpy2                     2.1.2           py310h3ec546c_1    conda-forge
     graphql-core              3.2.3              pyhd8ed1ab_0    conda-forge
     greenlet                  2.0.2           py310hc6cd4ac_1    conda-forge
     gtest                     1.13.0               h00ab1b0_1    conda-forge
     gxx                       11.3.0              h02d0930_13    conda-forge
     gxx_impl_linux-64         11.3.0              hab1b70f_19    conda-forge
     gxx_linux-64              11.3.0              hc203a17_13    conda-forge
     huggingface_hub           0.14.1             pyhd8ed1ab_0    conda-forge
     hypothesis                6.75.3             pyha770c72_0    conda-forge
     identify                  2.5.24             pyhd8ed1ab_0    conda-forge
     idna                      3.4                pyhd8ed1ab_0    conda-forge
     imagesize                 1.4.1              pyhd8ed1ab_0    conda-forge
     importlib-metadata        6.6.0              pyha770c72_0    conda-forge
     importlib_metadata        6.6.0                hd8ed1ab_0    conda-forge
     iniconfig                 2.0.0              pyhd8ed1ab_0    conda-forge
     intel-openmp              2022.1.0          h9e868ea_3769
     ipykernel                 6.23.1             pyh210e3f2_0    conda-forge
     ipython                   8.13.2             pyh41d4057_0    conda-forge
     ipython_genutils          0.2.0                      py_1    conda-forge
     itsdangerous              2.1.2              pyhd8ed1ab_0    conda-forge
     jedi                      0.18.2             pyhd8ed1ab_0    conda-forge
     jinja2                    3.1.2              pyhd8ed1ab_1    conda-forge
     jmespath                  1.0.1              pyhd8ed1ab_0    conda-forge
     joblib                    1.2.0              pyhd8ed1ab_0    conda-forge
     jschema-to-python         1.2.3              pyhd8ed1ab_0    conda-forge
     jsondiff                  2.0.0              pyhd8ed1ab_0    conda-forge
     jsonpatch                 1.32               pyhd8ed1ab_0    conda-forge
     jsonpickle                2.2.0              pyhd8ed1ab_0    conda-forge
     jsonpointer               2.0                        py_0    conda-forge
     jsonschema                3.2.0              pyhd8ed1ab_3    conda-forge
     junit-xml                 1.9                pyh9f0ad1d_0    conda-forge
     jupyter-cache             0.6.1              pyhd8ed1ab_0    conda-forge
     jupyter_client            8.2.0              pyhd8ed1ab_0    conda-forge
     jupyter_core              5.3.0           py310hff52083_0    conda-forge
     jupyter_events            0.6.3              pyhd8ed1ab_0    conda-forge
     jupyter_server            2.5.0              pyhd8ed1ab_0    conda-forge
     jupyter_server_terminals  0.4.4              pyhd8ed1ab_1    conda-forge
     jupyterlab_pygments       0.2.2              pyhd8ed1ab_0    conda-forge
     kernel-headers_linux-64   3.10.0              h4a8ded7_13    conda-forge
     keyutils                  1.6.1                h166bdaf_0    conda-forge
     krb5                      1.20.1               h81ceb04_0    conda-forge
     lcms2                     2.15                 haa2dc70_1    conda-forge
     ld_impl_linux-64          2.39                 hcc3a1bd_1    conda-forge
     lerc                      4.0.0                h27087fc_0    conda-forge
     libabseil                 20230125.2      cxx17_h59595ed_2    conda-forge
     libarrow                  11.0.0          h6564b11_20_cpu    conda-forge
     libblas                   3.9.0            16_linux64_mkl    conda-forge
     libbrotlicommon           1.0.9                h166bdaf_8    conda-forge
     libbrotlidec              1.0.9                h166bdaf_8    conda-forge
     libbrotlienc              1.0.9                h166bdaf_8    conda-forge
     libcblas                  3.9.0            16_linux64_mkl    conda-forge
     libcrc32c                 1.1.2                h9c3ff4c_0    conda-forge
     libcufile                 1.4.0.31                      0    nvidia
     libcufile-dev             1.4.0.31                      0    nvidia
     libcurand                 10.3.0.86                     0    nvidia
     libcurand-dev             10.3.0.86                     0    nvidia
     libcurl                   8.1.0                h409715c_0    conda-forge
     libdeflate                1.18                 h0b41bf4_0    conda-forge
     libedit                   3.1.20191231         he28a2e2_2    conda-forge
     libev                     4.33                 h516909a_1    conda-forge
     libevent                  2.1.12               h3358134_0    conda-forge
     libexpat                  2.5.0                hcb278e6_1    conda-forge
     libffi                    3.4.2                h7f98852_5    conda-forge
     libgcc-devel_linux-64     11.3.0              h210ce93_19    conda-forge
     libgcc-ng                 12.2.0              h65d4601_19    conda-forge
     libgfortran-ng            12.2.0              h69a702a_19    conda-forge
     libgfortran5              12.2.0              h337968e_19    conda-forge
     libgomp                   12.2.0              h65d4601_19    conda-forge
     libgoogle-cloud           2.10.1               hac9eb74_1    conda-forge
     libgrpc                   1.54.2               hb20ce57_2    conda-forge
     libiconv                  1.17                 h166bdaf_0    conda-forge
     libjpeg-turbo             2.1.5.1              h0b41bf4_0    conda-forge
     libkvikio                 23.06.00a       cuda11_230522_g2fbcd33_26    rapidsai-nightly
     liblapack                 3.9.0            16_linux64_mkl    conda-forge
     libllvm11                 11.1.0               he0ac6c6_5    conda-forge
     libnghttp2                1.52.0               h61bc06f_0    conda-forge
     libnsl                    2.0.0                h7f98852_0    conda-forge
     libntlm                   1.4               h7f98852_1002    conda-forge
     libnuma                   2.0.16               h0b41bf4_1    conda-forge
     libpng                    1.6.39               h753d276_0    conda-forge
     libprotobuf               3.21.12              h3eb15da_0    conda-forge
     librdkafka                1.9.2                ha5a0de0_2    conda-forge
     librmm                    23.06.00a       cuda11_230522_gc11ea8a5_19    rapidsai-nightly
     libsanitizer              11.3.0              h239ccf8_19    conda-forge
     libsodium                 1.0.18               h36c2ea0_1    conda-forge
     libsqlite                 3.42.0               h2797004_0    conda-forge
     libssh2                   1.10.0               hf14f497_3    conda-forge
     libstdcxx-devel_linux-64  11.3.0              h210ce93_19    conda-forge
     libstdcxx-ng              12.2.0              h46fd767_19    conda-forge
     libthrift                 0.18.1               h8fd135c_1    conda-forge
     libtiff                   4.5.0                ha587672_6    conda-forge
     libutf8proc               2.8.0                h166bdaf_0    conda-forge
     libuuid                   2.38.1               h0b41bf4_0    conda-forge
     libuv                     1.44.2               h166bdaf_0    conda-forge
     libwebp-base              1.3.0                h0b41bf4_0    conda-forge
     libxcb                    1.15                 h0b41bf4_0    conda-forge
     libzlib                   1.2.13               h166bdaf_4    conda-forge
     livereload                2.6.3              pyh9f0ad1d_0    conda-forge
     llvmlite                  0.39.1          py310h58363a5_1    conda-forge
     locket                    1.0.0              pyhd8ed1ab_0    conda-forge
     lz4                       4.3.2           py310h0cfdcf0_0    conda-forge
     lz4-c                     1.9.4                hcb278e6_0    conda-forge
     makefun                   1.15.1             pyhd8ed1ab_0    conda-forge
     markdown                  3.4.3              pyhd8ed1ab_0    conda-forge
     markdown-it-py            2.2.0              pyhd8ed1ab_0    conda-forge
     markupsafe                2.1.2           py310h1fa729e_0    conda-forge
     matplotlib-inline         0.1.6              pyhd8ed1ab_0    conda-forge
     mdit-py-plugins           0.3.5              pyhd8ed1ab_0    conda-forge
     mdurl                     0.1.0              pyhd8ed1ab_0    conda-forge
     mimesis                   10.0.0             pyhd8ed1ab_0    conda-forge
     mistune                   2.0.5              pyhd8ed1ab_0    conda-forge
     mkl                       2022.1.0           hc2b9512_224
     moto                      4.1.10             pyhd8ed1ab_0    conda-forge
     mpc                       1.3.1                hfe3b2da_0    conda-forge
     mpfr                      4.2.0                hb012696_0    conda-forge
     msgpack-python            1.0.5           py310hdf3cbec_0    conda-forge
     multidict                 6.0.4           py310h1fa729e_0    conda-forge
     multiprocess              0.70.14         py310h5764c6d_3    conda-forge
     myst-nb                   0.17.2             pyhd8ed1ab_0    conda-forge
     myst-parser               0.18.1             pyhd8ed1ab_0    conda-forge
     nbclassic                 1.0.0              pyhb4ecaf3_1    conda-forge
     nbclient                  0.7.4              pyhd8ed1ab_0    conda-forge
     nbconvert                 7.2.9              pyhd8ed1ab_0    conda-forge
     nbconvert-core            7.2.9              pyhd8ed1ab_0    conda-forge
     nbconvert-pandoc          7.2.9              pyhd8ed1ab_0    conda-forge
     nbformat                  5.8.0              pyhd8ed1ab_0    conda-forge
     nbsphinx                  0.9.1              pyhd8ed1ab_0    conda-forge
     ncurses                   6.3                  h27087fc_1    conda-forge
     nest-asyncio              1.5.6              pyhd8ed1ab_0    conda-forge
     networkx                  2.8.8              pyhd8ed1ab_0    conda-forge
     ninja                     1.11.1               h924138e_0    conda-forge
     nodeenv                   1.8.0              pyhd8ed1ab_0    conda-forge
     notebook                  6.5.4              pyha770c72_0    conda-forge
     notebook-shim             0.2.3              pyhd8ed1ab_0    conda-forge
     numba                     0.56.4          py310h0e39c9b_1    conda-forge
     numpy                     1.23.5          py310h53a5b5f_0    conda-forge
     numpydoc                  1.5.0              pyhd8ed1ab_0    conda-forge
     nvcc_linux-64             11.8                h41dc85b_22    conda-forge
     nvtx                      0.2.5           py310h1fa729e_0    conda-forge
     openapi-schema-validator  0.2.3              pyhd8ed1ab_0    conda-forge
     openapi-spec-validator    0.4.0              pyhd8ed1ab_1    conda-forge
     openjpeg                  2.5.0                hfec8fc6_2    conda-forge
     openssl                   3.1.0                hd590300_3    conda-forge
     orc                       1.8.3                hfdbbad2_0    conda-forge
     packaging                 23.1               pyhd8ed1ab_0    conda-forge
     pandas                    1.5.3                    pypi_0    pypi
     pandoc                    3.1.2                h32600fe_1    conda-forge
     pandocfilters             1.5.0              pyhd8ed1ab_0    conda-forge
     paramiko                  3.1.0              pyhd8ed1ab_0    conda-forge
     parquet-cpp               1.5.1                         2    conda-forge
     parso                     0.8.3              pyhd8ed1ab_0    conda-forge
     partd                     1.4.0              pyhd8ed1ab_0    conda-forge
     pbr                       5.11.1             pyhd8ed1ab_0    conda-forge
     pexpect                   4.8.0              pyh1a96a4e_2    conda-forge
     pickleshare               0.7.5                   py_1003    conda-forge
     pillow                    9.5.0           py310h582fbeb_1    conda-forge
     pip                       23.1.2             pyhd8ed1ab_0    conda-forge
     platformdirs              3.5.1              pyhd8ed1ab_0    conda-forge
     pluggy                    1.0.0              pyhd8ed1ab_5    conda-forge
     pooch                     1.7.0              pyha770c72_3    conda-forge
     pre-commit                3.3.2              pyha770c72_0    conda-forge
     prometheus_client         0.16.0             pyhd8ed1ab_0    conda-forge
     prompt-toolkit            3.0.38             pyha770c72_0    conda-forge
     prompt_toolkit            3.0.38               hd8ed1ab_0    conda-forge
     protobuf                  4.21.12         py310heca2aa9_0    conda-forge
     psutil                    5.9.5           py310h1fa729e_0    conda-forge
     pthread-stubs             0.4               h36c2ea0_1001    conda-forge
     ptxcompiler               0.8.1           py310h01a121a_0    conda-forge
     ptyprocess                0.7.0              pyhd3deb0d_0    conda-forge
     pure_eval                 0.2.2              pyhd8ed1ab_0    conda-forge
     py-cpuinfo                9.0.0              pyhd8ed1ab_0    conda-forge
     pyarrow                   11.0.0          py310he6bfd7f_20_cpu    conda-forge
     pyasn1                    0.4.8                      py_0    conda-forge
     pycparser                 2.21               pyhd8ed1ab_0    conda-forge
     pydata-sphinx-theme       0.13.3             pyhd8ed1ab_0    conda-forge
     pygments                  2.15.1             pyhd8ed1ab_0    conda-forge
     pynacl                    1.5.0           py310h5764c6d_2    conda-forge
     pynvml                    11.4.1             pyhd8ed1ab_0    conda-forge
     pyopenssl                 23.1.1             pyhd8ed1ab_0    conda-forge
     pyorc                     0.8.0           py310hd52fb3e_4    conda-forge
     pyparsing                 3.0.9              pyhd8ed1ab_0    conda-forge
     pyrsistent                0.19.3          py310h1fa729e_0    conda-forge
     pysocks                   1.7.1              pyha2e5f31_6    conda-forge
     pytest                    7.3.1              pyhd8ed1ab_0    conda-forge
     pytest-benchmark          4.0.0              pyhd8ed1ab_0    conda-forge
     pytest-cases              3.6.14             pyhd8ed1ab_0    conda-forge
     pytest-cov                4.0.0              pyhd8ed1ab_0    conda-forge
     pytest-xdist              3.3.1              pyhd8ed1ab_0    conda-forge
     python                    3.10.11         he550d4f_0_cpython    conda-forge
     python-confluent-kafka    1.9.2           py310h5764c6d_2    conda-forge
     python-dateutil           2.8.2              pyhd8ed1ab_0    conda-forge
     python-fastjsonschema     2.17.1             pyhd8ed1ab_0    conda-forge
     python-jose               3.3.0              pyh6c4a22f_1    conda-forge
     python-json-logger        2.0.7              pyhd8ed1ab_0    conda-forge
     python-snappy             0.6.1           py310hcee4d7c_0    conda-forge
     python-xxhash             3.2.0           py310h1fa729e_0    conda-forge
     python_abi                3.10                    3_cp310    conda-forge
     pytorch                   1.11.0             py3.10_cpu_0    pytorch
     pytorch-mutex             1.0                         cpu    pytorch
     pytz                      2023.3             pyhd8ed1ab_0    conda-forge
     pywin32-on-windows        0.1.0              pyh1179c8e_3    conda-forge
     pyyaml                    6.0             py310h5764c6d_5    conda-forge
     pyzmq                     25.0.2          py310h059b190_0    conda-forge
     re2                       2023.03.02           h8c504da_0    conda-forge
     readline                  8.2                  h8228510_1    conda-forge
     recommonmark              0.7.1              pyhd8ed1ab_0    conda-forge
     regex                     2023.5.5        py310h2372a71_0    conda-forge
     requests                  2.31.0             pyhd8ed1ab_0    conda-forge
     responses                 0.18.0             pyhd8ed1ab_0    conda-forge
     rfc3339-validator         0.1.4              pyhd8ed1ab_0    conda-forge
     rfc3986-validator         0.1.1              pyh9f0ad1d_0    conda-forge
     rhash                     1.4.3                h166bdaf_0    conda-forge
     rmm                       23.06.00a       cuda11_py310_230522_gc11ea8a5_19    rapidsai-nightly
     rsa                       4.9                pyhd8ed1ab_0    conda-forge
     s2n                       1.3.44               h06160fa_0    conda-forge
     s3fs                      2023.5.0           pyhd8ed1ab_0    conda-forge
     s3transfer                0.6.1              pyhd8ed1ab_0    conda-forge
     sacremoses                0.0.53             pyhd8ed1ab_0    conda-forge
     sarif-om                  1.0.4              pyhd8ed1ab_0    conda-forge
     scikit-build              0.17.1             pyh56297ac_0    conda-forge
     scipy                     1.10.1          py310ha4c1d20_3    conda-forge
     sed                       4.8                  he412f7d_0    conda-forge
     send2trash                1.8.2              pyh41d4057_0    conda-forge
     setuptools                67.7.2             pyhd8ed1ab_0    conda-forge
     six                       1.16.0             pyh6c4a22f_0    conda-forge
     snappy                    1.1.10               h9fff704_0    conda-forge
     sniffio                   1.3.0              pyhd8ed1ab_0    conda-forge
     snowballstemmer           2.2.0              pyhd8ed1ab_0    conda-forge
     sortedcontainers          2.4.0              pyhd8ed1ab_0    conda-forge
     soupsieve                 2.3.2.post1        pyhd8ed1ab_0    conda-forge
     spdlog                    1.11.0               h9b3ece8_1    conda-forge
     sphinx                    5.3.0              pyhd8ed1ab_0    conda-forge
     sphinx-autobuild          2021.3.14          pyhd8ed1ab_0    conda-forge
     sphinx-copybutton         0.5.2              pyhd8ed1ab_0    conda-forge
     sphinx-markdown-tables    0.0.17             pyh6c4a22f_0    conda-forge
     sphinxcontrib-applehelp   1.0.4              pyhd8ed1ab_0    conda-forge
     sphinxcontrib-devhelp     1.0.2                      py_0    conda-forge
     sphinxcontrib-htmlhelp    2.0.1              pyhd8ed1ab_0    conda-forge
     sphinxcontrib-jsmath      1.0.1                      py_0    conda-forge
     sphinxcontrib-qthelp      1.0.3                      py_0    conda-forge
     sphinxcontrib-serializinghtml 1.1.5              pyhd8ed1ab_2    conda-forge
     sphinxcontrib-websupport  1.2.4              pyhd8ed1ab_1    conda-forge
     sqlalchemy                2.0.15          py310h2372a71_0    conda-forge
     sshpubkeys                3.3.1              pyhd8ed1ab_0    conda-forge
     stack_data                0.6.2              pyhd8ed1ab_0    conda-forge
     streamz                   0.6.4              pyh6c4a22f_0    conda-forge
     sysroot_linux-64          2.17                h4a8ded7_13    conda-forge
     tabulate                  0.9.0              pyhd8ed1ab_1    conda-forge
     tblib                     1.7.0              pyhd8ed1ab_0    conda-forge
     terminado                 0.17.1             pyh41d4057_0    conda-forge
     tinycss2                  1.2.1              pyhd8ed1ab_0    conda-forge
     tk                        8.6.12               h27826a3_0    conda-forge
     tokenizers                0.13.1          py310h633acb5_2    conda-forge
     toml                      0.10.2             pyhd8ed1ab_0    conda-forge
     tomli                     2.0.1              pyhd8ed1ab_0    conda-forge
     toolz                     0.12.0             pyhd8ed1ab_0    conda-forge
     tornado                   6.3.2           py310h2372a71_0    conda-forge
     tqdm                      4.65.0             pyhd8ed1ab_1    conda-forge
     traitlets                 5.9.0              pyhd8ed1ab_0    conda-forge
     transformers              4.24.0             pyhd8ed1ab_0    conda-forge
     typing-extensions         4.5.0                hd8ed1ab_0    conda-forge
     typing_extensions         4.5.0              pyha770c72_0    conda-forge
     tzdata                    2023.3                   pypi_0    pypi
     ucx                       1.14.1               h8c404fb_0    conda-forge
     ukkonen                   1.0.1           py310hbf28c38_3    conda-forge
     urllib3                   1.26.15            pyhd8ed1ab_0    conda-forge
     virtualenv                20.23.0            pyhd8ed1ab_0    conda-forge
     wcwidth                   0.2.6              pyhd8ed1ab_0    conda-forge
     webencodings              0.5.1                      py_1    conda-forge
     websocket-client          1.5.2              pyhd8ed1ab_0    conda-forge
     werkzeug                  2.3.4              pyhd8ed1ab_0    conda-forge
     wheel                     0.40.0             pyhd8ed1ab_0    conda-forge
     wrapt                     1.15.0          py310h1fa729e_0    conda-forge
     xmltodict                 0.13.0             pyhd8ed1ab_0    conda-forge
     xorg-libxau               1.0.11               hd590300_0    conda-forge
     xorg-libxdmcp             1.1.3                h7f98852_0    conda-forge
     xxhash                    0.8.1                h0b41bf4_0    conda-forge
     xz                        5.2.6                h166bdaf_0    conda-forge
     yaml                      0.2.5                h7f98852_2    conda-forge
     yarl                      1.9.1           py310h2372a71_0    conda-forge
     zeromq                    4.3.4                h9c3ff4c_1    conda-forge
     zict                      3.0.0              pyhd8ed1ab_0    conda-forge
     zipp                      3.15.0             pyhd8ed1ab_0    conda-forge
     zlib                      1.2.13               h166bdaf_4    conda-forge
     zstd                      1.5.2                h3eb15da_6    conda-forge


galipremsagar commented 1 year ago

Possibly related: an earlier PR worked on a similar issue: https://github.com/rapidsai/cudf/pull/11854

galipremsagar commented 1 year ago

a_cudf.zip a_pyarrow.zip

GregoryKimball commented 1 year ago

This seems to be a problem where libcudf is not writing datetime64[s] or timedelta64[s] correctly. My testing shows that libcudf also does not round-trip these types faithfully:

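import pyarrow as pa
import pyarrow.parquet as pq  # imported explicitly so that pq.read_table is available
import cudf

# Write each temporal dtype with cudf, then read it back with cudf and pyarrow.
for dtype in [
    'timedelta64[s]',
    'timedelta64[ms]',
    'timedelta64[us]',
    'timedelta64[ns]',
    'datetime64[s]',
    'datetime64[ms]',
    'datetime64[us]',
    'datetime64[ns]',
]:
    df = cudf.DataFrame({"s": cudf.Series([1234, 3456, 32442], dtype=dtype)})
    df.to_parquet("a")
    df2 = cudf.read_parquet("a")
    df3 = pq.read_table("a")

    # original dtype, dtype read back by cudf, type read back by pyarrow
    print(df['s'].dtype, df2['s'].dtype, df3['s'].type)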

output

timedelta64[s] timedelta64[ms] time32[ms]
timedelta64[ms] timedelta64[ms] time32[ms]
timedelta64[us] timedelta64[us] time64[us]
timedelta64[ns] timedelta64[ns] time64[ns]
datetime64[s] datetime64[ms] timestamp[ms]
datetime64[ms] datetime64[ms] timestamp[ms]
datetime64[us] datetime64[us] timestamp[us]
datetime64[ns] datetime64[ns] timestamp[ns]

mhaseeb123 commented 5 months ago

Investigation Notes:

  1. SECONDS is not a valid TimeUnit in Parquet, so both cudf and Arrow convert it to milliseconds.
  2. I have been able to change this behavior locally by adding SECONDS to our TimeUnit enum class. It round-trips correctly with cudf but produces an error (invalid unit) when read with pyarrow's Parquet reader.
  3. cudf's timedelta actually corresponds to Arrow's duration type rather than its time type, as seen in cudf's to_arrow and from_arrow functions. However, it is not yet possible to convert between timedelta64 and duration using the Parquet spec alone.
  4. This is because Arrow encodes duration as int64 in Parquet instead of TimeType, and it preserves the original type by also writing the serialized Arrow schema into the Parquet file: https://github.com/apache/arrow/issues/23117 and https://github.com/apache/arrow/pull/12449/.
  5. Arrow types and Parquet types are different sets, mapped to each other as needed using the Arrow schema stored as part of the Parquet file.
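
Point 1 can be illustrated with a small stand-alone model. This is a sketch only, not cuDF or Arrow code: `ParquetTimeUnit`, `TICKS_PER_SECOND`, and `coerce_to_parquet_unit` are hypothetical names, and only the three unit names (MILLIS, MICROS, NANOS) come from the Parquet format. It shows why a seconds column, having no matching unit, ends up stored as milliseconds with its values scaled by 1000.

```python
from enum import Enum


class ParquetTimeUnit(Enum):
    """The three units Parquet's TimeUnit can express (there is no SECONDS)."""
    MILLIS = "ms"
    MICROS = "us"
    NANOS = "ns"


# Hypothetical lookup table: ticks per second for each source unit.
TICKS_PER_SECOND = {"s": 1, "ms": 1_000, "us": 1_000_000, "ns": 1_000_000_000}


def coerce_to_parquet_unit(values, unit):
    """Return (values, ParquetTimeUnit) in a unit Parquet can represent.

    Units Parquet already supports pass through unchanged; seconds are
    promoted to the nearest supported unit, milliseconds.
    """
    if unit not in TICKS_PER_SECOND:
        raise ValueError(f"unknown unit: {unit!r}")
    if unit == "s":
        return [v * 1_000 for v in values], ParquetTimeUnit.MILLIS
    return list(values), ParquetTimeUnit(unit)


# The values from the original report, declared in seconds.
values, unit = coerce_to_parquet_unit([1234, 3456, 32442], "s")
print(values, unit.name)
```

A reader of such a file sees only the millisecond unit unless the writer also records the original type somewhere else, which is the role the serialized Arrow schema plays in points 4 and 5.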

mhaseeb123 commented 2 months ago

Update:

Support for duration[s]/timedelta64[s] has been added through arrow:schema support in the cuDF Parquet reader and writer, and these types now round-trip faithfully.

For datetime64[s]/timestamp[s], both cuDF and Arrow convert [s] units to [ms] when writing Parquet, and the two interoperate and round-trip the values faithfully regardless of the unit. Although Arrow does not use arrow:schema to restore the unit, we can do so in cuDF if needed.

The question is whether we should do that or leave it as is, since the notion of a unit in timestamp columns seems arbitrary (in both cuDF and Arrow): the data are treated, displayed, and interpreted as absolute values since the epoch (e.g. 1970-01-01 00:00:01.234) regardless of the unit. Example:

from io import BytesIO

import cudf
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def datetime_interop():
    for unit_type in [
        "timestamp[s]",
        "timestamp[ms]",
        "timestamp[us]",
    ]:
        times = pa.array(
            [1234, 3456, 32442], type=unit_type
        )
        names = ["d"]
        pa_table = pa.Table.from_arrays([times], names=names)
        buf = BytesIO()

        # write with pyarrow, then read back with both cudf and pyarrow
        pq.write_table(pa_table, buf)
        df2 = cudf.read_parquet(buf)
        df3 = pq.read_table(buf)

        # prints the same values (ignore units)
        print("Original table (pa)\n", pa_table)
        print("cudf read parquet\n", df2)
        print("pyarrow read parquet\n", df3)

        # convert all to pd.Timestamp without caring about column units
        value1 = pd.Timestamp(pa_table["d"][0].as_py())
        value2 = pd.Timestamp(df2["d"][0])
        value3 = pd.Timestamp(df3["d"][0].as_py())

        # check equality
        assert value1 == value2
        assert value1 == value3
        # redundant but anyway
        assert value2 == value3
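
The unit-independence argument can also be checked without a GPU or pyarrow. The sketch below uses only the Python standard library; `EPOCH`, `MICROS_PER_TICK`, and `to_instant` are hypothetical names introduced for illustration, not cuDF or Arrow APIs. It interprets an integer tick count in a given unit as an instant since the Unix epoch and shows that rescaling seconds to milliseconds leaves that instant unchanged.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Microseconds per tick for each unit (nanoseconds are omitted because
# datetime only has microsecond resolution).
MICROS_PER_TICK = {"s": 1_000_000, "ms": 1_000, "us": 1}


def to_instant(ticks, unit):
    """Interpret an integer tick count since the epoch as a datetime."""
    return EPOCH + timedelta(microseconds=ticks * MICROS_PER_TICK[unit])


original = to_instant(1234, "s")         # value as held in a [s] column
stored = to_instant(1234 * 1_000, "ms")  # same value after an [s] -> [ms] write

# The stored integers differ by a factor of 1000, the instant does not.
assert original == stored
print(original.isoformat())
```

Only the integer representation and the declared unit change across the write; any comparison made on the decoded instant is unaffected.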

mhaseeb123 commented 2 months ago

Closing this issue for now, since units carry no meaning for timestamp types: the values are treated and displayed as absolute values. Please see the previous comment for the update.