[BUG] cudf.to_pandas doesn't handle correctly datetime[ms] #3206

Closed aucahuasi closed 4 years ago

aucahuasi commented 4 years ago

Describe the bug

If a cudf DataFrame has a column of type datetime64[ms] and I call to_pandas on it, the resulting column has the wrong time resolution. It seems to always come back as datetime64[ns].

Steps/Code to reproduce bug

  1. Load a CSV file that contains a date64 column.
  2. Print the cudf DataFrame; the column type is datetime64[ms] (note the ms: milliseconds).
  3. Call to_pandas on it and check the resulting DataFrame: the datetime column is there, but with ns (nanoseconds) resolution.

Here is the script that reproduces the issue:

import numpy as np
import pandas as pd
import cudf

file_path = "/opt/tpch_tables/orders.csv"

column_names = [ 'o_orderkey', 'o_custkey', 'o_orderstatus', 'o_totalprice', 'o_orderdate', 'o_orderpriority', 'o_clerk', 'o_shippriority', 'o_comment']

data_types = ["int64", "int32", "str", "float64", "date64", "str", "str", "str", "str"]

gdf = cudf.read_csv(file_path, delimiter = '|', names = column_names, dtype =  data_types)

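print("Input CUDF:")

# Show the column dtypes; o_orderdate is reported as datetime64[ms] here
print(gdf.dtypes)

df = gdf.to_pandas()

print("Input Pandas Dataframe with bad datetime unit:")

# After conversion, o_orderdate is reported as datetime64[ns]
print(df.dtypes)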

Expected behavior

to_pandas should return a DataFrame that keeps the original time resolution when the column is a datetime.

Environment overview (please complete the following information)

Environment details

Additional context

This issue also affects dask-cudf when a dask worker needs to process a cudf DataFrame that has a datetime column.

kkraus14 commented 4 years ago

This is a limitation of Pandas only supporting datetime64[ns] and not a bug.
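
To illustrate the limitation described above without needing a GPU, here is a minimal sketch that uses a plain NumPy array as a stand-in for the cudf column. The sample dates and the variable names (order_dates_ms, restored_ms) are made up for illustration and do not come from the reported file. On pandas versions that only store datetime64[ns], the Series dtype prints as nanoseconds; newer pandas releases (2.0 and later) can keep other units, so the intermediate dtype may differ. Either way, casting back on the NumPy side is one possible way to get millisecond values again.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the o_orderdate column: millisecond-resolution
# timestamps in a plain NumPy array (assumption: sample dates are invented).
order_dates_ms = np.array(
    ["1996-01-02", "1996-12-01", "1993-10-14"], dtype="datetime64[ms]"
)

# Wrapping the array in a pandas Series is where the unit may change:
# pandas versions limited to datetime64[ns] convert the values to nanoseconds.
series = pd.Series(order_dates_ms, name="o_orderdate")
print("pandas dtype:", series.dtype)

# Cast back to milliseconds on the NumPy side when a consumer needs that unit.
restored_ms = series.to_numpy().astype("datetime64[ms]")
print("restored dtype:", restored_ms.dtype)

# The instants are unchanged; only the storage unit differed along the way.
assert (restored_ms == order_dates_ms).all()
```

For dates in the usual range the round trip loses nothing, because nanosecond storage is finer than millisecond storage. The cast only matters for code that inspects the dtype or the raw integer values.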