rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] `cudf.read_json` is incorrectly parsing TimeStamp typed columns #6382

Open rgsl888prabhu opened 4 years ago

rgsl888prabhu commented 4 years ago

Describe the bug

cudf.read_json fails to parse datetime64-typed columns correctly when the expected dtype is provided.

Steps/Code to reproduce bug

>>> import cudf
>>> import pandas as pd
>>> pdf = pd.DataFrame({"a":[45461150050, 55414521000, 4544624522000, 4546345758000, 45445254600]}, dtype='datetime64[ms]')
>>> pdf
                        a
0 1970-01-01 00:00:45.461
1 1970-01-01 00:00:55.414
2 1970-01-01 01:15:44.624
3 1970-01-01 01:15:46.345
4 1970-01-01 00:00:45.445
>>> buffer = pdf.to_json(compression='infer', lines=True, orient="records")
>>> buffer
'{"a":45461}\n{"a":55414}\n{"a":4544624}\n{"a":4546345}\n{"a":45445}'
>>> df = cudf.read_json(buffer, compression='infer', lines=True, orient="records", dtype=['timestamp[ms]'])
>>> df
                        a
0 1969-12-31 23:59:59.999
1 1969-12-31 23:59:59.999
2 1969-12-31 23:59:59.999
3 1969-12-31 23:59:59.999
4 1969-12-31 23:59:59.999

If dtype isn't specified and we cast the resulting int64 column to datetime64[ms] afterwards, we get the expected result:

>>> expected_df = cudf.read_json(buffer, compression='infer', lines=True, orient="records")
>>> expected_df['a'] = expected_df['a'].astype('datetime64[ms]')
>>> expected_df
                        a
0 1970-01-01 00:00:45.461
1 1970-01-01 00:00:55.414
2 1970-01-01 01:15:44.624
3 1970-01-01 01:15:46.345
4 1970-01-01 00:00:45.445
>>> 

Expected behavior

cudf.read_json should honor the dtype argument and return the correct timestamps.

>>> df = cudf.read_json(buffer, compression='infer', lines=True, orient="records", dtype=['timestamp[ms]'])
>>> df

                        a
0 1970-01-01 00:00:45.461
1 1970-01-01 00:00:55.414
2 1970-01-01 01:15:44.624
3 1970-01-01 01:15:46.345
4 1970-01-01 00:00:45.445
>>> 
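The expected values follow from reading each integer in the buffer as milliseconds since the Unix epoch. As a rough illustration of that interpretation only (standard library, no cudf; `parse_epoch_ms` is a hypothetical helper name, not part of any library), a minimal sketch:

```python
import json
from datetime import datetime, timedelta, timezone

# Same JSON Lines payload as `buffer` in the report above.
buffer = '{"a":45461}\n{"a":55414}\n{"a":4544624}\n{"a":4546345}\n{"a":45445}'

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_epoch_ms(json_lines, column):
    """Read one integer column from JSON Lines as milliseconds since the epoch.

    Hypothetical helper for illustration; it is not how cudf parses JSON.
    """
    rows = (json.loads(line) for line in json_lines.splitlines() if line.strip())
    return [EPOCH + timedelta(milliseconds=row[column]) for row in rows]


timestamps = parse_epoch_ms(buffer, "a")
for ts in timestamps:
    # Trim microseconds to milliseconds to match the DataFrame display.
    print(ts.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3])
```

The printed values should match the expected output above, which is also what the int64-then-astype workaround produces.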


github-actions[bot] commented 3 years ago

This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

vyasr commented 6 months ago

This still fails, but differently: the failure now occurs during dtype detection of the timestamp type. (The signature of read_json has also changed subtly: dtype now needs to be a dict.)

In [6]: >>> import cudf
   ...: >>> import pandas as pd
   ...: >>> pdf = pd.DataFrame({"a":[45461150050, 55414521000, 4544624522000, 4546345758000, 45445254600]}, dtype='datetime64[ms]')
   ...: >>> buffer = pdf.to_json(compression='infer', lines=True, orient="records")
   ...: >>> buffer
   ...: '{"a":45461}\n{"a":55414}\n{"a":4544624}\n{"a":4546345}\n{"a":45445}'
   ...: >>> df = cudf.read_json(buffer, compression='infer', lines=True, orient="records", dtype={"a": 'timestamp[ms]'})
   ...: >>> df
...

File ~/.conda/envs/rapids/lib/python3.10/site-packages/pandas/core/dtypes/common.py:1645, in pandas_dtype(dtype)
   1640     with warnings.catch_warnings():
   1641         # GH#51523 - Series.astype(np.integer) doesn't show
   1642         # numpy deprecation warning of np.integer
   1643         # Hence enabling DeprecationWarning
   1644         warnings.simplefilter("always", DeprecationWarning)
-> 1645         npdtype = np.dtype(dtype)
   1646 except SyntaxError as err:
   1647     # np.dtype uses `eval` which can raise SyntaxError
   1648     raise TypeError(f"data type '{dtype}' not understood") from err

TypeError: data type 'timestamp[ms]' not understood
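The traceback ends in `np.dtype(dtype)`, which suggests the string is being handed to NumPy, where `timestamp[ms]` is not a recognized dtype name while `datetime64[ms]` is. A small sketch of that check in isolation (NumPy only; `is_numpy_dtype_string` is a hypothetical helper, and whether passing `datetime64[ms]` in the dtype dict makes read_json return correct values is not confirmed here):

```python
import numpy as np


def is_numpy_dtype_string(name):
    """Return True if NumPy can parse `name` as a dtype (illustrative helper)."""
    try:
        np.dtype(name)
    except TypeError:
        # Same exception type as the one at the bottom of the traceback.
        return False
    return True


print(is_numpy_dtype_string("datetime64[ms]"))
print(is_numpy_dtype_string("timestamp[ms]"))
```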