[BUG] `cudf.read_json` is incorrectly parsing TimeStamp typed columns

rgsl888prabhu commented 4 years ago

Describe the bug

cudf.read_json parses datetime64-typed columns incorrectly when the expected dtype is passed via the dtype argument.

Steps/Code to reproduce bug

>>> import cudf
>>> import pandas as pd
>>> pdf = pd.DataFrame({"a":[45461150050, 55414521000, 4544624522000, 4546345758000, 45445254600]}, dtype='datetime64[ms]')
>>> pdf
                        a
0 1970-01-01 00:00:45.461
1 1970-01-01 00:00:55.414
2 1970-01-01 01:15:44.624
3 1970-01-01 01:15:46.345
4 1970-01-01 00:00:45.445
>>> buffer = pdf.to_json(compression='infer', lines=True, orient="records")
>>> buffer
'{"a":45461}\n{"a":55414}\n{"a":4544624}\n{"a":4546345}\n{"a":45445}'
>>> df = cudf.read_json(buffer, compression='infer', lines=True, orient="records", dtype=['timestamp[ms]'])
>>> df
                        a
0 1969-12-31 23:59:59.999
1 1969-12-31 23:59:59.999
2 1969-12-31 23:59:59.999
3 1969-12-31 23:59:59.999
4 1969-12-31 23:59:59.999

If dtype isn't specified and the resulting int64 column is cast afterwards, we get the expected result:

>>> expected_df = cudf.read_json(buffer, compression='infer', lines=True, orient="records")
>>> expected_df['a'] = expected_df['a'].astype('datetime64[ms]')
>>> expected_df
                        a
0 1970-01-01 00:00:45.461
1 1970-01-01 00:00:55.414
2 1970-01-01 01:15:44.624
3 1970-01-01 01:15:46.345
4 1970-01-01 00:00:45.445
>>>
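
As a cross-check that does not depend on cuDF or pandas, the expected timestamps can be reproduced with the standard library alone. This is only a sketch: `parse_ms_column` is a hypothetical helper written for this comment, and it assumes the integers in the JSON Lines buffer are milliseconds since the Unix epoch, which is what the `[ms]` unit in the requested dtype implies.

```python
import json
from datetime import datetime, timedelta, timezone

# Same JSON Lines text that pdf.to_json(...) produced in the report above.
buffer = '{"a":45461}\n{"a":55414}\n{"a":4544624}\n{"a":4546345}\n{"a":45445}'

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_ms_column(json_lines, column):
    """Hypothetical helper: read one integer column from JSON Lines text
    and interpret each value as milliseconds since the Unix epoch."""
    values = [json.loads(line)[column] for line in json_lines.splitlines() if line]
    return [EPOCH + timedelta(milliseconds=v) for v in values]


expected = parse_ms_column(buffer, "a")
for ts in expected:
    # Trim microseconds to milliseconds to match the DataFrame display.
    print(ts.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3])
```

The printed values line up with the cast-after-read result above, so the raw integers themselves are fine; the problem is in how read_json applies the requested dtype.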

Expected behavior

cudf.read_json should handle the dtype argument.

>>> df = cudf.read_json(buffer, compression='infer', lines=True, orient="records", dtype=['timestamp[ms]'])
>>> df

                        a
0 1970-01-01 00:00:45.461
1 1970-01-01 00:00:55.414
2 1970-01-01 01:15:44.624
3 1970-01-01 01:15:46.345
4 1970-01-01 00:00:45.445
>>>

Environment overview (please complete the following information)

Environment location: Bare-metal
Method of cuDF install: conda

github-actions[bot] commented 3 years ago

This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

vyasr commented 6 months ago

This still fails, but differently: it now errors during dtype detection of the timestamp type. The signature of read_json has also changed subtly, and dtype now needs to be a dict:

In [6]: >>> import cudf
   ...: >>> import pandas as pd
   ...: >>> pdf = pd.DataFrame({"a":[45461150050, 55414521000, 4544624522000, 4546345758000, 45445254600]}, dtype='datetime64[ms]')
   ...: >>> buffer = pdf.to_json(compression='infer', lines=True, orient="records")
   ...: >>> buffer
   ...: '{"a":45461}\n{"a":55414}\n{"a":4544624}\n{"a":4546345}\n{"a":45445}'
   ...: >>> df = cudf.read_json(buffer, compression='infer', lines=True, orient="records", dtype={"a": 'timestamp[ms]'})
   ...: >>> df
...

File ~/.conda/envs/rapids/lib/python3.10/site-packages/pandas/core/dtypes/common.py:1645, in pandas_dtype(dtype)
   1640     with warnings.catch_warnings():
   1641         # GH#51523 - Series.astype(np.integer) doesn't show
   1642         # numpy deprecation warning of np.integer
   1643         # Hence enabling DeprecationWarning
   1644         warnings.simplefilter("always", DeprecationWarning)
-> 1645         npdtype = np.dtype(dtype)
   1646 except SyntaxError as err:
   1647     # np.dtype uses `eval` which can raise SyntaxError
   1648     raise TypeError(f"data type '{dtype}' not understood") from err

TypeError: data type 'timestamp[ms]' not understood
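
The traceback ends in `np.dtype(dtype)`, and NumPy has no dtype named `timestamp[ms]`; that is the Arrow-style spelling, while NumPy spells the same type `datetime64[ms]`. A minimal sketch of translating one spelling to the other before building the dtype dict is below. `to_numpy_dtype_name` is a hypothetical helper, not part of cuDF, and this thread does not confirm whether read_json then parses the values correctly with the translated name.

```python
import re

# Arrow-style timestamp names such as "timestamp[ms]"; units limited to
# the ones NumPy's datetime64 also accepts in this sketch.
_ARROW_TIMESTAMP = re.compile(r"^timestamp\[(s|ms|us|ns)\]$")


def to_numpy_dtype_name(name):
    """Hypothetical helper: map "timestamp[unit]" to "datetime64[unit]".

    Any other dtype name is returned unchanged.
    """
    match = _ARROW_TIMESTAMP.match(name)
    return f"datetime64[{match.group(1)}]" if match else name


dtype = {"a": "timestamp[ms]"}
normalized = {col: to_numpy_dtype_name(t) for col, t in dtype.items()}
print(normalized)
```

The resulting dict could then be passed as `dtype=normalized`, which at least avoids the `data type 'timestamp[ms]' not understood` error raised by `np.dtype`.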

rapidsai / cudf

[BUG] `cudf.read_json` is incorrectly parsing TimeStamp typed columns #6382