rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] Orc test failure (test_orc_write_statistics) #7314

Closed shwina closed 5 months ago

shwina commented 3 years ago

The following tests fail reliably for me:

```
FAILED cudf/tests/test_orc.py::test_orc_write_statistics[100] - AssertionError: assert False
FAILED cudf/tests/test_orc.py::test_orc_write_statistics[6000000] - AssertionError: assert False
```

Stack trace:

```
============================= test session starts ==============================
platform linux -- Python 3.7.9, pytest-6.2.2, py-1.10.0, pluggy-0.13.1
benchmark: 3.2.3 (defaults: timer=time.perf_counter disable_gc=False min_rounds=1 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=0)
rapids_pytest_benchmark: 0.0.13
rootdir: /home/ashwin/workspace/rapids/cudf/python/cudf, configfile: setup.cfg
plugins: benchmark-3.2.3, rapids-pytest-benchmark-0.0.13, xdist-2.2.0, hypothesis-6.1.1, forked-1.3.0
collected 92 items / 89 deselected / 3 selected

cudf/tests/test_orc.py .FF                                               [100%]

=================================== FAILURES ===================================
________________________ test_orc_write_statistics[100] ________________________

tmpdir = local('/tmp/pytest-of-rapids/pytest-1/test_orc_write_statistics_100_0')
datadir = PosixPath('/home/ashwin/workspace/rapids/cudf/python/cudf/cudf/tests/data/orc')
nrows = 100

    @pytest.mark.parametrize("nrows", [1, 100, 6000000])
    def test_orc_write_statistics(tmpdir, datadir, nrows):
        supported_stat_types = supported_numpy_dtypes + ["str"]

        # Make a dataframe
        gdf = cudf.DataFrame(
            {
                "col_" + str(dtype): gen_rand_series(dtype, nrows, has_nulls=True)
                for dtype in supported_stat_types
            }
        )
        fname = tmpdir.join("gdf.orc")

        # Write said dataframe to ORC with cuDF
        gdf.to_orc(fname.strpath)

        # Read back written ORC's statistics
        orc_file = pa.orc.ORCFile(fname)
        (file_stats, stripes_stats,) = cudf.io.orc.read_orc_statistics(fname)

        # check file stats
        for col in gdf:
            if "minimum" in file_stats[col]:
                stats_min = file_stats[col]["minimum"]
                actual_min = gdf[col].min()
                assert normalized_equals(actual_min, stats_min)
            if "maximum" in file_stats[col]:
                stats_max = file_stats[col]["maximum"]
                actual_max = gdf[col].max()
                assert normalized_equals(actual_max, stats_max)

        # compare stripe statistics with actual min/max
        for stripe_idx in range(0, orc_file.nstripes):
            stripe = orc_file.read_stripe(stripe_idx)
            # pandas is unable to handle min/max of string col with nulls
            stripe_df = cudf.DataFrame(stripe.to_pandas())
            for col in stripe_df:
                if "minimum" in stripes_stats[stripe_idx][col]:
                    actual_min = stripe_df[col].min()
                    stats_min = stripes_stats[stripe_idx][col]["minimum"]
>                   assert normalized_equals(actual_min, stats_min)
E                   AssertionError: assert False
E                    +  where False = normalized_equals(numpy.datetime64('1970-05-30T18:27:14.488000000'), datetime.datetime(1970, 5, 30, 17, 27, 14, 488000, tzinfo=datetime.timezone.utc))

cudf/tests/test_orc.py:611: AssertionError
______________________ test_orc_write_statistics[6000000] ______________________

tmpdir = local('/tmp/pytest-of-rapids/pytest-1/test_orc_write_statistics_60000')
datadir = PosixPath('/home/ashwin/workspace/rapids/cudf/python/cudf/cudf/tests/data/orc')
nrows = 6000000

    @pytest.mark.parametrize("nrows", [1, 100, 6000000])
    def test_orc_write_statistics(tmpdir, datadir, nrows):
        [same test body as above]
                if "maximum" in stripes_stats[stripe_idx][col]:
                    actual_max = stripe_df[col].max()
                    stats_max = stripes_stats[stripe_idx][col]["maximum"]
>                   assert normalized_equals(actual_max, stats_max)
E                   AssertionError: assert False
E                    +  where False = normalized_equals(numpy.datetime64('2001-09-09T01:25:37.291000000'), datetime.datetime(2001, 9, 9, 0, 25, 37, 291000, tzinfo=datetime.timezone.utc))

cudf/tests/test_orc.py:616: AssertionError
=============================== warnings summary ===============================
../../../rapids-compose/etc/conda/cuda_10.1/envs/rapids/lib/python3.7/site-packages/numba/core/types/__init__.py:108
  DeprecationWarning: `np.long` is a deprecated alias for `np.compat.long`. To silence this warning, use `np.compat.long` by itself. In the likely event your code does not need to work on Python 2 you can use the builtin `int` for which `np.compat.long` is itself an alias. Doing this will not modify any behaviour and is safe. When replacing `np.long`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information. Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
    long_ = _make_signed(np.long)

../../../rapids-compose/etc/conda/cuda_10.1/envs/rapids/lib/python3.7/site-packages/numba/core/types/__init__.py:109
  [same `np.long` deprecation message as above]
    ulong = _make_unsigned(np.long)

../../../rapids-compose/etc/conda/cuda_10.1/envs/rapids/lib/python3.7/site-packages/numba/cuda/api.py:112
  DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here. Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
    def device_array(shape, dtype=np.float, strides=None, order='C', stream=0):

../../../rapids-compose/etc/conda/cuda_10.1/envs/rapids/lib/python3.7/site-packages/numba/cuda/api.py:124
  [same `np.float` deprecation message as above]
    def managed_array(shape, dtype=np.float, strides=None, order='C', stream=0,

../../../rapids-compose/etc/conda/cuda_10.1/envs/rapids/lib/python3.7/site-packages/numba/cuda/api.py:154
  [same `np.float` deprecation message as above]
    def pinned_array(shape, dtype=np.float, strides=None, order='C'):

../../../rapids-compose/etc/conda/cuda_10.1/envs/rapids/lib/python3.7/site-packages/numba/cuda/api.py:170
  [same `np.float` deprecation message as above]
    def mapped_array(shape, dtype=np.float, strides=None, order='C', stream=0,

../../../rapids-compose/etc/conda/cuda_10.1/envs/rapids/lib/python3.7/site-packages/numba/cuda/simulator/cudadrv/devicearray.py:313
  [same `np.float` deprecation message as above]
    def pinned_array(shape, dtype=np.float, strides=None, order='C'):

../../../rapids-compose/etc/conda/cuda_10.1/envs/rapids/lib/python3.7/site-packages/numba/cuda/simulator/cudadrv/devicearray.py:317
  [same `np.float` deprecation message as above]
    def managed_array(shape, dtype=np.float, strides=None, order='C'):

../../../rapids-compose/etc/conda/cuda_10.1/envs/rapids/lib/python3.7/site-packages/cupyx/scipy/sparse/_index.py:22
  DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here. Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
    _bool_scalar_types = (bool, numpy.bool, numpy.bool_)

cudf/utils/utils.py:196
  /home/ashwin/workspace/rapids/cudf/python/cudf/cudf/utils/utils.py:196: [same `np.bool` deprecation message as above]
    def get_null_series(size, dtype=np.bool):

cudf/tests/test_orc.py::test_orc_write_statistics[1]
cudf/tests/test_orc.py::test_orc_write_statistics[100]
cudf/tests/test_orc.py::test_orc_write_statistics[6000000]
  /home/ashwin/workspace/rapids/cudf/python/cudf/cudf/tests/utils.py:262: [same `np.bool` deprecation message as above]
    return np.random.randint(low=low, high=high, size=size).astype(np.bool)

cudf/tests/test_orc.py::test_orc_write_statistics[1]
  /home/ashwin/workspace/rapids/rapids-compose/etc/conda/cuda_10.1/envs/rapids/lib/python3.7/site-packages/pandas/util/__init__.py:12: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
    import pandas.util.testing

cudf/tests/test_orc.py: 15 warnings
  /home/ashwin/workspace/rapids/cudf/python/cudf/cudf/tests/test_orc.py:562: DeprecationWarning: parsing timezone aware datetimes is deprecated; this will raise an error in the future
    value2 = np.datetime64(value2, "ms")

-- Docs: https://docs.pytest.org/en/stable/warnings.html
=========================== short test summary info ============================
FAILED cudf/tests/test_orc.py::test_orc_write_statistics[100] - AssertionErro...
FAILED cudf/tests/test_orc.py::test_orc_write_statistics[6000000] - Assertion...
=========== 2 failed, 1 passed, 89 deselected, 29 warnings in 5.52s ============
```
github-actions[bot] commented 3 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

bdice commented 2 years ago

I still see local failures on these tests with the latest branch-22.02.

bdice commented 2 years ago

For col = "col_datetime64[ms]" I see a difference of an hour:

```
>>> actual_max
numpy.datetime64('2001-09-09T01:55:11.684000000')
>>> stats_max
datetime.datetime(2001, 9, 9, 0, 55, 11, 684000, tzinfo=datetime.timezone.utc)
```

I wonder if this is a DST issue or something weird like that. I see the same "off by one hour" bug in test_writer_timestamp_stream_size as well.
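If the writer-side zone observes DST, its UTC offset differs by an hour between winter and summer, which would produce exactly this skew. A quick sketch with Python's `zoneinfo` (America/Chicago is an assumption about the local zone here; the September date mirrors the failing value):

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo  # Python 3.9+

chicago = ZoneInfo("America/Chicago")  # assumed writer-side zone
winter = datetime(2001, 1, 15, tzinfo=chicago)  # CST, DST not active
summer = datetime(2001, 9, 9, tzinfo=chicago)   # CDT, DST active

print(winter.utcoffset())  # -1 day, 18:00:00 (UTC-6)
print(summer.utcoffset())  # -1 day, 19:00:00 (UTC-5)
print(summer.utcoffset() - winter.utcoffset())  # 1:00:00 -- the observed skew
```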

bdice commented 2 years ago

```
>>> expect.min()
_c0   1679-07-22 14:26:48.910277376
dtype: datetime64[ns]
>>> expect.loc[(expect != got).values].min()
_c0   1918-04-12 16:21:19.290896768
dtype: datetime64[ns]
```

The United States adopted daylight saving time in 1918: https://en.wikipedia.org/wiki/Daylight_saving_time

🤔

bdice commented 2 years ago

I can confirm that all the timestamps that pass the equality check are before 1918 OR (after 1918 AND in months when DST is not active). The timestamps that fail the equality check are all (after 1918 AND in months when DST is active).

Passing:

```
>>> got.loc[(expect == got).values].head(20)
_c0
0  1726-12-14 05:28:27.950277376
2  1772-10-25 04:57:04.040828992
3  2240-12-06 07:10:00.416241920
7  2043-12-05 05:41:45.725587072
12 2202-11-28 10:44:24.912241920
13 1836-09-30 14:26:55.601483840
14 2221-12-16 02:19:01.204587072
17 1875-02-07 23:34:10.171932224
18 1701-08-10 02:37:27.225483840
19 1756-09-18 23:13:14.606828992
21 1947-04-05 14:43:22.362448384
22 2193-11-03 19:31:02.534345152
23 2071-12-11 15:06:56.374896768
24 2198-12-27 14:37:49.142448384
25 2176-02-28 10:25:24.113828992
27 2228-02-22 14:51:58.464793536
29 2028-11-18 14:42:40.120345152
33 1791-04-04 20:47:05.787793536
34 1697-11-17 00:14:11.477828992
39 1768-05-05 17:35:22.858828992
```

Failing (ignore the NaT rows because NaT != NaT):

```
>>> got.loc[(expect != got).values].head(20)
_c0
1  2212-05-13 03:48:53.347138688
4  2164-10-05 18:55:36.266345152
5  2047-05-09 18:40:33.687896768
6  2085-09-08 08:24:20.249380608
8                            NaT
9  2111-10-25 12:44:13.615896768
10 2043-03-11 09:41:44.077448384
11 1992-06-08 16:43:58.214241920
15 1943-11-21 00:31:30.609587072
16 2217-04-21 22:12:28.090448384
20 2053-08-25 04:43:17.684932224
26 2242-05-13 08:58:41.036138688
28 2015-04-14 04:38:16.134241920
30                           NaT
31 2197-07-10 15:46:38.144793536
32 2072-08-14 01:35:57.770380608
35 1951-06-18 22:07:41.504793536
36 2178-04-17 12:26:21.656483840
37 2134-10-24 07:17:33.455896768
38 2101-04-28 05:54:35.066448384
```

One possible exception: 1943-11-21 00:31:30.609587072 seems weird, right? Well, the United States observed DST all of 1943.
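The classification above can be checked mechanically against the tz database, which encodes both the 1918 adoption of DST and the 1943 year-round "War Time" (CWT, flagged as DST in tzdata). A sketch, again assuming the writer-side zone was America/Chicago:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

chicago = ZoneInfo("America/Chicago")  # assumed writer-side zone

def dst_active(ts: datetime) -> bool:
    # Interpret a naive timestamp as UTC, view it in the local zone, and
    # ask whether that zone was observing DST at that instant.
    return bool(ts.replace(tzinfo=timezone.utc).astimezone(chicago).dst())

print(dst_active(datetime(1875, 2, 7)))    # False: before DST existed
print(dst_active(datetime(2001, 9, 9)))    # True: CDT
print(dst_active(datetime(1943, 11, 21)))  # True: year-round War Time (CWT)
```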

cc: @vuule @devavret

bdice commented 2 years ago

I discussed this with @vuule today. I have produced a minimal test case that disagrees between PyArrow's file reader and cudf.read_orc. This uses cudf 95909982d2, pandas 1.3.5, and pyarrow 5.0.0.

```python
import cudf
import datetime
import pandas as pd
import pyarrow

# Write an ORC file with a timestamp using cuDF and PyArrow
pdf = pd.DataFrame(
    {"dst_timestamp": [pd.Timestamp("1981-05-18 21:00:08.262378")]}
)
cudf.DataFrame(pdf).to_orc("dst_timestamp_cudf.orc")
pyarrow.orc.write_table(
    pyarrow.Table.from_pandas(pdf), "dst_timestamp_pyarrow.orc"
)

# Read each file with PyArrow and cuDF
for filename in ("dst_timestamp_cudf.orc", "dst_timestamp_pyarrow.orc"):
    orcfile = pyarrow.orc.ORCFile(filename)
    pdf = cudf.DataFrame(orcfile.read().to_pandas())
    print(f"PyArrow reading {filename}")
    print(pdf)
    gdf = cudf.read_orc(filename)
    print(f"cuDF reading {filename}")
    print(gdf)
    print(f"Difference (PyArrow - cuDF), {filename}")
    print(pdf - gdf)
    print()

# I have confirmed that PyArrow and cudf agree if the system is in the UTC time
# zone, but not if the system is in a time zone that observes DST like
# America/Chicago (Central, which is currently CST).
print("Current timezone:")
print(datetime.datetime.now().astimezone().tzinfo)
```

Output:

```
PyArrow reading dst_timestamp_cudf.orc
               dst_timestamp
0 1981-05-18 22:00:08.262378
cuDF reading dst_timestamp_cudf.orc
               dst_timestamp
0 1981-05-18 21:00:08.262378
Difference (PyArrow - cuDF), dst_timestamp_cudf.orc
    dst_timestamp
0 0 days 01:00:00

PyArrow reading dst_timestamp_pyarrow.orc
               dst_timestamp
0 1981-05-18 21:00:08.262378
cuDF reading dst_timestamp_pyarrow.orc
               dst_timestamp
0 1981-05-18 21:00:08.262378
Difference (PyArrow - cuDF), dst_timestamp_pyarrow.orc
  dst_timestamp
0        0 days

Current timezone:
CST
```

From this, we can narrow it down to a specific case: PyArrow does not read the correct data from a file written by cuDF.
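Since the behavior depends on the process's local zone, the repro can be flipped between zones without reconfiguring the machine. A sketch (POSIX-only; whether liborc honors `TZ` rather than reading only /etc/localtime is an assumption worth verifying):

```python
import os
import time

# Pin the process's local time zone before running the repro script,
# instead of changing the system configuration. POSIX-only: time.tzset()
# re-reads TZ into the C library; it does not exist on Windows.
os.environ["TZ"] = "America/Chicago"
time.tzset()
print(time.tzname)  # ('CST', 'CDT')
```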

I have a hypothesis about the reason for this, from earlier discussion with @vuule. Using the ORC tools, I get the following:

```
$ orc-contents dst_timestamp_cudf.orc
{"dst_timestamp": "1981-05-18 21:00:08.262378"}
$ orc-contents dst_timestamp_pyarrow.orc
{"dst_timestamp": "1981-05-18 21:00:08.262378"}
```

The above snippet shows that the orc-contents (reference?) reader sees the same timestamp in both files. But the metadata differs:

```
$ orc-metadata dst_timestamp_cudf.orc
{ "name": "dst_timestamp_cudf.orc",
  "type": "struct<dst_timestamp:timestamp>",
  "attributes": {},
  "rows": 1,
  "stripe count": 1,
  "format": "0.12", "writer version": "original", "software version": "ORC Java",
  "compression": "none",
  "file length": 192,
  "content": 75, "stripe stats": 26, "footer": 73, "postscript": 17,
  "row index stride": 10000,
  "user metadata": {
  },
  "stripes": [
    { "stripe": 0, "rows": 1,
      "offset": 3, "length": 72,
      "index": 12, "data": 15, "footer": 45
    }
  ]
}
$ orc-metadata dst_timestamp_pyarrow.orc
{ "name": "dst_timestamp_pyarrow.orc",
  "type": "struct<dst_timestamp:timestamp>",
  "attributes": {},
  "rows": 1,
  "stripe count": 1,
  "format": "0.12", "writer version": "ORC-135", "software version": "ORC C++ 1.7.1",
  "compression": "zlib", "compression block": 65536,
  "file length": 294,
  "content": 130, "stripe stats": 37, "footer": 100, "postscript": 23,
  "row index stride": 10000,
  "user metadata": {
  },
  "stripes": [
    { "stripe": 0, "rows": 1,
      "offset": 3, "length": 130,
      "index": 53, "data": 27, "footer": 50
    }
  ]
}
```

Notably, the file written by pyarrow says "writer version": "ORC-135" while the file written by cuDF says "writer version": "original".

The ORCv1 spec notes:

> Before ORC-135, the local timezone offset was included and they were stored as minimum and maximum. After ORC-135, the timestamp is adjusted to UTC before being converted to milliseconds and stored in minimumUtc and maximumUtc.

ORC-135 is the corresponding upstream issue.
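The two encodings can be sketched numerically. This is an illustrative reading of the spec text, not the actual serialization code, and it assumes a writer in America/Chicago (where 1981-05-18 falls inside CDT, UTC-5):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# The timestamp from the minimal repro, as an instant in UTC.
ts = datetime(1981, 5, 18, 21, 0, 8, 262378, tzinfo=timezone.utc)

# Post-ORC-135: minimumUtc stores UTC milliseconds directly.
minimum_utc = int(ts.timestamp() * 1000)

# Pre-ORC-135: the stored value included the writer's local offset, so a
# DST-observing zone drifts by an extra hour in summer.
offset = ts.astimezone(ZoneInfo("America/Chicago")).utcoffset()
minimum_local = minimum_utc + int(offset.total_seconds() * 1000)

print(minimum_utc - minimum_local)  # 18000000 ms = 5 hours (CDT offset)
```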

I hope this is helpful diagnostic information - I am not yet sure where to look next. I am also unsure about whether to treat this as a bug in cuDF's metadata writing, or a bug in pyarrow's metadata reading.

vuule commented 2 years ago

Thanks @bdice! We can try setting the version to "ORC-135" in the writer, see if that fixes it. But yeah, need to verify what the version string should be.

bdice commented 2 years ago

@vuule I tried setting the writerVersion to 6 as directed in this protobuf spec: https://github.com/apache/orc/blob/cf720d7b2333618c652445299486cb9600bdbe4f/proto/orc_proto.proto#L402-L435

Commit is here: https://github.com/rapidsai/cudf/commit/1983f7b429a7b46f98827c043c5ede81b0d50305

The orc-metadata tool now sees "writer version": "ORC-135" in dst_timestamp_cudf.orc, but the bug remains when using PyArrow to read the cuDF-generated file:

```
PyArrow Read Data, dst_timestamp_cudf.orc
               dst_timestamp
0 1981-05-18 22:00:08.262378
```

Perhaps this is a bug in PyArrow? At least, I think the next step to resolve this will involve reading the PyArrow source.

vuule commented 2 years ago

I went over the minimal repro code: statistics are not read there, so the writer version should not affect the output. From that code it looks like pyarrow is affected by the current time zone in the same way for both statistics and column data, which IMO means the issue is unrelated to statistics/writer version. My only guess ATM is that pyarrow is somehow not reading the file time zone correctly, or not handling the "UTC" time zone correctly on some systems. @bdice can you please try setting a different time zone in https://github.com/rapidsai/cudf/blob/branch-22.02/cpp/src/io/orc/writer_impl.cu#L1966 ? It would be useful to see whether it affects the output, and whether the current time zone still affects the output in that case.

gmarkall commented 2 years ago

I've found that when PyArrow reads the ORC file on my system ("GMT" timezone ID), the time is advanced by exactly one hour for a datetime on 1970-01-01, and by one hour and one second for a datetime on 1969-12-31. This uses values from the test_nanoseconds_overflow test in test_orc.py:

```
In [14]: pyarrow_got.to_pandas()
Out[14]:
                              s
0 1970-01-01 01:00:00.710424008
1 1970-01-01 00:59:59.661517360

In [15]: expected.to_pandas()
Out[15]:
                              s
0 1970-01-01 00:00:00.710424008
1 1969-12-31 23:59:58.661517360
```

I also noticed that the writer sets the timezone in the stripe footer to UTC regardless of what the current timezone is (and the reader also assumes that it is UTC, so cudf is consistent with itself). If I make the following change (a quick hack, because my local timezone ID is GMT):

```diff
diff --git a/cpp/src/io/orc/writer_impl.cu b/cpp/src/io/orc/writer_impl.cu
index b0e674c206..c69099d8ce 100644
--- a/cpp/src/io/orc/writer_impl.cu
+++ b/cpp/src/io/orc/writer_impl.cu
@@ -1963,7 +1963,7 @@ void writer::impl::write(table_view const& table)
           (sf.columns[i].kind == DICTIONARY_V2)
             ? orc_table.column(i - 1).host_stripe_dict(stripe_id)->num_strings
             : 0;
-        if (orc_table.column(i - 1).orc_kind() == TIMESTAMP) { sf.writerTimezone = "UTC"; }
+        if (orc_table.column(i - 1).orc_kind() == TIMESTAMP) { sf.writerTimezone = "GMT"; }
       }
       buffer_.resize((compression_kind_ != NONE) ? 3 : 0);
       pbw_.write(sf);
```

then all the ORC tests pass. Prior to making this change, I was seeing:

```
FAILED test_orc.py::test_orc_write_statistics[100] - AssertionError: assert False
FAILED test_orc.py::test_orc_write_statistics[6000000] - AssertionError: assert False
FAILED test_orc.py::test_nanoseconds_overflow - AssertionError: numpy array are different
FAILED test_orc.py::test_writer_timestamp_stream_size - AssertionError: numpy array are different
```

I did hypothesise that PyArrow was doing something wrong, but I noticed that orc-contents displayed the same times as the PyArrow-read timestamps:

```
(rapids) rapids@compose:~/compose$ orc-contents bad-dates.orc
{"s": "1970-01-01 01:00:00.710424008"}
{"s": "1970-01-01 00:59:59.66151736"}
```

If I do things the other way and write out an ORC file using Pandas / PyArrow, then read it in to cudf I get wrong timestamps in cudf:

```python
import cudf
import pyarrow as pa
import pandas as pd
from io import BytesIO

buffer = BytesIO()
s = pd.Series([710424008, -1338482640], dtype="datetime64[ns]")
df = pd.DataFrame({"s": s})
table = pa.Table.from_pandas(df)
writer = pa.orc.ORCWriter('arrow_dates.orc')
writer.write(table)
writer.close()

with open('arrow_dates.orc', 'rb') as f:
    cudf_got = cudf.read_orc(f)

with open('arrow_dates.orc', 'rb') as f:
    pyarrow_got = pa.orc.ORCFile(f).read()

print(cudf_got)
print(pyarrow_got.column(0))
```

gives:

```
0 1970-01-01 00:00:00.710424008
1 2043-01-25 23:56:46.875211312
[
  [
    1970-01-01 00:00:00.710424008,
    1969-12-31 23:59:58.661517360
  ]
]
```

I presume the date in 2043 came from some wraparound of the nanoseconds (the test these values come from is called test_nanoseconds_overflow, from which I infer that :slightly_smiling_face:).

gmarkall commented 2 years ago

Also, with my patch from the previous comment, the only difference this makes to the written ORC file is to change "UTC" to "GMT", but unfortunately orc-metadata doesn't show the difference:

```
(rapids) rapids@compose:~/compose$ orc-metadata dates.orc
{ "name": "dates.orc",
  "type": "struct<s:timestamp>",
  "attributes": {},
  "rows": 2,
  "stripe count": 1,
  "format": "0.12", "writer version": "original", "software version": "ORC Java",
  "compression": "none",
  "file length": 171,
  "content": 82, "stripe stats": 18, "footer": 53, "postscript": 17,
  "row index stride": 10000,
  "user metadata": {
  },
  "stripes": [
    { "stripe": 0, "rows": 2,
      "offset": 3, "length": 79,
      "index": 12, "data": 22, "footer": 45
    }
  ]
}
(rapids) rapids@compose:~/compose$ orc-metadata bad-dates.orc
{ "name": "bad-dates.orc",
  "type": "struct<s:timestamp>",
  "attributes": {},
  "rows": 2,
  "stripe count": 1,
  "format": "0.12", "writer version": "original", "software version": "ORC Java",
  "compression": "none",
  "file length": 171,
  "content": 82, "stripe stats": 18, "footer": 53, "postscript": 17,
  "row index stride": 10000,
  "user metadata": {
  },
  "stripes": [
    { "stripe": 0, "rows": 2,
      "offset": 3, "length": 79,
      "index": 12, "data": 22, "footer": 45
    }
  ]
}
```

Only orc-contents shows two different sets of values:

```
(rapids) rapids@compose:~/compose$ orc-contents dates.orc
{"s": "1970-01-01 00:00:00.710424008"}
{"s": "1969-12-31 23:59:58.66151736"}
(rapids) rapids@compose:~/compose$ orc-contents bad-dates.orc
{"s": "1970-01-01 01:00:00.710424008"}
{"s": "1970-01-01 00:59:59.66151736"}
```


bdice commented 2 years ago

Here is a zip file containing two ORC files and the minimal Python script to produce them.

dst_timestamp_issue.zip

I can reproduce this failure with orc-contents (provided by orc=1.7.1 on conda-forge) from inside my rapids-compose container, when the system time zone is set to a time zone that observes DST (America/Chicago).

```
$ orc-contents dst_timestamp_pyarrow.orc  # Expected result, written by PyArrow
{"dst_timestamp": "1981-05-18 21:00:08.262378"}
$ orc-contents dst_timestamp_cudf.orc  # Undesired result, written by cuDF
{"dst_timestamp": "1981-05-18 22:00:08.262378"}
```

Changing any of the following appears to hide the problem:

I have also eliminated the possibility that this is PyArrow's fault (Arrow dynamically links to liborc.so and has only a thin wrapper around the ORC C++ library for reading/writing, so it's definitely not Arrow's problem).

Next, there are two immediate things to try:

bdice commented 2 years ago

> Investigate how ORC performs its timezone conversions and see if there is a difference in the time zone that my container sees vs. what the system outside the container sees...?

The ORC library reads /etc/localtime and /usr/share/zoneinfo when fetching timezone info: https://github.com/apache/orc/blob/043a827783c2d21ebf7f28355f66f0bb97db0834/c%2B%2B/src/Timezone.cc#L32-L36

I tried printing the timezone's information after it is fetched here https://github.com/apache/orc/blob/043a827783c2d21ebf7f28355f66f0bb97db0834/c%2B%2B/src/Reader.cc#L729-L732 by adding these lines below:

```cpp
std::cout << "Getting writer timezone..." << std::endl;
writerTimezone.print(std::cout);
```

I got the following:

```
Getting writer timezone...
Timezone file: /usr/share/zoneinfo/UTC
  Version: 2
  Future rule: CST6CDT,M3.2.0,M11.1.0
  standard CST -21600
  dst CDT -18000 (dst)
  start month 3 week 2 day 0 at 2:0:0
  end month 11 week 1 day 0 at 2:0:0
  Variant 0: LMT -21036
  Variant 1: CDT -18000 (dst)
  Variant 2: CST -21600
  Variant 3: EST -18000
  Variant 4: CWT -18000 (dst)
  Variant 5: CPT -18000 (dst)
  Variant 6: CST -21600
  Transition: 1883-11-18 18:00:00 (-2717647200) -> CST
  Transition: 1918-03-31 08:00:00 (-1633276800) -> CDT
  Transition: 1918-10-27 07:00:00 (-1615136400) -> CST
  Transition: 1919-03-30 08:00:00 (-1601827200) -> CDT
  Transition: 1919-10-26 07:00:00 (-1583686800) -> CST
  Transition: 1920-06-13 08:00:00 (-1563724800) -> CDT
  Transition: 1920-10-31 07:00:00 (-1551632400) -> CST
  Transition: 1921-03-27 08:00:00 (-1538928000) -> CDT
  Transition: 1921-10-30 07:00:00 (-1520182800) -> CST
  Transition: 1922-04-30 08:00:00 (-1504454400) -> CDT
  ...
```

I conclude that there's something going wrong in the way rapids-compose attempts to match the host time zone through mounting /etc/timezone and /etc/localtime. ORC sees a freaky hybrid time zone that has UTC's name and America/Chicago's DST offsets (CST/CDT). I'll need to experiment a bit more to figure out how to solve that issue. However, uncovering the problem in rapids-compose mounts doesn't let cuDF off the hook just yet, because PyArrow's file can be read correctly. This is likely due to some difference in the metadata.
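A helper along these lines can confirm the mismatch: byte-compare the mounted /etc/localtime against the tz database entry for the zone it claims to be. The helper and the Debian-style paths in the usage comment are illustrative, not part of cuDF or ORC:

```python
import filecmp
from pathlib import Path

def zone_file_matches(localtime: str, zoneinfo_dir: str, name: str) -> bool:
    """Byte-compare an installed localtime file against the named tzdb entry."""
    return filecmp.cmp(localtime, str(Path(zoneinfo_dir) / name), shallow=False)

# e.g., on a Debian-style system (paths are an assumption):
# zone_file_matches("/etc/localtime", "/usr/share/zoneinfo",
#                   Path("/etc/timezone").read_text().strip())
```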

vyasr commented 6 months ago

@bdice is this reproducible in contexts outside of compose? IIRC we need a specific container configuration to observe this, right? I haven't seen this in a long time (it doesn't show up in devcontainers), but I don't recall the severity of the underlying issue (is it data corruption?) to know whether it's important enough to keep open.

bdice commented 6 months ago

I haven’t seen this outside of compose. I do not see this with devcontainers. It’s possibly some weird behavior involving time zone support in Docker. I’m happy to close this.

gmarkall commented 5 months ago

I also don't see this in devcontainers. Glad it's not showing up anymore, even if it's not clear exactly why!