This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
I still see local failures on these tests with the latest branch-22.02.
For col = "col_datetime64[ms]", I see a difference of an hour:
>>> actual_max
numpy.datetime64('2001-09-09T01:55:11.684000000')
>>> stats_max
datetime.datetime(2001, 9, 9, 0, 55, 11, 684000, tzinfo=datetime.timezone.utc)
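For concreteness, the comparison being made can be reduced to a few lines (a sketch using the two values printed above; normalizing the stats value to a naive datetime64 is my own step, not the test's):
import datetime
import numpy as np

# Values copied from the REPL output above.
actual_max = np.datetime64("2001-09-09T01:55:11.684000000")
stats_max = datetime.datetime(2001, 9, 9, 0, 55, 11, 684000,
                              tzinfo=datetime.timezone.utc)

# Drop the (UTC) tzinfo so the stats value can be compared as a datetime64.
stats_as_np = np.datetime64(stats_max.replace(tzinfo=None))
print(actual_max - stats_as_np)  # prints a one-hour timedelta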
I wonder if this is a DST issue or something weird like that. I see the same "off by one hour" bug in test_writer_timestamp_stream_size as well.
>>> expect.min()
_c0 1679-07-22 14:26:48.910277376
dtype: datetime64[ns]
>>> expect.loc[(expect != got).values].min()
_c0 1918-04-12 16:21:19.290896768
dtype: datetime64[ns]
The United States adopted daylight saving time in 1918: https://en.wikipedia.org/wiki/Daylight_saving_time 🤔
I can confirm that all the timestamps that pass the equality check are before 1918 OR (after 1918 AND in months when DST is not active). The timestamps that fail the equality check are all (after 1918 AND in months when DST is active).
Passing:
>>> got.loc[(expect == got).values].head(20)
_c0
0 1726-12-14 05:28:27.950277376
2 1772-10-25 04:57:04.040828992
3 2240-12-06 07:10:00.416241920
7 2043-12-05 05:41:45.725587072
12 2202-11-28 10:44:24.912241920
13 1836-09-30 14:26:55.601483840
14 2221-12-16 02:19:01.204587072
17 1875-02-07 23:34:10.171932224
18 1701-08-10 02:37:27.225483840
19 1756-09-18 23:13:14.606828992
21 1947-04-05 14:43:22.362448384
22 2193-11-03 19:31:02.534345152
23 2071-12-11 15:06:56.374896768
24 2198-12-27 14:37:49.142448384
25 2176-02-28 10:25:24.113828992
27 2228-02-22 14:51:58.464793536
29 2028-11-18 14:42:40.120345152
33 1791-04-04 20:47:05.787793536
34 1697-11-17 00:14:11.477828992
39 1768-05-05 17:35:22.858828992
Failing (ignore the NaT rows because NaT != NaT):
>>> got.loc[(expect != got).values].head(20)
_c0
1 2212-05-13 03:48:53.347138688
4 2164-10-05 18:55:36.266345152
5 2047-05-09 18:40:33.687896768
6 2085-09-08 08:24:20.249380608
8 NaT
9 2111-10-25 12:44:13.615896768
10 2043-03-11 09:41:44.077448384
11 1992-06-08 16:43:58.214241920
15 1943-11-21 00:31:30.609587072
16 2217-04-21 22:12:28.090448384
20 2053-08-25 04:43:17.684932224
26 2242-05-13 08:58:41.036138688
28 2015-04-14 04:38:16.134241920
30 NaT
31 2197-07-10 15:46:38.144793536
32 2072-08-14 01:35:57.770380608
35 1951-06-18 22:07:41.504793536
36 2178-04-17 12:26:21.656483840
37 2134-10-24 07:17:33.455896768
38 2101-04-28 05:54:35.066448384
One possible exception: 1943-11-21 00:31:30.609587072 seems weird, right? (November is normally outside DST.) Well, the United States observed DST for all of 1943.
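As a rough way to sanity-check this in code (my own sketch; the use of Python's zoneinfo and the America/Chicago zone are assumptions, since the actual check isn't shown here):
import pandas as pd
from zoneinfo import ZoneInfo  # Python 3.9+

CHICAGO = ZoneInfo("America/Chicago")

def dst_active(ts: pd.Timestamp) -> bool:
    # Interpret the naive wall-clock time as America/Chicago local time and
    # ask whether a DST offset applies at that instant.
    return bool(ts.to_pydatetime().replace(tzinfo=CHICAGO).dst())

print(dst_active(pd.Timestamp("1947-04-05 14:43:22")))  # passing row above -> False
print(dst_active(pd.Timestamp("1992-06-08 16:43:58")))  # failing row above -> True
print(dst_active(pd.Timestamp("1943-11-21 00:31:30")))  # wartime year-round DST -> True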
cc: @vuule @devavret
I discussed this with @vuule today. I have produced a minimal test case that shows a disagreement between PyArrow's file reader and cudf.read_orc. This uses cudf 95909982d2, pandas 1.3.5, and pyarrow 5.0.0.
import cudf
import datetime
import pandas as pd
import pyarrow
import pyarrow.orc  # the orc submodule is not loaded by "import pyarrow" alone
# Write an ORC file with a timestamp using cuDF and PyArrow
pdf = pd.DataFrame(
    {"dst_timestamp": [pd.Timestamp("1981-05-18 21:00:08.262378")]}
)
cudf.DataFrame(pdf).to_orc("dst_timestamp_cudf.orc")
pyarrow.orc.write_table(
    pyarrow.Table.from_pandas(pdf), "dst_timestamp_pyarrow.orc"
)
# Read each file with PyArrow and cuDF
for filename in ("dst_timestamp_cudf.orc", "dst_timestamp_pyarrow.orc"):
    orcfile = pyarrow.orc.ORCFile(filename)
    pdf = cudf.DataFrame(orcfile.read().to_pandas())
    print(f"PyArrow reading {filename}")
    print(pdf)
    gdf = cudf.read_orc(filename)
    print(f"cuDF reading {filename}")
    print(gdf)
    print(f"Difference (PyArrow - cuDF), {filename}")
    print(pdf - gdf)
    print()
# I have confirmed that PyArrow and cudf agree if the system is in the UTC time
# zone, but not if the system is in a time zone that observes DST like
# America/Chicago (Central, which is currently CST).
print("Current timezone:")
print(datetime.datetime.now().astimezone().tzinfo)
Output:
PyArrow reading dst_timestamp_cudf.orc
dst_timestamp
0 1981-05-18 22:00:08.262378
cuDF reading dst_timestamp_cudf.orc
dst_timestamp
0 1981-05-18 21:00:08.262378
Difference (PyArrow - cuDF), dst_timestamp_cudf.orc
dst_timestamp
0 0 days 01:00:00
PyArrow reading dst_timestamp_pyarrow.orc
dst_timestamp
0 1981-05-18 21:00:08.262378
cuDF reading dst_timestamp_pyarrow.orc
dst_timestamp
0 1981-05-18 21:00:08.262378
Difference (PyArrow - cuDF), dst_timestamp_pyarrow.orc
dst_timestamp
0 0 days
Current timezone:
CST
From this, we can narrow it down to a specific case: PyArrow does not read the correct data from a file written by cuDF.
I have a hypothesis about the reason for this, from earlier discussion with @vuule. Using the ORC tools, I get the following:
$ orc-contents dst_timestamp_cudf.orc
{"dst_timestamp": "1981-05-18 21:00:08.262378"}
$ orc-contents dst_timestamp_pyarrow.orc
{"dst_timestamp": "1981-05-18 21:00:08.262378"}
The above snippet shows that the orc-contents reader (the reference implementation?) sees the same timestamp in both files. But the metadata differs:
$ orc-metadata dst_timestamp_cudf.orc
{ "name": "dst_timestamp_cudf.orc",
"type": "struct<dst_timestamp:timestamp>",
"attributes": {},
"rows": 1,
"stripe count": 1,
"format": "0.12", "writer version": "original", "software version": "ORC Java",
"compression": "none",
"file length": 192,
"content": 75, "stripe stats": 26, "footer": 73, "postscript": 17,
"row index stride": 10000,
"user metadata": {
},
"stripes": [
{ "stripe": 0, "rows": 1,
"offset": 3, "length": 72,
"index": 12, "data": 15, "footer": 45
}
]
}
$ orc-metadata dst_timestamp_pyarrow.orc
{ "name": "dst_timestamp_pyarrow.orc",
"type": "struct<dst_timestamp:timestamp>",
"attributes": {},
"rows": 1,
"stripe count": 1,
"format": "0.12", "writer version": "ORC-135", "software version": "ORC C++ 1.7.1",
"compression": "zlib", "compression block": 65536,
"file length": 294,
"content": 130, "stripe stats": 37, "footer": 100, "postscript": 23,
"row index stride": 10000,
"user metadata": {
},
"stripes": [
{ "stripe": 0, "rows": 1,
"offset": 3, "length": 130,
"index": 53, "data": 27, "footer": 50
}
]
}
Notably, the file written by PyArrow says "writer version": "ORC-135" while the file written by cuDF says "writer version": "original".
The ORCv1 spec is here and notes:
Before ORC-135, the local timezone offset was included and they were stored as minimum and maximum. After ORC-135, the timestamp is adjusted to UTC before being converted to milliseconds and stored in minimumUtc and maximumUtc.
Here's the issue for ORC-135.
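To make the spec's distinction concrete, here is a small illustration of the described behavior (not the writer's actual code; the -5 hour offset is an assumed example for a DST-observing writer):
from datetime import datetime, timedelta, timezone

ts = datetime(1981, 5, 18, 21, 0, 8, 262378, tzinfo=timezone.utc)
local_offset = timedelta(hours=-5)  # assumed writer offset (e.g. CDT)

# Post-ORC-135: the value is adjusted to UTC before being stored (minimumUtc).
millis_utc = int(ts.timestamp() * 1000)
# Pre-ORC-135: the writer's local offset is folded into the stored value
# (minimum); the sign convention here is chosen just for illustration.
millis_local = int((ts + local_offset).timestamp() * 1000)

print(millis_local - millis_utc)  # -18000000 ms, i.e. the five-hour offset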
I hope this is helpful diagnostic information - I am not yet sure where to look next. I am also unsure about whether to treat this as a bug in cuDF's metadata writing, or a bug in pyarrow's metadata reading.
Thanks @bdice! We can try setting the version to "ORC-135" in the writer and see if that fixes it. But yeah, we need to verify what the version string should be.
@vuule I tried setting the writerVersion to 6 as directed in this protobuf spec: https://github.com/apache/orc/blob/cf720d7b2333618c652445299486cb9600bdbe4f/proto/orc_proto.proto#L402-L435
Commit is here: https://github.com/rapidsai/cudf/commit/1983f7b429a7b46f98827c043c5ede81b0d50305
The orc-metadata tool now sees "writer version": "ORC-135" in dst_timestamp_cudf.orc, but the bug remains when using PyArrow to read the cuDF-generated file:
PyArrow Read Data, dst_timestamp_cudf.orc
dst_timestamp
0 1981-05-18 22:00:08.262378
Perhaps this is a bug in PyArrow? At least, I think the next step to resolve this will involve reading the PyArrow source.
I went over the minimal repro code - statistics are not read there, so the writer version should not affect the output. From this code it looks like PyArrow is affected by the current time zone in the same way for both statistics and column data. IMO this means that the issue is unrelated to statistics/writer version. My only guess ATM is that PyArrow is somehow not reading the file time zone correctly, or not handling the "UTC" time zone correctly on some systems. @bdice can you please try setting a different time zone in https://github.com/rapidsai/cudf/blob/branch-22.02/cpp/src/io/orc/writer_impl.cu#L1966? It would be useful to see whether it affects the output, and whether the current time zone still affects the output in that case.
I've found that when PyArrow reads the ORC file on my system ("GMT" timezone ID), the time is advanced by exactly one hour for a datetime on 1970-01-01, and by one hour and one second for a datetime on 1969-12-31. This is using values from test_nanoseconds_overflow in test_orc.py:
In [14]: pyarrow_got.to_pandas()
Out[14]:
s
0 1970-01-01 01:00:00.710424008
1 1970-01-01 00:59:59.661517360
In [15]: expected.to_pandas()
Out[15]:
s
0 1970-01-01 00:00:00.710424008
1 1969-12-31 23:59:58.661517360
I also noticed that the writer sets the timezone in the stripe footer to UTC regardless of what the current timezone is (and the reader also assumes that it is UTC, so cudf is consistent with itself). If I make the following change (a quick hack, because my local timezone ID is GMT):
diff --git a/cpp/src/io/orc/writer_impl.cu b/cpp/src/io/orc/writer_impl.cu
index b0e674c206..c69099d8ce 100644
--- a/cpp/src/io/orc/writer_impl.cu
+++ b/cpp/src/io/orc/writer_impl.cu
@@ -1963,7 +1963,7 @@ void writer::impl::write(table_view const& table)
(sf.columns[i].kind == DICTIONARY_V2)
? orc_table.column(i - 1).host_stripe_dict(stripe_id)->num_strings
: 0;
- if (orc_table.column(i - 1).orc_kind() == TIMESTAMP) { sf.writerTimezone = "UTC"; }
+ if (orc_table.column(i - 1).orc_kind() == TIMESTAMP) { sf.writerTimezone = "GMT"; }
}
buffer_.resize((compression_kind_ != NONE) ? 3 : 0);
pbw_.write(sf);
then all the ORC tests pass. Prior to making this change, I was seeing:
FAILED test_orc.py::test_orc_write_statistics[100] - AssertionError: assert False
FAILED test_orc.py::test_orc_write_statistics[6000000] - AssertionError: assert False
FAILED test_orc.py::test_nanoseconds_overflow - AssertionError: numpy array are different
FAILED test_orc.py::test_writer_timestamp_stream_size - AssertionError: numpy array are different
I did hypothesise that PyArrow was doing something wrong, but I noticed that orc-contents displayed the same times as the PyArrow-read timestamps:
(rapids) rapids@compose:~/compose$ orc-contents bad-dates.orc
{"s": "1970-01-01 01:00:00.710424008"}
{"s": "1970-01-01 00:59:59.66151736"}
If I do things the other way around and write out an ORC file using Pandas / PyArrow, then read it into cudf, I get wrong timestamps in cudf:
import cudf
import pyarrow as pa
import pyarrow.orc  # the orc submodule is not loaded by "import pyarrow" alone
import pandas as pd
from io import BytesIO

buffer = BytesIO()
s = pd.Series([710424008, -1338482640], dtype="datetime64[ns]")
df = pd.DataFrame({"s": s})
table = pa.Table.from_pandas(df)
writer = pa.orc.ORCWriter('arrow_dates.orc')
writer.write(table)
writer.close()
with open('arrow_dates.orc', 'rb') as f:
    cudf_got = cudf.read_orc(f)
with open('arrow_dates.orc', 'rb') as f:
    pyarrow_got = pa.orc.ORCFile(f).read()
print(cudf_got)
print(pyarrow_got.column(0))
gives:
0 1970-01-01 00:00:00.710424008
1 2043-01-25 23:56:46.875211312
[
[
1970-01-01 00:00:00.710424008,
1969-12-31 23:59:58.661517360
]
]
I presume the date in 2043 came from some wraparound with nanoseconds (the test these values are from is called test_nanoseconds_overflow, from which I infer that :slightly_smiling_face:).
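A rough back-of-the-envelope check of that guess (my own arithmetic on the two values printed above; it does not identify where the wraparound happens):
import numpy as np

expected = np.datetime64("1969-12-31T23:59:58.661517360")
got = np.datetime64("2043-01-25T23:56:46.875211312")

diff_ns = (got - expected).astype("int64")  # both parse as datetime64[ns]
print(diff_ns)    # ~2.3e18 nanoseconds
print(2 ** 61)    # 2305843009213693952
# The difference is within one second of 2**61 ns (~73 years), which is at
# least consistent with a nanosecond-scale integer wrapping somewhere.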
Also, with my patch from the previous comment, the only difference this makes to the written ORC file is to change "UTC" to "GMT", but unfortunately orc-metadata doesn't show the difference:
(rapids) rapids@compose:~/compose$ orc-metadata dates.orc
{ "name": "dates.orc",
"type": "struct<s:timestamp>",
"attributes": {},
"rows": 2,
"stripe count": 1,
"format": "0.12", "writer version": "original", "software version": "ORC Java",
"compression": "none",
"file length": 171,
"content": 82, "stripe stats": 18, "footer": 53, "postscript": 17,
"row index stride": 10000,
"user metadata": {
},
"stripes": [
{ "stripe": 0, "rows": 2,
"offset": 3, "length": 79,
"index": 12, "data": 22, "footer": 45
}
]
}
(rapids) rapids@compose:~/compose$ orc-metadata bad-dates.orc
{ "name": "bad-dates.orc",
"type": "struct<s:timestamp>",
"attributes": {},
"rows": 2,
"stripe count": 1,
"format": "0.12", "writer version": "original", "software version": "ORC Java",
"compression": "none",
"file length": 171,
"content": 82, "stripe stats": 18, "footer": 53, "postscript": 17,
"row index stride": 10000,
"user metadata": {
},
"stripes": [
{ "stripe": 0, "rows": 2,
"offset": 3, "length": 79,
"index": 12, "data": 22, "footer": 45
}
]
}
Only orc-contents shows two different sets of values:
(rapids) rapids@compose:~/compose$ orc-contents dates.orc
{"s": "1970-01-01 00:00:00.710424008"}
{"s": "1969-12-31 23:59:58.66151736"}
(rapids) rapids@compose:~/compose$ orc-contents bad-dates.orc
{"s": "1970-01-01 01:00:00.710424008"}
{"s": "1970-01-01 00:59:59.66151736"}
Here is a zip file containing two ORC files and the minimal Python script to produce them.
I can reproduce this failure with orc-contents (provided by orc=1.7.1 on conda-forge) from inside my rapids-compose container, when the system time zone is set to a time zone that observes DST (America/Chicago).
$ orc-contents dst_timestamp_pyarrow.orc # Expected result, written by PyArrow
{"dst_timestamp": "1981-05-18 21:00:08.262378"}
$ orc-contents dst_timestamp_cudf.orc # Undesired result, written by cuDF
{"dst_timestamp": "1981-05-18 22:00:08.262378"}
Changing any of the following appears to hide the problem:
- Using Etc/UTC rather than America/Chicago passes
- Using orc-contents (provided by orc=1.7.1 on conda-forge) outside the container passes
I have also eliminated the possibility that this is PyArrow's fault (Arrow dynamically links to liborc.so and has only a thin wrapper around the ORC C++ library for reading/writing, so it's definitely not Arrow's problem).
Next, there are two immediate things to try:
Investigate how ORC performs its timezone conversions and see if there is a difference in the time zone that my container sees vs. what the system outside the container sees...?
The ORC library reads /etc/localtime and /usr/share/zoneinfo when fetching timezone info: https://github.com/apache/orc/blob/043a827783c2d21ebf7f28355f66f0bb97db0834/c%2B%2B/src/Timezone.cc#L32-L36
I tried printing the timezone information after it is fetched here https://github.com/apache/orc/blob/043a827783c2d21ebf7f28355f66f0bb97db0834/c%2B%2B/src/Reader.cc#L729-L732 by adding these lines:
std::cout << "Getting writer timezone..." << std::endl;
writerTimezone.print(std::cout);
I got the following:
Getting writer timezone...
Timezone file: /usr/share/zoneinfo/UTC
Version: 2
Future rule: CST6CDT,M3.2.0,M11.1.0
standard CST -21600
dst CDT -18000 (dst)
start month 3 week 2 day 0 at 2:0:0
end month 11 week 1 day 0 at 2:0:0
Variant 0: LMT -21036
Variant 1: CDT -18000 (dst)
Variant 2: CST -21600
Variant 3: EST -18000
Variant 4: CWT -18000 (dst)
Variant 5: CPT -18000 (dst)
Variant 6: CST -21600
Transition: 1883-11-18 18:00:00 (-2717647200) -> CST
Transition: 1918-03-31 08:00:00 (-1633276800) -> CDT
Transition: 1918-10-27 07:00:00 (-1615136400) -> CST
Transition: 1919-03-30 08:00:00 (-1601827200) -> CDT
Transition: 1919-10-26 07:00:00 (-1583686800) -> CST
Transition: 1920-06-13 08:00:00 (-1563724800) -> CDT
Transition: 1920-10-31 07:00:00 (-1551632400) -> CST
Transition: 1921-03-27 08:00:00 (-1538928000) -> CDT
Transition: 1921-10-30 07:00:00 (-1520182800) -> CST
Transition: 1922-04-30 08:00:00 (-1504454400) -> CDT
...
I conclude that there's something going wrong in the way rapids-compose attempts to match the host time zone through mounting /etc/timezone and /etc/localtime. ORC sees a freaky hybrid time zone that has UTC's name and America/Chicago's DST offsets (CST/CDT). I'll need to experiment a bit more to figure out how to solve that issue. However, uncovering the problem in rapids-compose mounts doesn't let cuDF off the hook just yet, because PyArrow's file can be read correctly. This is likely due to some difference in the metadata.
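For anyone debugging a similar setup, a quick way to check whether the mounted zone files are self-consistent is to compare them directly. This is only a sketch (the paths are the ones the ORC C++ reader consults; the specific comparisons are my own choice):
import filecmp

# If the file named "UTC" turns out to be byte-identical to America/Chicago,
# that would explain the hybrid zone printed above.
pairs = [
    ("/usr/share/zoneinfo/UTC", "/usr/share/zoneinfo/America/Chicago"),
    ("/etc/localtime", "/usr/share/zoneinfo/UTC"),
]
for a, b in pairs:
    try:
        same = filecmp.cmp(a, b, shallow=False)
    except OSError as exc:
        print(f"{a} vs {b}: could not compare ({exc})")
    else:
        print(f"{a} vs {b}: {'identical' if same else 'different'}")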
@bdice is this reproducible in contexts outside of compose? IIRC we need a specific container configuration to observe this, right? I haven't seen this in a long time (it doesn't show up in devcontainers), but I don't recall the severity of the underlying issue (is it data corruption?) to know whether it's important enough to keep open.
I haven’t seen this outside of compose. I do not see this with devcontainers. It’s possibly some weird behavior involving time zone support in Docker. I’m happy to close this.
I also don't see this in devcontainers. Glad it's not showing up anymore, even if it's not clear exactly why!
The following tests fail reliably for me:
Stack trace: