pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
29.14k stars 1.83k forks source link

"Unable to parse time zone" when reading a specific Parquet file (Parquet doesn't store time zones) #9586

Closed sm-Fifteen closed 1 year ago

sm-Fifteen commented 1 year ago

Polars version checks

Issue description

I have a parquet file generated by an external program that Polars is having trouble with with because of a timezone issue. I'm really not sure what's wrong with this file, because I believe Parquet stores its timestamps as int64 values and doesn't record any timezone information besides whether it's naïve or UTC. Polars can read it just fine, but if I try to work on the resulting dataframe, I get a pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: "'+00:00' is not a valid timezone" error or something similar, depending on the operation.

Here's the file in question, sample_weird_tz.parquet, which has been reduced to one row and one column to better show the problem, alongside another parquet file containing an identical but working dataframe, generated by Polars, sample_df_polars_identical.parquet. That second file gets read just fine and shows that it's a problem specific to something about sample_weird_tz, not Parquet file reading as a whole.

Reproducible example

import polars

parquet_df = polars.read_parquet(r"sample_weird_tz.parquet")
print(parquet_df)
"""
shape: (1, 1)
┌──────────────────────┐
│ UTC_DATETIME_ID      │
│ ---                  │
│ datetime[ns, +00:00] │
╞══════════════════════╡
│ invalid timezone     │
└──────────────────────┘
"""
parquet_df.with_columns([polars.col("UTC_DATETIME_ID").dt.replace_time_zone("UTC")])
# "exceptions.ComputeError: unable to parse time zone: '+00:00'"

import pandas
fixed_parquet_df = polars.from_pandas(pandas.read_parquet("sample_weird_tz.parquet"))
print(fixed_parquet_df )
"""
shape: (1, 1)
┌─────────────────────────┐
│ UTC_DATETIME_ID         │
│ ---                     │
│ datetime[ns, UTC]       │
╞═════════════════════════╡
│ 2023-06-26 14:15:00 UTC │
└─────────────────────────┘
"""

Expected behavior

The specific error from my code example points to this line:

https://github.com/pola-rs/polars/blob/44c093cd5f763555888e5e02765886b2f66b360f/polars/polars-arrow/src/kernels/time.rs#L105

I would certainly not expect Polars to panic when trying to manipulate the dataframe coming out of this file. Pandas (or rather fastparquet) seems to be able to read it just fine.

Installed versions

``` --------Version info--------- Polars: 0.18.2 Index type: UInt32 Platform: Windows-10-10.0.14393-SP0 Python: 3.9.13 (tags/v3.9.13:6de2ca5, May 17 2022, 16:36:42) [MSC v.1929 64 bit (AMD64)] ----Optional dependencies---- numpy: 1.22.4 pandas: 2.0.2 pyarrow: connectorx: deltalake: fsspec: matplotlib: 3.4.3 xlsx2csv: xlsxwriter: 0.9.6 ```
MarcoGorelli commented 1 year ago

thanks for the report - looks like use_pyarrow=True handles this one:

In [3]: parquet_df = polars.read_parquet(r"sample_weird_tz.parquet", use_pyarrow=True)
   ...: print(parquet_df)
shape: (1, 1)
┌─────────────────────────┐
│ UTC_DATETIME_ID         │
│ ---                     │
│ datetime[ns, UTC]       │
╞═════════════════════════╡
│ 2023-06-26 14:15:00 UTC │
└─────────────────────────┘

will take a look

MarcoGorelli commented 1 year ago

Right, looks like the file's schema contains an offset in the place of a time zone:

In [16]: pl.read_parquet_schema("sample_weird_tz.parquet")
Out[16]: {'UTC_DATETIME_ID': Datetime(time_unit='ns', time_zone='+00:00')}

This should probably error earlier, rather than parsing with an invalid schema

MarcoGorelli commented 1 year ago

Something you could do for now is to override the schema:

In [20]: pl.scan_parquet("sample_weird_tz.parquet").with_columns(pl.col('UTC_DATETIME_ID').cast(pl.Datetime('ns', 'UTC'))).collect()
Out[20]:
shape: (1, 1)
┌─────────────────────────┐
│ UTC_DATETIME_ID         │
│ ---                     │
│ datetime[ns, UTC]       │
╞═════════════════════════╡
│ 2023-06-26 14:15:00 UTC │
└─────────────────────────┘
sm-Fifteen commented 1 year ago

Right, looks like the file's schema contains an offset in the place of a time zone:

In [16]: pl.read_parquet_schema("sample_weird_tz.parquet")
Out[16]: {'UTC_DATETIME_ID': Datetime(time_unit='ns', time_zone='+00:00')}

This should probably error earlier, rather than parsing with an invalid schema

That doesn't sound right,.. Parquet can't store timezone info according to their own doc, as far as I can tell. Are you sure it's not added by Arrow-rs at some point when reading?

MarcoGorelli commented 1 year ago

ooh, well-spotted, thanks!

MarcoGorelli commented 1 year ago

I mean, it would be arrow2, but still, this line

https://github.com/pola-rs/polars/blob/3282c744bbaeca382fcbf865a45f2f5a6eb4f65f/polars/polars-io/src/parquet/read.rs#L193

is already making a Field with datatype Timestamp(Nanosecond, Some("+00:00"))

sm-Fifteen commented 1 year ago

@MarcoGorelli: Arrow says that offset are just as valid timezone specifiers as IANA timezone names and that +00:00 is just as valid as UTC or Etc/UTC. Fixing the issue in arrow2 won't neccessarily keep it from coming up in various ways later, if the root cause is that Polars can't handle timezone specifiers that Arrow considers valid.

MarcoGorelli commented 1 year ago

Thanks for the link

It's intentional to only support IANA time zone names, see: https://github.com/pola-rs/polars/issues/9103#issuecomment-1567479148

If it's only for the sake of read_parquet, then maybe this can be worked around within polars

If other issues come up, then maybe FixedOffset timezones will need to come back, but I'm hoping we don't need to get there

sm-Fifteen commented 1 year ago

It's intentional to only support IANA time zone names, see: #9103 (comment)

If it's only for the sake of read_parquet, then maybe this can be worked around within polars

If other issues come up, then maybe FixedOffset timezones will need to come back, but I'm hoping we don't need to get there

I can understand why fixed offsets might cause issues, espeically among something as complicated as date handling. There's always the possibility of handling these like Parquet does, by just applying the offset so they're all UTC and losing the offset information. I can't imagine such offsets would be that common of an occurence anyway, and having a special case to alias +00:00 to UTC for the performance would probably work well enough as an alternative to just rejecting them outright.

MarcoGorelli commented 1 year ago

by just applying the offset so they're all UTC

yup! this is what's done in to_datetime

the issue is that here '+00:00' is part of the time zone...

doing this works

--- a/polars/polars-core/src/series/from.rs
+++ b/polars/polars-core/src/series/from.rs
@@ -198,8 +198,9 @@ impl Series {
                 let mut tz = tz.clone();
                 if tz.as_deref() == Some("") {
                     tz = None;
+                } else if tz.as_deref() == Some("+00:00") {
+                    tz = Some("UTC".to_string());
                 }
MarcoGorelli commented 1 year ago

Parquet can't store timezone info according to their own doc, as far as I can tell. Are you sure it's not added by Arrow-rs at some point when reading?

Will look into this, but for the second file, it's read just fine:

In [31]: pl.read_parquet_schema("sample_weird_tz.parquet")
Out[31]: {'UTC_DATETIME_ID': Datetime(time_unit='ns', time_zone='+00:00')}

In [32]: pl.read_parquet_schema("sample_df_polars_identical.parquet")
Out[32]: {'UTC_DATETIME_ID': Datetime(time_unit='us', time_zone='UTC')}
MarcoGorelli commented 1 year ago

I think it does store timezone info:

In [38]: pd.DataFrame({'ts': pd.date_range('2000', periods=3, tz='+01:00')}).to_parquet('tmp.parquet')

In [39]: pl.read_parquet_schema("tmp.parquet")
Out[39]: {'ts': Datetime(time_unit='ns', time_zone='+01:00')}

So I think the fix here is to raise a loud and informative error if the time zone is one that chrono-tz can't parse, informing users of how to work around it

kavorite commented 1 year ago

So I think the fix here is to raise a loud and informative error if the time zone is one that chrono-tz can't parse, informing users of how to work around it

A "loud and informative" error is what I just got in my code that has been working for months now where I explicitly try to convert the timezone to None from inside of my Python script. The .cast() invocation that you suggested earlier is not working either, which means I have to fall back to reading the file eagerly and do two passes until this is fixed. The error is loud and informative, indeed, but aside from bringing me to this issue, it's also unhelpful.

MarcoGorelli commented 1 year ago

thanks for the report, I'll see what can be done

MarcoGorelli commented 1 year ago

@kavorite just for my understanding, how are you ending up with +00:00 in the parquet dtype in the first place?

kavorite commented 1 year ago

@MarcoGorelli encoded using parquet-go

MarcoGorelli commented 1 year ago

thanks! does that one also produce dtype with timezones like +01:00, or is it just +00:00 that it may produce for UTC?

kavorite commented 1 year ago

only UTC afaik, I've actually tried using it with timezone-aware time.Time instances in my encoder, but it seems to encode them naively regardless

MarcoGorelli commented 1 year ago

thanks - OK may it's fine to just convert +00:00 to UTC internally then, just as is done for "" => None

MarcoGorelli commented 1 year ago

@sm-Fifteen how about you, how did you end up with +00:00 in the dtype in the parquet file? Is it only +00:00 that needs to be taken care of?

MarcoGorelli commented 1 year ago

@MarcoGorelli: Arrow says that offset are just as valid timezone specifiers as IANA timezone names and that +00:00 is just as valid as UTC or Etc/UTC

Are you sure this is always true? For example, looks like pyarrow doesn't allow '+00:00' for arrays

In [55]: pc.assume_timezone(pa.array([datetime(2020, 1, 1)]), '+00:00')
---------------------------------------------------------------------------
ArrowInvalid: Cannot locate timezone '+00:00': +00:00 not found in timezone database

@kavorite would it be possible to save using UTC as the column's timezone in parquet-go while or before encoding?

It wouldn't technically be difficult to map '+00:00' to 'UTC' within polars, I'm just trying to be very careful before doing such special-casing

sm-Fifteen commented 1 year ago

@sm-Fifteen how about you, how did you end up with +00:00 in the dtype in the parquet file? Is it only +00:00 that needs to be taken care of?

It's the output of some big ETL data processing tool that uses "parquet-cpp-arrow version 7.0.0" under the hood, judging by the file footer. It's at least partly a bug on their part, because my source data actually contained naïve datetimes, and the tool in question assumed (correctly, as it happens) that these were UTC timestamps. I don't know how they configured the writer on their end, but the Parquet decoder Polars uses seems to somehow recognize a fixed offset of "+00:00" in the metadata. I'm not entirely sure if it's because there's some metadata field in the file that actually indicates this specific offset or if it's injected by the decoder.

kavorite commented 1 year ago

@MarcoGorelli

It wouldn't technically be difficult to map '+00:00' to 'UTC' within polars, I'm just trying to be very careful before doing such special-casing

i don’t know if that’s what pyarrow does internally, but i do know that if I use the pyarrow decoder then loading this file Just Works, so I suspect the condition you used may not check whether date times are valid, but instead whether date time conversion is. Also recall that there are ways to correct my pain points beyond just special-casing, even if the “special case” is an explicit carve out made in other implementations. If the interface to override time zone information altogether actually did its job, none of this would be a problem for me. The problem isn’t that polars is crashing, the problem is that the fix to make polars’ “bad behavior” explicit in this case also won’t make it work. I can’t even cast to the physical type and back: Despite any protests, polars always finishes having its crisis of identity over my data that is actually serialized correctly before ever encountering these explicit conversions and hence the entire lazy loading API becomes inaccessible anyway. Isn’t that also technically applicable to my poor experience? It doesn’t seem that fixing such a thing requires any “special casing.”

MarcoGorelli commented 1 year ago

thanks for your responses

this is what I meant by special-casing: https://github.com/pola-rs/polars/pull/9747/files


The issue with not validating the time zone right away and just letting it be written by the lazy interface is that, as reported in the first message here, you might end up with

shape: (1, 1)
┌──────────────────────┐
│ UTC_DATETIME_ID      │
│ ---                  │
│ datetime[ns, +00:00] │
╞══════════════════════╡
│ invalid timezone     │
└──────────────────────┘

, on which all datetime computations would fail (though maybe there is a way around this?)

I'm suggesting that just treating +00:00 and UTC as the same would probably be the most user-friendly here, especially if users end up with +00:00 for reasons outside their control (e.g. if other software puts that as the time zone for UTC)

kavorite commented 1 year ago

, on which all datetime computations would fail (though maybe there is a way around this?)

last time I checked I was able to successfully convert this to a None time zone and then replace with the one that I knew was applicable a priori, and after that all of the normal time series operations worked fine

there was some regression between 0.18.1 and 0.18.5 that made this code just crash

MarcoGorelli commented 1 year ago

Yeah it was due to #9598, which prevented invalid time zones from being present when loading

If +00:00 is the only time zone that needs to be taken care of, then I think #9747 would solve the issue. Then, you could just read your parquet file without issues

Could you confirm whether +00:00 is the only timezone which may end up with when reading a parquet file which is causing issues?

If not, then #9598 could just be reverted, no big deal

kavorite commented 1 year ago

+00:00 is the only one that my application needs for the moment

sm-Fifteen commented 1 year ago

Yeah it was due to #9598, which prevented invalid time zones from being present when loading

The problem is that "+00:00" isn't actually invalid. It's perfectly valid according to the Arrow spec, it's just that Polars doesn't handle it. The mistake here was assuming that any timezone specifier that Polars can't handle is invalid.

MarcoGorelli commented 1 year ago

Ok, it prevented time zones not present in the time zone database from being present when loading

Point is - is it only +00:00 we need to worry about, because some libraries use that to encode UTC? If so, https://github.com/pola-rs/polars/pull/9747 is a simple fix. If not, then https://github.com/pola-rs/polars/pull/9598 can be reverted

I'd suggest starting with #9747 , and then, if other issues arise, just revert #9598

sm-Fifteen commented 1 year ago

@MarcoGorelli: Arrow says that offset are just as valid timezone specifiers as IANA timezone names and that +00:00 is just as valid as UTC or Etc/UTC

Are you sure this is always true? For example, looks like pyarrow doesn't allow '+00:00' for arrays

In [55]: pc.assume_timezone(pa.array([datetime(2020, 1, 1)]), '+00:00')
---------------------------------------------------------------------------
ArrowInvalid: Cannot locate timezone '+00:00': +00:00 not found in timezone database

FYI, it's referred to as an "Offset date-time" in the stream format spec. It's very much part of the Arrow format (regardless of whether or not it's part of the Parquet format).

EDIT: Actually, given the way datetimes are stored in Arrow, you don't even need to understand the timezone string (which can be arbitrary and has room to be extended) in order to work with the dates. To try and be flexible with the data you accept, could just have an operation (dt.use_utc_value()??) or a flag to tell Polars to assume unknown timezone offsets should be ignored and to only consider the UTC value of the data.

Timestamps with a non-empty timezone

If a Timestamp column has a non-empty timezone value, its epoch is 1970-01-01 00:00:00 (January 1st 1970, midnight) in the UTC timezone (the Unix epoch), regardless of the Timestamp's own timezone.

Therefore, timestamp values with a non-empty timezone correspond to physical points in time together with some additional information about how the data was obtained and/or how to display it (the timezone).

For example, the timestamp value 0 with the timezone string "Europe/Paris" corresponds to "January 1st 1970, 00h00" in the UTC timezone, but the application may prefer to display it as "January 1st 1970, 01h00" in the Europe/Paris timezone (which is the same physical point in time).

One consequence is that timestamp values with a non-empty timezone can be compared and ordered directly, since they all share the same well-known point of reference (the Unix epoch).

[...]

Conversion between timezones

If a Timestamp column has a non-empty timezone, changing the timezone to a different non-empty value is a metadata-only operation: the timestamp values need not change as their point of reference remains the same (the Unix epoch).

Timezone info is mainly intended for printing and shouldn't affect how Polars handles anything, AFAICT.

MarcoGorelli commented 1 year ago

Are you sure? As far I can tell from the link you posted, it's part of the schema for a single Timestamp

e.g. pyarrow allows a scalar to have a fixed offset

In [10]: pa.scalar(datetime(2012, 1, 1), type=pa.timestamp('s', tz='+01:00'))
Out[10]: <pyarrow.TimestampScalar: datetime.datetime(2012, 1, 1, 1, 0, tzinfo=pytz.FixedOffset(60))>

but it doesn't allow an array to have that as the dtype's timezone:

In [13]: pc.assume_timezone(pa.array([datetime(2020, 1, 1)]), '+00:00')
---------------------------------------------------------------------------
ArrowInvalid: Cannot locate timezone '+00:00': +00:00 not found in timezone database
MarcoGorelli commented 1 year ago

Timezone info is mainly intended for printing and shouldn't affect how Polars handles anything, AFAICT.

Not sure what you mean, sorry. Does't timezone info also affect daylight savings offsets and calendar durations (such as "1 day", which may be 23 hours, 24 hours, or 25 hours, depending on whether DST is being crossed)?

sm-Fifteen commented 1 year ago

Timezone info is mainly intended for printing and shouldn't affect how Polars handles anything, AFAICT.

Not sure what you mean, sorry. Does't timezone info also affect daylight savings offsets and calendar durations (such as "1 day", which may be 23 hours, 24 hours, or 25 hours, depending on whether DST is being crossed)?

According to the Arrow format definitions, all timezone-aware timestamps are stored as "instants" (64-bit signed integer of s/ms/µs/ns elapsed since the UNIX epoch) and can be assumed to all be UTC TAI (because UTC acknowledges leap seconds and POSIX time is just a linear number of seconds), with the timezone string being considered as informative and possibly subject to future extension. With regards to how they affect days-to-seconds duration handling, though, you may have a point, but even then, that's not true of all or even most operations.

Are you sure? As far I can tell from the link you posted, it's part of the schema for a single Timestamp

I wonder how PyArrow handles standalone scalars; as far as I can tell that's not something Arrow does. From what I'm reading, a "scalar timestamp" with no column metadata would just be an integer assumed to be a number of seconds since the UNIX epoch. Maybe it's just a one-row, one-column dataframe?

EDIT: It's apparently a wrapped integer scalar and timestamp metadata (unit and tz) stored together with a bunch of timezone parsing being done just for the Python bindings so the timestamp can be returned as a Python datatime if needed. I wasn't quite right with my one-row column, but it's functionally the same thing. PyArrow's scalar function could therefore throw if you were to use a nonsensical timezone, but mainly because it needs to be able to return valid Python datetimes.

MarcoGorelli commented 1 year ago

Thanks for looking into this

but even then, that's not true of all or even most operations.

Sure, but it's true for replace_time_zone (which is the one you originally reported), right?

sm-Fifteen commented 1 year ago

e.g. pyarrow allows a scalar to have a fixed offset

In [10]: pa.scalar(datetime(2012, 1, 1), type=pa.timestamp('s', tz='+01:00'))
Out[10]: <pyarrow.TimestampScalar: datetime.datetime(2012, 1, 1, 1, 0, tzinfo=pytz.FixedOffset(60))>

but it doesn't allow an array to have that as the dtype's timezone:

In [13]: pc.assume_timezone(pa.array([datetime(2020, 1, 1)]), '+00:00')
---------------------------------------------------------------------------
ArrowInvalid: Cannot locate timezone '+00:00': +00:00 not found in timezone database

@MarcoGorelli: That might actually be an issue with PyArrow. The assume_timezone doc doesn't mention it, but the error message matches LocateZone, leading to here. LocateZone seemgly only tries to match against tzdb to validate the timezone, so it would fail when encountering time offsets. That looks like it's worth reporting.

Thanks for looking into this

but even then, that's not true of all or even most operations.

Sure, but it's true for replace_time_zone (which is the one you originally reported), right?

No, because replace_time_zone() is supposed to ignore the existing timezone data and assume You're right, actually, it assumes the naïve representation doesn't change, not the moment referred to by the timestamp. That's actually a fairly subtle distinction, come to think of it. replace_time_zone() just happenned to be the simplest way to demonstrate the issue, but I actually encountered it when tryung to convert timestamps to a specific machine-readable string format for outputting.

Assuming timestamps could be allowed to contain unknown (by either Polars or PyArrow) timezones or even technically arbitrary strings as part of the timezone descriptor is some kind of "ignore unknown timezone name and only UTC value" escape hatch, but communicating this properly to users would be tricky (because "ignoring timezone" tends to mean you fall back to naïve, not to a stored UTC value), and a simple warning could lead to strange pitfalls when the date-time representation becomes important (mainly formatting and conversion to naïve date-times).

Or I might just be otherthinking this given the initial issue was "+00:00 isn't recognized as UTC".

sm-Fifteen commented 1 year ago

@MarcoGorelli: Should this be left open with regards to more general handling of timeone strings, or would it be preferable to move that to a separate issue, if that's even considered something Polars should want to handle? Or should we just reccomend that users with any other timezone specifier just cast their column to any IANA timezone like you suggested here and not try to handle strange cases within the library?

MarcoGorelli commented 1 year ago

do you mean, having fixed offsets in the datatype? that was intentionally removed some time ago https://github.com/pola-rs/polars/issues/8860

not having fixed offsets as datatypes has several advantages, see https://github.com/pola-rs/polars/issues/9103#issuecomment-1567479148

do you have a specific use case where you need them?

sm-Fifteen commented 1 year ago

do you mean, having fixed offsets in the datatype? that was intentionally removed some time ago #8860

do you have a specific use case where you need them?

More like having an escape hatch for otherwise invalid (either fixed offsets or actually unknown string values) timezone specifiers. I don't really have a use case for those myself, I'm really more asking for the sake of robustness and not just panicking when encountering an unrecognized string.

not having fixed offsets as datatypes has several advantages, see #9103 (comment)

I understand that columns can only have a single timezone (for fairly good reasons), which saves having to normalize these when constructing dataframes and having to guess what the original timezone was, but I don't believe not supporting these to the point of panicking when a fixed offset is encountered is the best solution there is.

For the "every timestamp in this column could have its own offset" situation, I could see having some sort of write_source_tz_to_col={'dt_col1': 'tz_col1', 'dt_col2': 'tz_col2'} parameter (working title, I know it's not great), to allow Polars to still agressively normalize fixed/incoherent/heterogenous offset datetimes to UTC when converting row data to column data, while also not losing the original timezone info in case users want to reapply the offset later when rendering the value into a localized date-time. A motivated user could even use that to assert that the fixed offset data actually corresponds to the offsets of the 'Europe/Amsterdam' zone and apply it to the column.

sm-Fifteen commented 10 months ago

@MarcoGorelli: I'm bringing this back up given you've pinged me on https://github.com/pola-rs/polars/issues/12149#issuecomment-1790940258 and that other issue also has to do with handling timezone offsets, where being able to keep track of how they vary on a per row basis can be important. The other issue is about truncating timestamps on the DST fold and handling ambiguous values, where the entire column is set to a IANA timezone that respects some form of DST. It makes sense to perform truncation by going through naïve datetime conversion, but if "offset-only" timestamps are all normalized to UTC on load, they would all get truncted relative to UTC and the user would have no means of performing the operation while accounting for their original offset.

I'm considering raising this as a new issue instead, but I still think having some mean of splitting and storing the original timezone specifier/offset of each row in a separate column for operations that may need it to be applied at some point, like rounding and formatting, (and maybe allowing an "external timestamp offset column" to be specified for said operations) would prove useful when dealing with timestamps in row-based formats.

inigohidalgo commented 7 months ago

Hi!

Not sure if I should open a new issue, as a lot of the discussion in this thread goes over my head, but I am having a problem with a similar issue:

A pandas dataframe is being saved to parquet. It contains a column of type datetime64[ns, pytz.FixedOffset(60)] which, when reading using polars lazy mode is throwing the following error

ComputeError: unable to parse time zone: '+01:00'. Please check the Time Zone Database for a list of available time zones

Is this part of the same issue you were discussing here? Should I open a new one?

Thanks

MarcoGorelli commented 7 months ago

thanks, should be addressed by https://github.com/pola-rs/polars/pull/13738

but I'd really suggest, when you create your pandas dataframe, to set the time zone to the Area/Location one which your timestamps come from

inigohidalgo commented 7 months ago

Thank you very much for the suggestion, that is definitely the most correct way to go about things. It is the result of an extraction from a system through pandas and directly dumped to parquet, but I will see if I can get that change included there.