Open david-waterworth opened 11 months ago
Sounds like a good suggestion, using a standardised string language is probably better for the ecosystem than using a Polars-only string language
Most of it would be quite low-effort to support (ISO8601 durations are very similar to the Polars string duration language), but the major difference I see is that ISO8601 durations allow for decimals, like 'P1.3D'. Isn't that a bit ambiguous? I'm not sure what exactly 1.3D means - how many hours is that, especially if it's on a DST transition?
The following, however, should be feasible and unambiguous. ISO8601 durations, but:
- for units larger than H, [n] has to be a whole integer; decimals are not allowed
- for H, M, and S, [n] can have decimals, and they will be truncated at the given time unit. For example, pl.col('ts').dt.offset_by('PT1.1234S') advances by 1 second and 123400 microseconds if 'ts' is of time unit 'us', and by 1 second and 123 milliseconds if 'ts' is of time unit 'ms'
- Q is not accepted to mean "quarter"; "P3M" is probably fine for quarters?

Examples:
- pl.col('ts').dt.offset_by('P1M') advances by 1 calendar month
- pl.col('ts').dt.offset_by('P1D') advances by 1 calendar day
- pl.col('ts').dt.offset_by('P1.3D') raises, as 1.3 is not a whole number
- pl.col('ts').dt.offset_by('P1DT3H2.123456789S') advances by 1 calendar day, 3 hours, 2 seconds, and 123456789 nanoseconds

Does this seem reasonable, and would it solve your use case?
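The rules proposed above can be sketched in plain Python. This is only an illustration of the proposal, not Polars' parser: the name parse_iso_duration, the (months, days, ticks) return shape, and the choice to accept W alongside other units are all assumptions made for the example.

```python
import re
from decimal import Decimal

# Hypothetical helper for illustration only; this is not Polars' parser.
_NUM = r"\d+(?:\.\d+)?"
_ISO_DURATION = re.compile(
    rf"P(?:(?P<Y>{_NUM})Y)?(?:(?P<Mo>{_NUM})M)?(?:(?P<W>{_NUM})W)?(?:(?P<D>{_NUM})D)?"
    rf"(?:T(?:(?P<H>{_NUM})H)?(?:(?P<Mi>{_NUM})M)?(?:(?P<S>{_NUM})S)?)?"
)
_NS_PER_UNIT = {"H": 3_600 * 10**9, "Mi": 60 * 10**9, "S": 10**9}
_NS_PER_TICK = {"ns": 1, "us": 10**3, "ms": 10**6}


def parse_iso_duration(text, time_unit="us"):
    """Return (months, days, ticks), with ticks counted in `time_unit`."""
    match = _ISO_DURATION.fullmatch(text)
    parts = {k: v for k, v in match.groupdict().items() if v is not None} if match else {}
    if not parts or text.endswith("T"):
        raise ValueError(f"not a valid ISO 8601 duration: {text!r}")
    # Rule 1: units larger than H must be whole numbers.
    for unit in ("Y", "Mo", "W", "D"):
        if "." in parts.get(unit, ""):
            raise ValueError(f"{parts[unit]} is not a whole number in {text!r}")
    months = 12 * int(parts.get("Y", 0)) + int(parts.get("Mo", 0))
    days = 7 * int(parts.get("W", 0)) + int(parts.get("D", 0))
    # Rule 2: H, M and S may carry decimals; Decimal keeps the arithmetic exact.
    nanos = sum(Decimal(parts[u]) * _NS_PER_UNIT[u] for u in _NS_PER_UNIT if u in parts)
    ticks = int(nanos // _NS_PER_TICK[time_unit])  # truncate to the time unit
    return months, days, ticks


print(parse_iso_duration("PT1.1234S", "us"))  # (0, 0, 1123400)
print(parse_iso_duration("PT1.1234S", "ms"))  # (0, 0, 1123)
```

Keeping months and days separate from the sub-day ticks mirrors the distinction drawn above between calendar units and fixed-length units.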
I wasn't aware that they allowed decimals either - it doesn't seem necessary/useful (or unambiguous) in my opinion.
Also interesting that the Wikipedia article considered PT36H and P1DT12H to behave differently wrt daylight savings. Perhaps the standard covers this in more detail, might be worth further investigation?
But in general what you propose looks fine to me.
> interesting that the Wikipedia article considered PT36H and P1DT12H to behave differently wrt daylight savings

So does Polars, so there'd be nothing to change here.
In [14]: df.with_columns(
    ...:     b=pl.col.a.dt.offset_by('36h'),
    ...:     c=pl.col.a.dt.offset_by('1d12h'),
    ...: )
Out[14]:
shape: (1, 3)
┌─────────────────────────┬─────────────────────────┬─────────────────────────┐
│ a                       ┆ b                       ┆ c                       │
│ ---                     ┆ ---                     ┆ ---                     │
│ datetime[μs,            ┆ datetime[μs,            ┆ datetime[μs,            │
│ Europe/London]          ┆ Europe/London]          ┆ Europe/London]          │
╞═════════════════════════╪═════════════════════════╪═════════════════════════╡
│ 2020-10-25 00:00:00 BST ┆ 2020-10-26 11:00:00 GMT ┆ 2020-10-26 12:00:00 GMT │
└─────────────────────────┴─────────────────────────┴─────────────────────────┘
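For anyone wondering why the two columns differ by an hour, the same arithmetic can be reproduced with the standard library alone. This is a rough sketch, assuming Python 3.9+ with time zone data available; it only mimics the "elapsed hours" versus "calendar day" distinction and says nothing about how Polars implements it.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

london = ZoneInfo("Europe/London")
# 00:00 BST on the day the clocks go back (02:00 BST becomes 01:00 GMT).
start = datetime(2020, 10, 25, 0, 0, tzinfo=london)

# PT36H: 36 elapsed hours, computed on the UTC timeline.
absolute = (start.astimezone(timezone.utc) + timedelta(hours=36)).astimezone(london)

# P1DT12H: one calendar day on the wall clock, then 12 elapsed hours.
next_day = start + timedelta(days=1)  # same-zone arithmetic keeps the wall-clock time
calendar = (next_day.astimezone(timezone.utc) + timedelta(hours=12)).astimezone(london)

print(absolute)  # 2020-10-26 11:00:00+00:00
print(calendar)  # 2020-10-26 12:00:00+00:00
```

The calendar day that starts at 2020-10-25 00:00 in London lasts 25 hours, which is where the one-hour gap comes from.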
Marking as 'accepted', meaning "OK to support ISO8601 durations as well, alongside the Polars duration string language"
Relatively low-priority, but will get to it when I get a chance
Description
I'm increasingly encountering APIs that return data using ISO8601 durations (https://en.wikipedia.org/wiki/ISO_8601#Durations).
For example, 15 minutes is PT15M.
In particular, all our internal graphql endpoints use this representation when requesting raw data.
It's probably too late to change, and undesirable to maintain two parsers, but I figured I'd put the suggestion out there anyway.
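In the meantime, a user-side workaround is to translate the ISO8601 string into the Polars duration string language before passing it on. The sketch below is a hypothetical helper (iso_to_polars_duration is not a Polars function) that handles whole-number components only and does not check that the units appear in ISO8601 order.

```python
import re

# Unit letters in ISO 8601 and their Polars duration string equivalents.
# "M" means months before the "T" and minutes after it.
_DATE_UNITS = {"Y": "y", "M": "mo", "W": "w", "D": "d"}
_TIME_UNITS = {"H": "h", "M": "m", "S": "s"}


def iso_to_polars_duration(text):
    """Translate e.g. 'P1DT12H' into '1d12h' (hypothetical helper)."""
    match = re.fullmatch(r"P((?:\d+[YMWD])*)(?:T((?:\d+[HMS])+))?", text)
    if match is None or not (match.group(1) or match.group(2)):
        raise ValueError(f"unsupported ISO 8601 duration: {text!r}")
    date_part, time_part = match.group(1), match.group(2) or ""
    pieces = [n + _DATE_UNITS[u] for n, u in re.findall(r"(\d+)([YMWD])", date_part)]
    pieces += [n + _TIME_UNITS[u] for n, u in re.findall(r"(\d+)([HMS])", time_part)]
    return "".join(pieces)


print(iso_to_polars_duration("PT15M"))    # 15m
print(iso_to_polars_duration("P1DT12H"))  # 1d12h
```

The result can then be used wherever a Polars duration string is expected, for example pl.col('ts').dt.offset_by(iso_to_polars_duration('PT15M')).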