Closed JulianCologne closed 1 week ago
Comparing JulianCologne:feat-add-D-M-Y-dot-sepatared-date-pattern-inference (09a776b) with main (6730a72)

⚡ 1 improvement
✅ 34 untouched benchmarks

| | Benchmark | main | JulianCologne:feat-add-D-M-Y-dot-sepatared-date-pattern-inference | Change |
|---|---|---|---|---|
| ⚡ | test_filter2 | 2.8 ms | 2.2 ms | +27.87% |
I think the patterns are assessed in priority order, and I doubt this is the most common form (I'd expect that to be the one separated by "-") 🤔 Can you check? (I'm on mobile at the moment, so can't confirm).
@alexander-beedie Afaik the order does NOT matter as it checks for all formats all the time. Was thinking about the order anyway. However, I did check the occurrences (https://github.com/pola-rs/polars/issues/15949#issue-2268305052) and the order by countries actually is:
if we go ahead with this, it should also be added to DATETIME_D_M_Y too (DATE_D_M_Y should be a subset of it)
@MarcoGorelli Was thinking about this, too.
However, it looks like while dd.mm.YYYY is very common as date in many countries there does not seem to be an equivalent for the datetime. This is also my experience, as I have never seen a datetime using this date format. 🤔
Afaik the order does NOT matter as it checks for all formats all the time.
IIRC it goes in order until it finds a match, then it keeps using the last successful one (until it may need to look at the next one in the group). maybe let's put them in order of popularity then?
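To make the behaviour described above concrete, here is a minimal sketch of "try patterns in priority order, then stick with the last successful one". It is not the polars implementation: it uses Python's `datetime.strptime` as a stand-in, and the pattern tuple, the `PatternInferrer` class, and the `attempts` counter are all hypothetical names invented for illustration.

```python
from datetime import date, datetime
from typing import Optional, Sequence

# Hypothetical pattern group in priority order (NOT the real polars list).
DATE_D_M_Y = ("%d-%m-%Y", "%d/%m/%Y", "%d.%m.%Y")


class PatternInferrer:
    """Tries patterns in order and remembers the last one that matched."""

    def __init__(self, patterns: Sequence[str]) -> None:
        self.patterns = tuple(patterns)
        self.current = 0   # index of the last successful pattern
        self.attempts = 0  # number of parse attempts, to show the cost of ordering

    def _try(self, value: str, fmt: str) -> Optional[date]:
        self.attempts += 1
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            return None

    def parse(self, value: str) -> Optional[date]:
        # Start from the last successful pattern, then walk the rest of the group.
        n = len(self.patterns)
        for offset in range(n):
            idx = (self.current + offset) % n
            parsed = self._try(value, self.patterns[idx])
            if parsed is not None:
                self.current = idx
                return parsed
        return None
```

Under these assumptions, the position of a pattern in the group only costs extra attempts on the first value (and on values that switch format); once a pattern has matched, later values in the same format are parsed on the first try.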
However, it looks like while dd.mm.YYYY is very common as date in many countries there does not seem to be an equivalent for the datetime. This is also my experience, as I have never seen a datetime using this date format.
ok sure
maybe let's put them in order of popularity then?
sure! But how do we measure this? I would suggest
Not sure what the performance impact really is, but if we do these "micro optimizations", should we also adjust the DATE_Y_M_D, DATETIME_D_M_Y and DATETIME_Y_M_D patterns to put the ISO 8601 standard first, as it is used by ALL widespread data engineering/analytics/ETL/... tools?
close #14990

Adds the dd.mm.YYYY date pattern, which is widely used, to date inference in to_date as well as in read_csv.
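For reference, dd.mm.YYYY corresponds to the strftime-style format string `%d.%m.%Y`. The sketch below only illustrates that format with Python's standard library; it does not touch polars inference, and `DOT_D_M_Y` and `parse_dot_date` are hypothetical names.

```python
from datetime import date, datetime

# strftime-style spelling of dd.mm.YYYY (hypothetical constant name).
DOT_D_M_Y = "%d.%m.%Y"


def parse_dot_date(text: str) -> date:
    """Parse a dot-separated day.month.year string; raises ValueError otherwise."""
    return datetime.strptime(text, DOT_D_M_Y).date()
```

A value such as "24.04.2024" parses to 24 April 2024, while a dash-separated value such as "24-04-2024" is rejected by this format.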