pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

feat(python,rust): add date pattern `dd.mm.YYYY` #16045

Closed JulianCologne closed 1 week ago

JulianCologne commented 2 weeks ago

close #14990

import polars as pl

pl.DataFrame({"D.M.Y": ["01.02.2020", "03.04.2020", "31.12.2020"]}).with_columns(
    pl.all().str.to_date()
)

shape: (3, 1)
┌────────────┐
│ D.M.Y      │
│ ---        │
│ date       │
╞════════════╡
│ 2020-02-01 │
│ 2020-04-03 │
│ 2020-12-31 │
└────────────┘

pl.read_csv(
    source="dates\n01.02.2020\n03.04.2020\n31.12.2020".encode(),
    try_parse_dates=True,
)

shape: (3, 1)
┌────────────┐
│ dates      │
│ ---        │
│ date       │
╞════════════╡
│ 2020-02-01 │
│ 2020-04-03 │
│ 2020-12-31 │
└────────────┘
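
For comparison, the same three strings can be parsed with an explicit format instead of inference. This is a minimal sketch using only the Python standard library; the format string `%d.%m.%Y` is an assumed strftime-style spelling of `dd.mm.YYYY` and is not quoted from the PR diff.

```python
from datetime import date, datetime

# Assumed strftime-style equivalent of the dd.mm.YYYY pattern (not taken from the PR diff).
FMT = "%d.%m.%Y"

raw = ["01.02.2020", "03.04.2020", "31.12.2020"]

# Day comes first, then month, then a four-digit year, separated by dots.
parsed = [datetime.strptime(s, FMT).date() for s in raw]

for s, d in zip(raw, parsed):
    print(s, "->", d.isoformat())
```

The ISO dates printed here should agree with the `date` column shown in the outputs above.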
codspeed-hq[bot] commented 2 weeks ago

CodSpeed Performance Report

Merging #16045 will improve performance by 27.87%

Comparing JulianCologne:feat-add-D-M-Y-dot-sepatared-date-pattern-inference (09a776b) with main (6730a72)

Summary

⚡ 1 improvements ✅ 34 untouched benchmarks

Benchmarks breakdown

| Benchmark | main | JulianCologne:feat-add-D-M-Y-dot-sepatared-date-pattern-inference | Change |
| --- | --- | --- | --- |
| test_filter2 | 2.8 ms | 2.2 ms | +27.87% |

alexander-beedie commented 2 weeks ago

I think the patterns are assessed in priority order, and I doubt this is the most common form (I'd expect that to be the one separated by "-") 🤔 Can you check? (I'm on mobile at the moment, so can't confirm).

JulianCologne commented 2 weeks ago

> I think the patterns are assessed in priority order, and I doubt this is the most common form (I'd expect that to be the one separated by "-") 🤔 Can you check? (I'm on mobile at the moment, so can't confirm).

@alexander-beedie Afaik the order does NOT matter, as it checks for all formats all the time. I was thinking about the order anyway, though. I did check the occurrences (https://github.com/pola-rs/polars/issues/15949#issue-2268305052), and the order by country actually is:

> if we go ahead with this, it should also be added to DATETIME_D_M_Y too (DATE_D_M_Y should be a subset of it)

@MarcoGorelli I was thinking about this, too. However, while dd.mm.YYYY is very common as a date format in many countries, there does not seem to be an equivalent for datetimes. This also matches my experience: I have never seen a datetime using this date format. 🤔

MarcoGorelli commented 1 week ago

> Afaik the order does NOT matter as it checks for all formats all the time.

IIRC it goes in order until it finds a match, then it keeps using the last successful one (until that one fails and it needs to look at the next one in the group). Maybe let's put them in order of popularity then?
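
To make the behaviour described above concrete, here is a rough standard-library sketch of "try patterns in order, then stick with the last successful one". It is not the Polars implementation: the `PatternInferrer` class, the contents and order of the `DATE_D_M_Y` list, and the strftime-style format strings are all hypothetical and only illustrate why the position of a pattern in its group affects the first lookup but not later ones.

```python
from datetime import date, datetime

# Hypothetical pattern group; the real Polars list and its order may differ.
DATE_D_M_Y = ["%d-%m-%Y", "%d/%m/%Y", "%d.%m.%Y"]


class PatternInferrer:
    """Tries patterns in order and remembers the last one that matched."""

    def __init__(self, patterns):
        self.patterns = list(patterns)
        self.current = None  # last successful pattern, if any
        self.attempts = 0    # number of parse attempts, for illustration

    def _try(self, value, fmt):
        self.attempts += 1
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            return None

    def parse(self, value):
        # Fast path: reuse the pattern that worked last time.
        if self.current is not None:
            parsed = self._try(value, self.current)
            if parsed is not None:
                return parsed
        # Slow path: walk the group in priority order.
        for fmt in self.patterns:
            if fmt == self.current:
                continue  # already tried on the fast path
            parsed = self._try(value, fmt)
            if parsed is not None:
                self.current = fmt
                return parsed
        return None


inferrer = PatternInferrer(DATE_D_M_Y)
dates = [inferrer.parse(s) for s in ["01.02.2020", "03.04.2020", "31.12.2020"]]
print(dates, inferrer.current, inferrer.attempts)
```

In this sketch the dot-separated pattern sits last, so the first value costs three attempts and each later value costs one. Under these assumptions the ordering only matters for the first value and for columns that mix formats.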

> However, it looks like while dd.mm.YYYY is very common as date in many countries there does not seem to be an equivalent for the datetime. This is also my experience, as I have never seen a datetime using this date format.

ok sure

JulianCologne commented 1 week ago

> maybe let's put them in order of popularity then?

sure! But how do we measure this? I would suggest

Not sure what the performance impact really is, but if we do these "micro-optimizations", should we also adjust the DATE_Y_M_D, DATETIME_D_M_Y and DATETIME_Y_M_D patterns to put the ISO 8601 format first, as it is used by ALL widespread data engineering/analytics/ETL/... tools?