pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
29.51k stars 1.88k forks source link

Date/Datetime inference should allow format to change between `Y-M-D` / `D-M-Y` #16657

Open Julian-J-S opened 4 months ago

Julian-J-S commented 4 months ago

Description

The current date & datetime inference is a bit weird that it allows the format/pattern of the parsed value to change but that change can only be inside a YMD or DMY context.

Examples

Y M D to D M Y

DATES = [
    "2020-01-01",
    "2020/01/01",
    # "2020.01.01", # not yet supported
    "01-01-2020",
]

pl.DataFrame({"date": DATES}).with_columns(
    pl.col("date").str.to_date(strict=False).name.suffix("_parsed"),
)

shape: (3, 2)
┌────────────┬─────────────┐
│ date       ┆ date_parsed │
│ ---        ┆ ---         │
│ str        ┆ date        │
╞════════════╪═════════════╡
│ 2020-01-01 ┆ 2020-01-01  │
│ 2020/01/01 ┆ 2020-01-01  │ >>> format changed -> No Problem ✅ 
│ 01-01-2020 ┆ null        │ >>> format changed -> "Error" 💥 
└────────────┴─────────────┘

D M Y to Y M D

DATES = [
    "01-01-2020",
    "01/01/2020",
    "01.01.2020",
    "2020-01-01",
]

pl.DataFrame({"date": DATES}).with_columns(
    pl.col("date").str.to_date(strict=False).name.suffix("_parsed"),
)

shape: (4, 2)
┌────────────┬─────────────┐
│ date       ┆ date_parsed │
│ ---        ┆ ---         │
│ str        ┆ date        │
╞════════════╪═════════════╡
│ 01-01-2020 ┆ 2020-01-01  │
│ 01/01/2020 ┆ 2020-01-01  │ >>> format changed -> No Problem ✅ 
│ 01.01.2020 ┆ 2020-01-01  │ >>> format changed -> No Problem ✅ 
│ 2020-01-01 ┆ null        │ >>> format changed -> "Error" 💥 
└────────────┴─────────────┘

Discussion

imo this current behaviour does not make sense. I see two different scenarios

Scenario 1: Always same format

When you export data, talk to can API, use python, sql, ... whatever you do a single "system" always produces the same format. In this case everything ok

Scenario 2: Different formats

When you stitch together data for example in a csv or as text in a database it can happened to get all the different formats out there. One person / api / country is using excel / python / sql and the others another way so you end up with many different formats in one column. IMO saying that whenever we have different date/datetime formats in a single column they should be either Y M D or D M Y is far from real life and does not make sense.

Suggestion

Allow switching between all formats! :happy

deanm0000 commented 3 months ago

You can do this with coalesce

pl.select(pl.coalesce(
    pl.Series(DATES).str.to_date(x, strict=False)
    for x in ['%Y-%m-%d', '%Y/%m/%d', '%Y.%m.%d', '%d-%m-%Y']
    ))

As far as making this the default behavior or available with a parameter, I'd disagree as your example switches the order of month and day. I think it's better for users to specify what they want than for polars to have some hard coded guesses. It is already lenient to the changing delimiters. Consider that 01/01/2020 is valid as either d-m-y or m-d-y and that some counties always use m-d-y and never d-m-y so it's not clear how that should be resolved.

Back to your example, we can even simplify the above a bit by using some regex on the delimiters first

pl.select(
    pl.coalesce(pl.Series(DATES).str.replace_all(r"-|/|\.","-").str.to_date(x, strict=False)
    for x in ['%Y-%m-%d', '%d-%m-%Y'])
)
Julian-J-S commented 3 months ago

Thanks for the suggestion. Yes, there are workarounds but this is not my point.

My point is that the current Date & Datetime inference is confusing and inconsistent.

IMO Parsing Date or Datetime in a column should either allow

but not

I'd disagree as your example switches the order of month and day

Disagree here. Have seen it in multiple companies. This happens a lot. Data is appended to csvs. Excel files are shares or send around the globe where peolple from US, UK, BR, CN, ZA, DE, ... are just copying data into the excel sheets. You end up with all sorts of formats. IMO either be strict and allow only 1 format or all of them. Current implementation is not reflecting real world use cases in any way. Why are different separators okay but not different ordering? Either you have clean data or chaos

I think it's better for users to specify what they want than for polars to have some hard coded guesses

100% agree. If you know the format you should definetely always specify it!! But for many exploratory work and working with very messy data it is super usefull as a first step to let polars try to find something before cleaning up.

Consider that 01/01/2020 is valid as either d-m-y or m-d-y and that some counties always use m-d-y and never d-m-y so it's not clear how that should be resolved

This is already the case now! Has nothing to do with my issue. Maybe this should be added to the documentation: currently polars CANNOT parse month first! Only "YMD" or "DMY".

Also this is not only the case with Date but also Datetime

DATES = [
    "2020-01-01",
    "2020/01/01",
    "2020-01-01 1415",
    "2020/01/01 16:17:18",
    "01.01.2020",
]

pl.DataFrame({"date": DATES}).with_columns(
    pl.col("date").str.to_datetime(strict=False).name.suffix("_parsed"),
)

shape: (5, 2)
┌─────────────────────┬─────────────────────┐
│ date                ┆ date_parsed         │
│ ---                 ┆ ---                 │
│ str                 ┆ datetime[μs]        │
╞═════════════════════╪═════════════════════╡
│ 2020-01-01          ┆ 2020-01-01 00:00:00 │ >>> format changed -> No Problem ✅ 
│ 2020/01/01          ┆ 2020-01-01 00:00:00 │ >>> format changed -> No Problem ✅ 
│ 2020-01-01 1415     ┆ 2020-01-01 14:15:00 │ >>> format changed -> No Problem ✅ 
│ 2020/01/01 16:17:18 ┆ 2020-01-01 16:17:18 │ >>> format changed -> No Problem ✅ 
│ 01.01.2020          ┆ null                │ >>> format changed -> "Error" 💥 
└─────────────────────┴─────────────────────┘