Open Julian-J-S opened 4 months ago
You can do this with coalesce
pl.select(pl.coalesce(
pl.Series(DATES).str.to_date(x, strict=False)
for x in ['%Y-%m-%d', '%Y/%m/%d', '%Y.%m.%d', '%d-%m-%Y']
))
As far as making this the default behavior or available with a parameter, I'd disagree as your example switches the order of month and day. I think it's better for users to specify what they want than for polars to have some hard coded guesses. It is already lenient to the changing delimiters. Consider that 01/01/2020 is valid as either d-m-y or m-d-y and that some counties always use m-d-y and never d-m-y so it's not clear how that should be resolved.
Back to your example, we can even simplify the above a bit by using some regex on the delimiters first
pl.select(
pl.coalesce(pl.Series(DATES).str.replace_all(r"-|/|\.","-").str.to_date(x, strict=False)
for x in ['%Y-%m-%d', '%d-%m-%Y'])
)
Thanks for the suggestion. Yes, there are workarounds but this is not my point.
My point is that the current Date & Datetime inference is confusing and inconsistent.
IMO Parsing Date or Datetime in a column should either allow
null
but not
I'd disagree as your example switches the order of month and day
Disagree here. Have seen it in multiple companies. This happens a lot. Data is appended to csvs. Excel files are shares or send around the globe where peolple from US, UK, BR, CN, ZA, DE, ... are just copying data into the excel sheets. You end up with all sorts of formats. IMO either be strict and allow only 1 format or all of them. Current implementation is not reflecting real world use cases in any way. Why are different separators okay but not different ordering? Either you have clean data or chaos
I think it's better for users to specify what they want than for polars to have some hard coded guesses
100% agree. If you know the format you should definetely always specify it!! But for many exploratory work and working with very messy data it is super usefull as a first step to let polars try to find something before cleaning up.
Consider that 01/01/2020 is valid as either d-m-y or m-d-y and that some counties always use m-d-y and never d-m-y so it's not clear how that should be resolved
This is already the case now! Has nothing to do with my issue. Maybe this should be added to the documentation: currently polars CANNOT parse month first! Only "YMD" or "DMY".
Also this is not only the case with Date but also Datetime
DATES = [
"2020-01-01",
"2020/01/01",
"2020-01-01 1415",
"2020/01/01 16:17:18",
"01.01.2020",
]
pl.DataFrame({"date": DATES}).with_columns(
pl.col("date").str.to_datetime(strict=False).name.suffix("_parsed"),
)
shape: (5, 2)
┌─────────────────────┬─────────────────────┐
│ date ┆ date_parsed │
│ --- ┆ --- │
│ str ┆ datetime[μs] │
╞═════════════════════╪═════════════════════╡
│ 2020-01-01 ┆ 2020-01-01 00:00:00 │ >>> format changed -> No Problem ✅
│ 2020/01/01 ┆ 2020-01-01 00:00:00 │ >>> format changed -> No Problem ✅
│ 2020-01-01 1415 ┆ 2020-01-01 14:15:00 │ >>> format changed -> No Problem ✅
│ 2020/01/01 16:17:18 ┆ 2020-01-01 16:17:18 │ >>> format changed -> No Problem ✅
│ 01.01.2020 ┆ null │ >>> format changed -> "Error" 💥
└─────────────────────┴─────────────────────┘
Description
The current date & datetime inference is a bit weird that it allows the format/pattern of the parsed value to change but that change can only be inside a YMD or DMY context.
Examples
Y M D
toD M Y
D M Y
toY M D
Discussion
imo this current behaviour does not make sense. I see two different scenarios
Scenario 1: Always same format
When you export data, talk to can API, use python, sql, ... whatever you do a single "system" always produces the same format. In this case everything ok
Scenario 2: Different formats
When you stitch together data for example in a csv or as text in a database it can happened to get all the different formats out there. One person / api / country is using excel / python / sql and the others another way so you end up with many different formats in one column. IMO saying that whenever we have different date/datetime formats in a single column they should be either
Y M D
orD M Y
is far from real life and does not make sense.Suggestion
Allow switching between all formats! :happy