pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
29.52k stars 1.88k forks source link

String to datetime conversion with custom format #17167

Open tritemio opened 3 months ago

tritemio commented 3 months ago

Checks

Reproducible example

import polars as pl
import pandas as pd

df = pl.DataFrame({'dt': "2024-06-03 20:02:48.6800000"})
dt_format = "%Y-%m-%d %H:%M:%S%.6f0". # NOTE: this format has a trailing 0
df['dt'].str.to_datetime(dt_format, time_unit='ns')

Log output

Traceback (most recent call last):
  File "/Users/anto/src/poste/sda-poste-logistics/script/polars_bug_datetime.py", line 7, in <module>
    df["dt"].str.to_datetime(dt_format, time_unit="ns")
  File "/Users/anto/src/poste/sda-poste-logistics/venv/lib/python3.11/site-packages/polars/series/utils.py", line 107, in wrapper
    return s.to_frame().select_seq(f(*args, **kwargs)).to_series()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/anto/src/poste/sda-poste-logistics/venv/lib/python3.11/site-packages/polars/dataframe/frame.py", line 8524, in select_seq
    return self.lazy().select_seq(*exprs, **named_exprs).collect(_eager=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/anto/src/poste/sda-poste-logistics/venv/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 1909, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.InvalidOperationError: conversion from `str` to `datetime[ns]` failed in column 'dt' for 1 out of 1 values: ["2024-06-03 20:02:48.6800000"]

You might want to try:
- setting `strict=False` to set values that cannot be converted to `null`
- using `str.strptime`, `str.to_date`, or `str.to_datetime` and providing a format string

Issue description

Converting from string to datetime with a format string should allow to decode custom formats.

In this example the input has 7 digits for fractional seconds after the decimal dot. However the last digit is always zero and should be ignored because there is a trailing 0 in the format string.

Instead, polars gives the above error during conversion.

Stripping the extra zero from the string before attempting the conversion works correctly in polars:

dt_format1 = "%Y-%m-%d %H:%M:%S%.6f"  # NOTE: no trailing 0 in the format
df['dt'].str.strip_suffix('0').str.to_datetime(dt_format1, time_unit='ns')

Pandas accepts the original format and convert the string correctly, as does python's datetime

x = pd.to_datetime(
    df["dt"].to_pandas(use_pyarrow_extension_array=True),
    format="%Y-%m-%d %H:%M:%S%.6f0",
)
df = df.with_columns(pl.from_pandas(x))

Expected behavior

The column should be converted to datetime without error, as done in pandas and datetime from python standard lib.

Installed versions

``` --------Version info--------- Polars: 1.0.0-rc.2 Index type: UInt32 Platform: macOS-14.5-arm64-arm-64bit Python: 3.11.3 (main, Sep 1 2023, 14:56:45) [Clang 14.0.3 (clang-1403.0.22.14.1)] ----Optional dependencies---- adbc_driver_manager: 1.0.0 cloudpickle: 3.0.0 connectorx: 0.3.3 deltalake: fastexcel: 0.10.4 fsspec: 2023.12.2 gevent: 24.2.1 great_tables: hvplot: 0.10.0 matplotlib: 3.8.4 nest_asyncio: 1.6.0 numpy: 1.26.4 openpyxl: pandas: 2.2.2 pyarrow: 16.1.0 pydantic: 2.7.3 pyiceberg: 0.6.1 sqlalchemy: 2.0.30 torch: xlsx2csv: 0.8.2 xlsxwriter: 3.2.0 ```
MarcoGorelli commented 3 months ago

Thanks For the report, I'm surprised this didn't work, will check

Julian-J-S commented 3 months ago

This might be a "bug" or design decision by chrono (the rust parser) BUT

MarcoGorelli commented 3 months ago

To me this looks like it might be a bug in Chrono:

use chrono::NaiveDateTime;
fn main() {

    let result = NaiveDateTime::parse_from_str(
        "2024-06-03 20:02:48.6800000",
        "%Y-%m-%d %H:%M:%S%.6f0",
    );
    println!("{:?}", result);
    let result = NaiveDateTime::parse_from_str(
        "2024-06-03 20:02:48.680000",
        "%Y-%m-%d %H:%M:%S%.6f",
    );
    println!("{:?}", result);
}

prints out

Err(ParseError(TooShort))
Ok(2024-06-03T20:02:48.680)
tritemio commented 3 months ago

For reference I have created a pytest file that tests several format strings on datetime.strptime, pandas.to_datetime and polars to_datetime. Results shows the valid format string is different across the 3, with datetime and pandas being similar, while polars requires a different format.

I found that polars converts the string even without the trailing 0 in the format when using either %.f or %.6f. This should be the right format for chrono (although different from the python conventions).

Full results:

Screenshot 2024-06-26 at 00 10 38

pytest file:

```python from datetime import datetime import pandas as pd import polars as pl import pytest dt_string = "2024-06-03 20:02:48.6800000" dt_formats = [ "%Y-%m-%d %H:%M:%S%.f", "%Y-%m-%d %H:%M:%S%.6f", "%Y-%m-%d %H:%M:%S.%f", "%Y-%m-%d %H:%M:%S.%6f", ] dt_formats = dt_formats + [f + "0" for f in dt_formats] @pytest.fixture def df(): return pl.DataFrame({"dt": dt_string}) @pytest.mark.parametrize("format", dt_formats) def test_polars(df, format, capsys): with capsys.disabled(): print(format) print(df) t_ref = pl.Series(name="dt", values=[datetime(2024, 6, 3, 20, 2, 48, 680000)]) df2 = df.with_columns(dt=pl.col("dt").str.to_datetime(format, time_unit="ns")) with capsys.disabled(): print(df2) assert (df2["dt"] == t_ref).all() @pytest.mark.parametrize("format", dt_formats) def test_pandas(df, format, capsys): with capsys.disabled(): print(format) print(df) t_ref = pd.Series(name="dt", data=[datetime(2024, 6, 3, 20, 2, 48, 680000)]) t = pd.to_datetime( df["dt"].to_pandas(use_pyarrow_extension_array=True), format=format ) with capsys.disabled(): print(t) assert (t == t_ref).all() @pytest.mark.parametrize("format", dt_formats) def test_datetime(df, format, capsys): dt_ref = datetime(2024, 6, 3, 20, 2, 48, 680000) dt = datetime.strptime(dt_string, format) assert dt == dt_ref ```
Julian-J-S commented 3 months ago

@tritemio correct, python datetime and chrono have different behaviour for some details.

you can check out the documentation

python image

chrono image

image

Watch our for this (chrono)

date = "2020-01-01 10:00:00.1234"