pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
30.17k stars 1.94k forks source link

CsvReader cannot parse datetime 2016/1/1 0:00 in csv file. #16092

Open miaocb opened 6 months ago

miaocb commented 6 months ago

Checks

Reproducible example

contents of example.csv:

datetime 2016/1/1 0:00

use polars::prelude::*;

fn main() {
    let csv_file = "example.csv";
    let df = CsvReader::from_path(csv_file).unwrap()
    .has_header(true)
    .with_try_parse_dates(true)
    .finish().unwrap();
    println!("{}", df);
}

Log output

file < 128 rows, no statistics determined
no. of chunks: 1 processed by: 1 threads.
thread 'main' panicked at src\main.rs:8:15:
called `Result::unwrap()` on an `Err` value: ComputeError(ErrString("could not parse `2016/1/1 0:00` as dtype `datetime[μs]` at column 'datetime' (column number 1)\n\nThe current offset in the file is 10 bytes.\n\nYou might want to try:\n- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),\n- specifying correct dtype with the `dtypes` argument\n- setting `ignore_errors` to `True`,\n- adding `2016/1/1 0:00` to the `null_values` list.\n\nOriginal error:  not parse '2016/1/1 0:00' with pattern 'DatetimeYMD'"))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
error: process didn't exit successfully: `target\debug\test_datetime.exe` (exit code: 101)

Issue description

2016/1/1 00:00 works, 2016/1/1 0:00 fails.

Expected behavior

CsvReader can parse datetime 2016/1/1 0:00

Installed versions

[dependencies] polars = "0.39.2"
deanm0000 commented 6 months ago

'2016/1/1 0:00' isn't a valid format. chrono expects month and day to be 2 digits so 2016/01/01. The same is true for hour. It might (but I'm not sure) also require seconds for a datetime but not sure. @MarcoGorelli do "we" want to make the parser robust to those types of discrepancies?

miaocb commented 6 months ago

2016/1/1 0:00 is the default format for saving csv in Microsoft Excel in windows Chinsese version.

Chrono can parse "2016/1/1 0:00" by providing format string "%Y/%m/%d %H:%M".

In polars, the alternative is to read it as a string and then parse it into datetime using lazyframe with format "%Y/%m/%d %H:%M". Since to_datetime in lazyframe and chrono are both able to parse "2016/1/1 0:00", but CsvReader refuse to parse.

use polars::{lazy::dsl::StrptimeOptions, prelude::*};

fn main() {
    let csv_file = "example.csv";
    let df = CsvReader::from_path(csv_file).unwrap()
    .has_header(true)
    .with_try_parse_dates(false)
    .finish().unwrap();

    let options = StrptimeOptions{
        format: Some("%Y/%m/%d %H:%M".to_string()),
        strict: true,
        .. StrptimeOptions::default()
    };
    let df = df.lazy().with_columns([
        col("datetime")
        .str()
        .to_datetime(None, None, options, lit(""))
    ]).collect().unwrap();

    println!("{}", df);
}
Julian-J-S commented 6 months ago

I think it is impossible to infer all date/datetime formats out there.

BUT imo it should be possible to pass in a date/datetime format to read_csv

This feature is accepted but needs implementation 😉 #9550

miaocb commented 6 months ago

Actually, polars-time has a very complete date/datetime pattern list, The pattern "%Y/%m/%d %H:%M" which can parse "2016/1/1 0:00" is already there, and it succeeds to find this pattern at be beginning, but I can not figure out why it failed to parse later.

see polars-time-0.39.2\src\chunkedarray\string\infer.rs see polars-time-0.39.2\src\chunkedarray\string\patterns.rs

pub(super) static DATETIME_Y_M_D: &[&str] = &[
    // ---
    // ISO8601-like, generated via the `iso8601_format_datetime` test fixture
    // ---
    "%Y/%m/%dT%H:%M:%S",
    "%Y-%m-%dT%H:%M:%S",
    "%Y/%m/%dT%H%M%S",
    "%Y-%m-%dT%H%M%S",
    "%Y/%m/%dT%H:%M",
    "%Y-%m-%dT%H:%M",
    "%Y/%m/%dT%H%M",
    "%Y-%m-%dT%H%M",
    "%Y/%m/%dT%H:%M:%S.%9f",
    "%Y-%m-%dT%H:%M:%S.%9f",
    "%Y/%m/%dT%H:%M:%S.%6f",
    "%Y-%m-%dT%H:%M:%S.%6f",
    "%Y/%m/%dT%H:%M:%S.%3f",
    "%Y-%m-%dT%H:%M:%S.%3f",
    "%Y/%m/%dT%H%M%S.%9f",
    "%Y-%m-%dT%H%M%S.%9f",
    "%Y/%m/%dT%H%M%S.%6f",
    "%Y-%m-%dT%H%M%S.%6f",
    "%Y/%m/%dT%H%M%S.%3f",
    "%Y-%m-%dT%H%M%S.%3f",
    "%Y/%m/%d",
    "%Y-%m-%d",
    "%Y/%m/%d %H:%M:%S",
    "%Y-%m-%d %H:%M:%S",
    "%Y/%m/%d %H%M%S",
    "%Y-%m-%d %H%M%S",
    "%Y/%m/%d %H:%M",
    "%Y-%m-%d %H:%M",
    "%Y/%m/%d %H%M",
    "%Y-%m-%d %H%M",
    "%Y/%m/%d %H:%M:%S.%9f",
    "%Y-%m-%d %H:%M:%S.%9f",
    "%Y/%m/%d %H:%M:%S.%6f",
    "%Y-%m-%d %H:%M:%S.%6f",
    "%Y/%m/%d %H:%M:%S.%3f",
    "%Y-%m-%d %H:%M:%S.%3f",
    "%Y/%m/%d %H%M%S.%9f",
    "%Y-%m-%d %H%M%S.%9f",
    "%Y/%m/%d %H%M%S.%6f",
    "%Y-%m-%d %H%M%S.%6f",
    "%Y/%m/%d %H%M%S.%3f",
    "%Y-%m-%d %H%M%S.%3f",
    // ---
    // other
    // ---
    // we cannot know this one, because polars needs to know
    // the length of the parsed fmt
    // ---
    "%FT%H:%M:%S%.f",
];
MarcoGorelli commented 6 months ago

I think the issue is here

https://github.com/pola-rs/polars/blob/1195f85221ea05f60eac1c2f6a94cb11028f13b7/crates/polars-time/src/chunkedarray/string/infer.rs#L40-L64

The pattern could be adjusted so that hour, minute, and second are allowed to only be 1-digit long if there's a separator :

miaocb commented 6 months ago

I think the issue is here

https://github.com/pola-rs/polars/blob/1195f85221ea05f60eac1c2f6a94cb11028f13b7/crates/polars-time/src/chunkedarray/string/infer.rs#L40-L64

The pattern could be adjusted so that hour, minute, and second are allowed to only be 1-digit long if there's a separator :

You are right. There are three regex expressions, DATETIME_DMY_PATTERN, DATETIME_YMD_PATTERN, DATETIME_YMDZ_PATTERN. Change hour (line 24, 50, 74), minute (line 26, 52, 76), second (line 29, 55, 79) to allow 1-2 digits fixed the problem. Month and day are already allowed to have 1-2 digits.

            (?:\d{1,2})                # hour
            (?:\d{1,2})                # minute
                (?:\d{1,2})            # second