Open miaocb opened 6 months ago
'2016/1/1 0:00' isn't a valid format. chrono expects month and day to be 2 digits so 2016/01/01. The same is true for hour. It might (but I'm not sure) also require seconds for a datetime but not sure. @MarcoGorelli do "we" want to make the parser robust to those types of discrepancies?
2016/1/1 0:00 is the default format for saving csv in Microsoft Excel in windows Chinsese version.
Chrono can parse "2016/1/1 0:00" by providing format string "%Y/%m/%d %H:%M".
In polars, the alternative is to read it as a string and then parse it into datetime using lazyframe with format "%Y/%m/%d %H:%M". Since to_datetime in lazyframe and chrono are both able to parse "2016/1/1 0:00", but CsvReader refuse to parse.
use polars::{lazy::dsl::StrptimeOptions, prelude::*};
fn main() {
let csv_file = "example.csv";
let df = CsvReader::from_path(csv_file).unwrap()
.has_header(true)
.with_try_parse_dates(false)
.finish().unwrap();
let options = StrptimeOptions{
format: Some("%Y/%m/%d %H:%M".to_string()),
strict: true,
.. StrptimeOptions::default()
};
let df = df.lazy().with_columns([
col("datetime")
.str()
.to_datetime(None, None, options, lit(""))
]).collect().unwrap();
println!("{}", df);
}
I think it is impossible to infer all date/datetime formats out there.
BUT imo it should be possible to pass in a date/datetime format to read_csv
This feature is accepted but needs implementation 😉 #9550
Actually, polars-time has a very complete date/datetime pattern list, The pattern "%Y/%m/%d %H:%M" which can parse "2016/1/1 0:00" is already there, and it succeeds to find this pattern at be beginning, but I can not figure out why it failed to parse later.
see polars-time-0.39.2\src\chunkedarray\string\infer.rs see polars-time-0.39.2\src\chunkedarray\string\patterns.rs
pub(super) static DATETIME_Y_M_D: &[&str] = &[
// ---
// ISO8601-like, generated via the `iso8601_format_datetime` test fixture
// ---
"%Y/%m/%dT%H:%M:%S",
"%Y-%m-%dT%H:%M:%S",
"%Y/%m/%dT%H%M%S",
"%Y-%m-%dT%H%M%S",
"%Y/%m/%dT%H:%M",
"%Y-%m-%dT%H:%M",
"%Y/%m/%dT%H%M",
"%Y-%m-%dT%H%M",
"%Y/%m/%dT%H:%M:%S.%9f",
"%Y-%m-%dT%H:%M:%S.%9f",
"%Y/%m/%dT%H:%M:%S.%6f",
"%Y-%m-%dT%H:%M:%S.%6f",
"%Y/%m/%dT%H:%M:%S.%3f",
"%Y-%m-%dT%H:%M:%S.%3f",
"%Y/%m/%dT%H%M%S.%9f",
"%Y-%m-%dT%H%M%S.%9f",
"%Y/%m/%dT%H%M%S.%6f",
"%Y-%m-%dT%H%M%S.%6f",
"%Y/%m/%dT%H%M%S.%3f",
"%Y-%m-%dT%H%M%S.%3f",
"%Y/%m/%d",
"%Y-%m-%d",
"%Y/%m/%d %H:%M:%S",
"%Y-%m-%d %H:%M:%S",
"%Y/%m/%d %H%M%S",
"%Y-%m-%d %H%M%S",
"%Y/%m/%d %H:%M",
"%Y-%m-%d %H:%M",
"%Y/%m/%d %H%M",
"%Y-%m-%d %H%M",
"%Y/%m/%d %H:%M:%S.%9f",
"%Y-%m-%d %H:%M:%S.%9f",
"%Y/%m/%d %H:%M:%S.%6f",
"%Y-%m-%d %H:%M:%S.%6f",
"%Y/%m/%d %H:%M:%S.%3f",
"%Y-%m-%d %H:%M:%S.%3f",
"%Y/%m/%d %H%M%S.%9f",
"%Y-%m-%d %H%M%S.%9f",
"%Y/%m/%d %H%M%S.%6f",
"%Y-%m-%d %H%M%S.%6f",
"%Y/%m/%d %H%M%S.%3f",
"%Y-%m-%d %H%M%S.%3f",
// ---
// other
// ---
// we cannot know this one, because polars needs to know
// the length of the parsed fmt
// ---
"%FT%H:%M:%S%.f",
];
I think the issue is here
The pattern could be adjusted so that hour, minute, and second are allowed to only be 1-digit long if there's a separator :
I think the issue is here
The pattern could be adjusted so that hour, minute, and second are allowed to only be 1-digit long if there's a separator
:
You are right. There are three regex expressions, DATETIME_DMY_PATTERN, DATETIME_YMD_PATTERN, DATETIME_YMDZ_PATTERN. Change hour (line 24, 50, 74), minute (line 26, 52, 76), second (line 29, 55, 79) to allow 1-2 digits fixed the problem. Month and day are already allowed to have 1-2 digits.
(?:\d{1,2}) # hour
(?:\d{1,2}) # minute
(?:\d{1,2}) # second
Checks
Reproducible example
contents of example.csv:
datetime 2016/1/1 0:00
Log output
Issue description
2016/1/1 00:00 works, 2016/1/1 0:00 fails.
Expected behavior
CsvReader can parse datetime 2016/1/1 0:00
Installed versions