pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

`scan_csv` does not parse the time column correctly - while `read_csv` does #14038

Open CodeCox opened 8 months ago

CodeCox commented 8 months ago

Reproducible example

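import polars as pl

buffer = """col1,col2
2023-1-1,10:11:12
2023-1-2,
2023-1-3,14:15:16
"""
# Round-trip through a file so that scan_csv has a path to read from.
df1 = pl.read_csv(buffer.encode())
df1.write_csv('tmp.csv')

df2 = pl.scan_csv('tmp.csv').collect()
df3 = pl.scan_csv('tmp.csv', try_parse_dates=True).collect()

print(df2)  # both columns stay str
print(df3)  # col1 becomes date, col2 stays str

# read_csv with the same flag parses col2 as time
df4 = pl.read_csv('tmp.csv', try_parse_dates=True)
print(df4)

shape: (3, 2)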
┌──────────┬──────────┐
│ col1     ┆ col2     │
│ ---      ┆ ---      │
│ str      ┆ str      │
╞══════════╪══════════╡
│ 2023-1-1 ┆ 10:11:12 │
│ 2023-1-2 ┆ null     │
│ 2023-1-3 ┆ 14:15:16 │
└──────────┴──────────┘
shape: (3, 2)
┌────────────┬──────────┐
│ col1       ┆ col2     │
│ ---        ┆ ---      │
│ date       ┆ str      │
╞════════════╪══════════╡
│ 2023-01-01 ┆ 10:11:12 │
│ 2023-01-02 ┆ null     │
│ 2023-01-03 ┆ 14:15:16 │
└────────────┴──────────┘
shape: (3, 2)
┌────────────┬──────────┐
│ col1       ┆ col2     │
│ ---        ┆ ---      │
│ date       ┆ time     │
╞════════════╪══════════╡
│ 2023-01-01 ┆ 10:11:12 │
│ 2023-01-02 ┆ null     │
│ 2023-01-03 ┆ 14:15:16 │
└────────────┴──────────┘

Log output

No response

Issue description

`scan_csv(..., try_parse_dates=True)` does not parse a time column. (Use case: this string format comes from a CSV export of an Outlook Calendar.)

read_csv( ... , try_parse_dates=True) is able to handle this string format. (There are various discussions about parity between `read&scan` so I will not rehash those issues here.)

Expected behavior

`scan_csv(..., try_parse_dates=True)` should parse the column the same way `read_csv()` does, i.e. `col2` should have the `time` dtype.

Installed versions

```
Polars: 0.20.6
Python: 3.12.1 | packaged by conda-forge | (main, Dec 23 2023, 08:03:24) [GCC 12.3.0]
```

cmdlineluser commented 8 months ago

Can reproduce.

An initial attempt to debug what is happening:

It seems that scan_csv calls .with_dtypes()

Which sets schema_overwrite to the schema

The schema is col1: Date, col2: String at this point (not sure where the Date parsing has been done?)

This becomes fixed_schema

Causing parse_dates to return early leaving col2 as a String type: