pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Empty/full whitespace value gets converted to null in CSV parser if it is in the first column #12832

Closed lizdeika closed 9 months ago

lizdeika commented 11 months ago

Reproducible example

from io import StringIO

import polars as pl

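# Tabs are written as \t so the whitespace-only fields (a single space each,
# per the issue description) stay visible.
tsv_data = """
a\txxxxxx\t12
b\tvvvvvv\t13
c\teee\t14
 \t \t0.0
d\tttt\t66
e\tggg\t444
f\t \t44
 \t \t0.1
"""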

column_list = ["col1", "col2", "col3"]
schema = {"col1": pl.Utf8, "col2": pl.Utf8, "col3": pl.Utf8}
dtypes = {"col1": pl.Utf8, "col2": pl.Utf8, "col3": pl.Utf8}

df = pl.read_csv(
    StringIO(tsv_data),
    has_header=False,
    schema=schema,
    dtypes=dtypes,
    new_columns=column_list,
    separator="\t",
)

print(df)

Log output

shape: (8, 3)
┌──────┬────────┬──────┐
│ col1 ┆ col2   ┆ col3 │
│ ---  ┆ ---    ┆ ---  │
│ str  ┆ str    ┆ str  │
╞══════╪════════╪══════╡
│ a    ┆ xxxxxx ┆ 12   │
│ b    ┆ vvvvvv ┆ 13   │
│ c    ┆ eee    ┆ 14   │
│ null ┆        ┆ 0.0  │
│ d    ┆ ttt    ┆ 66   │
│ e    ┆ ggg    ┆ 444  │
│ f    ┆        ┆ 44   │
│ null ┆        ┆ 0.1  │
└──────┴────────┴──────┘

Issue description

A simple TSV file in which the first-column value of the 4th and last rows is a single SPACE character. Those spaces get converted to nulls. The problem does not occur in columns other than the first.

Expected behavior

Space is Space, not null
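
For comparison, here is a minimal sketch of the expected result using Python's standard-library `csv` module, which keeps whitespace-only fields verbatim. It assumes each whitespace-only cell in the example is a single space and uses a shortened, hypothetical three-row sample; it is only a reference point, not a statement about how Polars parses internally.

```python
import csv
from io import StringIO

# Assumption: whitespace-only cells are a single space each.
# Tabs are written as \t so the fields are unambiguous.
tsv_data = "a\txxxxxx\t12\n \t \t0.0\nf\t \t44\n"

# csv.reader does not strip or null out whitespace-only fields by default.
rows = list(csv.reader(StringIO(tsv_data), delimiter="\t"))

for row in rows:
    print(row)
```

In this sketch the first field of the second row comes back as the string `" "`, which is the value the report expects to see in `col1`.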

Installed versions

```
--------Version info---------
Polars:              0.19.18
Index type:          UInt32
Platform:            macOS-14.0-arm64-arm-64bit
Python:              3.11.6 (main, Oct 2 2023, 13:45:54) [Clang 15.0.0 (clang-1500.0.40.1)]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fsspec:              2023.6.0
gevent:
matplotlib:
numpy:               1.25.1
openpyxl:
pandas:              2.0.3
pyarrow:             12.0.1
pydantic:            1.10.12
pyiceberg:
pyxlsb:
sqlalchemy:          2.0.16
xlsx2csv:
xlsxwriter:
```

orlp commented 11 months ago

This is not limited to tab-separated values; the same happens for CSV with commas as well.

lizdeika commented 11 months ago

Maybe this will help: with missing_utf8_is_empty_string=True, a space in the first column gets converted to an empty string "" instead of null. It looks like a space is not recognized as a UTF-8 character when it is the value of the first column.

Wainberg commented 11 months ago

Similar whitespace-related CSV bugs: https://github.com/pola-rs/polars/issues/10587, https://github.com/pola-rs/polars/issues/12763

lizdeika commented 11 months ago

Seems I should fall back to pandas.

orlp commented 11 months ago

@lizdeika Pull requests are welcome!

taki-mekhalfa commented 9 months ago

Not able to reproduce using polars 0.20.6 anymore; was fixed by: https://github.com/pola-rs/polars/pull/13934