
CSV parsing: ComputeError #15854

Open · CameronBieganek opened this issue 4 months ago

CameronBieganek commented 4 months ago


Reproducible example

Use the following CSV file:

"serial_number","data_date","data_latitude","data_longitude","ign_status","is_power_on","is_zone_1_active","is_zone_1_door_open","unit_mode_detail","engine_hours","electrical_hours","engine_rpm","voltage","ambient_temperature","set_point_1","discharge_air_1","return_air_1","power_off_description","system_operating_mode","zone_1_control_condition"
"6001320386",2021-10-11 20:02:47.000,35.464762,-97.542528,false,False,,False,,6359,0,,13.57,,,,,Countdown,,

And the following Python script:

```python
import polars as pl

schema = {
    "serial_number": pl.Utf8,
    "data_date": pl.Datetime,
    "data_latitude": pl.Float64,
    "data_longitude": pl.Float64,
    "ign_status": pl.Boolean,
    "is_power_on": pl.Boolean,
    "is_zone_1_active": pl.Boolean,
    "is_zone_1_door_open": pl.Boolean,
    "unit_mode_detail": pl.Utf8,
    "system_operating_mode": pl.Utf8,
    "zone_1_control_condition": pl.Utf8,
    "power_off_description": pl.Utf8,
    "engine_hours": pl.Float64,
    "electrical_hours": pl.Float64,
    "engine_rpm": pl.Float64,
    "voltage": pl.Float64,
    "ambient_temperature": pl.Float64,
    "set_point_1": pl.Float64,
    "discharge_air_1": pl.Float64,
    "return_air_1": pl.Float64
}

data = pl.read_csv("test.csv", schema=schema)
```

Output:

```
---------------------------------------------------------------------------
ComputeError                              Traceback (most recent call last)
File ~\projects\polars_env\csv_parsing_bug.py:28
      3 import polars as pl
      5 schema = {
      6     "serial_number": pl.Utf8,
      7     "data_date": pl.Datetime,
   (...)
     25     "return_air_1": pl.Float64
     26 }
---> 28 data = pl.read_csv("test.csv", schema=schema)

File ~\projects\polars_env\venv\Lib\site-packages\polars\_utils\deprecation.py:134, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    129 @wraps(function)
    130 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
    131     _rename_keyword_argument(
    132         old_name, new_name, kwargs, function.__name__, version
    133     )
--> 134     return function(*args, **kwargs)

File ~\projects\polars_env\venv\Lib\site-packages\polars\_utils\deprecation.py:134, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    129 @wraps(function)
    130 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
    131     _rename_keyword_argument(
    132         old_name, new_name, kwargs, function.__name__, version
    133     )
--> 134     return function(*args, **kwargs)

File ~\projects\polars_env\venv\Lib\site-packages\polars\_utils\deprecation.py:134, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    129 @wraps(function)
    130 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
    131     _rename_keyword_argument(
    132         old_name, new_name, kwargs, function.__name__, version
    133     )
--> 134     return function(*args, **kwargs)

File ~\projects\polars_env\venv\Lib\site-packages\polars\io\csv\functions.py:416, in read_csv(source, has_header, columns, new_columns, separator, comment_prefix, quote_char, skip_rows, dtypes, schema, null_values, missing_utf8_is_empty_string, ignore_errors, try_parse_dates, n_threads, infer_schema_length, batch_size, n_rows, encoding, low_memory, rechunk, use_pyarrow, storage_options, skip_rows_after_header, row_index_name, row_index_offset, sample_size, eol_char, raise_if_empty, truncate_ragged_lines, decimal_comma)
    404         dtypes = {
    405             new_to_current.get(column_name, column_name): column_dtype
    406             for column_name, column_dtype in dtypes.items()
    407         }
    409 with prepare_file_arg(
    410     source,
    411     encoding=encoding,
   (...)
    414     storage_options=storage_options,
    415 ) as data:
--> 416     df = _read_csv_impl(
    417         data,
    418         has_header=has_header,
    419         columns=columns if columns else projection,
    420         separator=separator,
    421         comment_prefix=comment_prefix,
    422         quote_char=quote_char,
    423         skip_rows=skip_rows,
    424         dtypes=dtypes,
    425         schema=schema,
    426         null_values=null_values,
    427         missing_utf8_is_empty_string=missing_utf8_is_empty_string,
    428         ignore_errors=ignore_errors,
    429         try_parse_dates=try_parse_dates,
    430         n_threads=n_threads,
    431         infer_schema_length=infer_schema_length,
    432         batch_size=batch_size,
    433         n_rows=n_rows,
    434         encoding=encoding if encoding == "utf8-lossy" else "utf8",
    435         low_memory=low_memory,
    436         rechunk=rechunk,
    437         skip_rows_after_header=skip_rows_after_header,
    438         row_index_name=row_index_name,
    439         row_index_offset=row_index_offset,
    440         sample_size=sample_size,
    441         eol_char=eol_char,
    442         raise_if_empty=raise_if_empty,
    443         truncate_ragged_lines=truncate_ragged_lines,
    444         decimal_comma=decimal_comma,
    445     )
    447 if new_columns:
    448     return _update_columns(df, new_columns)

File ~\projects\polars_env\venv\Lib\site-packages\polars\io\csv\functions.py:559, in _read_csv_impl(source, has_header, columns, separator, comment_prefix, quote_char, skip_rows, dtypes, schema, null_values, missing_utf8_is_empty_string, ignore_errors, try_parse_dates, n_threads, infer_schema_length, batch_size, n_rows, encoding, low_memory, rechunk, skip_rows_after_header, row_index_name, row_index_offset, sample_size, eol_char, raise_if_empty, truncate_ragged_lines, decimal_comma)
    555         raise ValueError(msg)
    557 projection, columns = parse_columns_arg(columns)
--> 559 pydf = PyDataFrame.read_csv(
    560     source,
    561     infer_schema_length,
    562     batch_size,
    563     has_header,
    564     ignore_errors,
    565     n_rows,
    566     skip_rows,
    567     projection,
    568     separator,
    569     rechunk,
    570     columns,
    571     encoding,
    572     n_threads,
    573     path,
    574     dtype_list,
    575     dtype_slice,
    576     low_memory,
    577     comment_prefix,
    578     quote_char,
    579     processed_null_values,
    580     missing_utf8_is_empty_string,
    581     try_parse_dates,
    582     skip_rows_after_header,
    583     parse_row_index_args(row_index_name, row_index_offset),
    584     sample_size=sample_size,
    585     eol_char=eol_char,
    586     raise_if_empty=raise_if_empty,
    587     truncate_ragged_lines=truncate_ragged_lines,
    588     decimal_comma=decimal_comma,
    589     schema=schema,
    590 )
    591 return wrap_df(pydf)

ComputeError: could not parse `Countdown` as dtype `f64` at column 'set_point_1' (column number 18)

The current offset in the file is 457 bytes.

You might want to try:
- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
- specifying correct dtype with the `dtypes` argument
- setting `ignore_errors` to `True`,
- adding `Countdown` to the `null_values` list.

Original error: remaining bytes non-empty
```
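None of the suggestions in the message address the actual problem here, which is that read_csv applies the schema entries positionally rather than by name. A workaround sketch, reusing the `schema` dict from the reproduction above (the `n_rows=0` trick for reading just the header is my own assumption, not an official recipe): learn the file's column order first, then reorder the schema to match it before parsing.

```python
# Read only the header row to learn the file's column order.
header = pl.read_csv("test.csv", n_rows=0).columns

# Reorder the schema entries to match the header, then parse for real.
schema_ordered = {name: schema[name] for name in header}
data = pl.read_csv("test.csv", schema=schema_ordered)
```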

Installed versions

```
--------Version info---------
Polars:               0.20.22
Index type:           UInt32
Platform:             Windows-10-10.0.19045-SP0
Python:               3.11.4 (tags/v3.11.4:d2340ef, Jun 7 2023, 05:45:37) [MSC v.1934 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fastexcel:
fsspec:
gevent:
hvplot:
matplotlib:
nest_asyncio:
numpy:
openpyxl:
pandas:
pyarrow:
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:
xlsx2csv:
xlsxwriter:
```
CameronBieganek commented 4 months ago

Note that scan_csv works, like this:

```python
data = pl.scan_csv("test.csv", schema=schema)
```

...where the file and the schema dictionary are the same as above. I'm guessing the read_csv error happens because the column order in the schema does not match the column order in the CSV. That would fit the traceback: the 18th entry of the schema dict is set_point_1 (Float64), which lines up positionally with the 18th column of the file, power_off_description, whose value is "Countdown", exactly the parse that fails. Normally I would expect the order of entries in a dictionary to be immaterial, although dictionaries do preserve insertion order (an implementation detail in CPython 3.6, guaranteed since Python 3.7).

I already have a very similar issue open. Basically this comes down to very poor error messages when the schema argument is involved. The docstring entry for schema could also be more explicit about the requirements, e.g. that the order of entries in the dictionary must match the order of the columns in the file.
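If only the dtypes matter and the file's own column order can be kept, passing the mapping as `dtypes` instead of `schema` may be the safer route, since `dtypes` matches columns by name and ignores dictionary order. A sketch under that assumption, using the same file and mapping as above (the parameter was later renamed `schema_overrides`):

```python
# `dtypes` matches on column name, so the order of entries is irrelevant
# and the output keeps the CSV's own column order.
data = pl.read_csv("test.csv", dtypes=schema)
```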

cmdlineluser commented 4 months ago

That is odd.

Just a visualization of how the schema argument is treated differently by read_csv and scan_csv:

```python
import tempfile
import polars as pl

f = tempfile.NamedTemporaryFile()
f.write(b"""
A,B
1,2
""".strip())
f.seek(0)

pl.read_csv(f.name, schema={"B": pl.String, "A": pl.Int32})
# shape: (1, 2)
# ┌─────┬─────┐
# │ B   ┆ A   │
# │ --- ┆ --- │
# │ str ┆ i32 │
# ╞═════╪═════╡
# │ 1   ┆ 2   │
# └─────┴─────┘

pl.scan_csv(f.name, schema={"B": pl.String, "A": pl.Int32}).collect()
# shape: (1, 2)
# ┌─────┬─────┐
# │ A   ┆ B   │
# │ --- ┆ --- │
# │ i32 ┆ str │
# ╞═════╪═════╡
# │ 1   ┆ 2   │
# └─────┴─────┘
```
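The read_csv result is arguably worse than an error: the value from file column A comes back relabelled as B and parsed as a string. A small check (same toy file as above) that makes the positional pairing explicit:

```python
out = pl.read_csv(f.name, schema={"B": pl.String, "A": pl.Int32})

# The first file column ("A", value 1) was relabelled "B" and read as a
# string; the second ("B", value 2) became "A" with dtype Int32.
assert out.columns == ["B", "A"]
assert out.row(0) == ("1", 2)
```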

[Update]: It seems https://github.com/pola-rs/polars/issues/11723 contains a mention of this.

Found in the redesign issue: