pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Scuffed error message when importing parquet with pl.read_csv (by accident) #16106

Open Chuck321123 opened 5 months ago

Chuck321123 commented 5 months ago

Reproducible example

Import a parquet file with pl.read_csv instead

Log output

No response

Issue description

So I made a couple of changes in my code and ran it. The console history suddenly got deleted, and all I got was this: (screenshot of the error message). There were no line numbers showing where the error occurred, and the error wiped my whole console history. Luckily I found my own mistake: I was importing a parquet file with pl.read_csv. However, it would be preferable to get a normal error message that shows which line is at fault.

Expected behavior

That I get a normal error message and the console history doesn't get deleted.

Installed versions

```
--------Version info---------
Polars:               0.20.22
Index type:           UInt32
Platform:             Windows-11-10.0.22631-SP0
Python:               3.12.2 | packaged by Anaconda, Inc. | (main, Feb 27 2024, 17:28:07) [MSC v.1916 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:          3.0.0
connectorx:
deltalake:
fastexcel:
fsspec:
gevent:
hvplot:
matplotlib:           3.8.3
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             3.1.2
pandas:               2.2.1
pyarrow:              15.0.2
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:
xlsx2csv:
xlsxwriter:
```
mcrumiller commented 5 months ago

It sounds like your file had a bad character sequence. For example, \033c will reset your terminal:

(.venv_slim) C:\>python
Python 3.11.7 (tags/v3.11.7:fa7a6f2, Dec  4 2023, 19:24:49) [MSC v.1937 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> print("\033c")

leads to:

>>>

with no console history. My guess is there was a specific byte sequence in your bad data that caused this. I'm not really sure what polars can do to prevent this sort of thing from happening, as these "invalid utf-8 byte sequence" messages usually come from std lib. But maybe it can suppress output of invalid sequences?

ritchie46 commented 5 months ago

That seems like bad luck. So we printed some of the characters and some were special and cleared your terminal output. Seems we should normalize the bytes.

Don't know which characters influence the terminal, but seems like a good first issue.
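As a rough illustration of what "normalizing the bytes" could mean, here is a minimal Python sketch that escapes anything non-printable before it reaches the terminal. The helper name `sanitize_for_display` and the 64-character cap are hypothetical and not part of polars; the real fix would presumably live in the Rust code that builds the error message.

```python
def sanitize_for_display(raw: bytes, max_len: int = 64) -> str:
    """Render arbitrary bytes as text that cannot affect the terminal.

    Hypothetical helper for illustration only; not a polars API.
    """
    # Decode leniently: invalid UTF-8 becomes U+FFFD instead of raising.
    text = raw.decode("utf-8", errors="replace")
    out = []
    for ch in text[:max_len]:
        if ch.isprintable():
            out.append(ch)
        else:
            # Control characters (ESC, BEL, newlines, ...) are written as
            # visible escapes such as \x1b, so the terminal never sees them.
            out.append(ch.encode("unicode_escape").decode("ascii"))
    # Mark truncation so the reader knows the value was cut short.
    suffix = "..." if len(text) > max_len else ""
    return "".join(out) + suffix


# The terminal-reset sequence from the comment above is shown literally
# instead of being interpreted.
print(sanitize_for_display(b"\033c"))
```

Escaping rather than dropping the bytes keeps the message useful for debugging, since the offending sequence is still visible.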

bjornamr commented 5 months ago

Hi, I would love to contribute. I am very new to the project.

  1. I was not able to reproduce the error above using a parquet file. However, I was able to read a parquet file with pl.read_csv, as suggested above.

    • Is reading a parquet file with a CSV reader something we want? Since we already have a parquet reader, we might want to raise an error when the file has a parquet extension, with a suggestion to use the parquet reader instead.
  2. Translation table

All Categories - unicoe org

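The extension check proposed in point 1 might look roughly like the sketch below. The function `check_csv_extension` and the mapping `_SUGGESTED_READERS` are hypothetical names used for illustration; only `pl.read_parquet` is an existing polars function, and it appears here just as text in the error message.

```python
from pathlib import Path

# Hypothetical mapping from file extension to the reader we would suggest.
_SUGGESTED_READERS = {
    ".parquet": "pl.read_parquet",
}


def check_csv_extension(source: str) -> None:
    """Raise early if the path looks like it belongs to another reader."""
    suffix = Path(source).suffix.lower()
    reader = _SUGGESTED_READERS.get(suffix)
    if reader is not None:
        raise ValueError(
            f"{source!r} has a {suffix!r} extension; did you mean {reader}?"
        )


check_csv_extension("data.csv")  # passes silently
try:
    check_csv_extension("data.parquet")
except ValueError as exc:
    print(exc)
```

An extension check is cheap but only a hint: a misnamed file would still slip through, and a hard error would block anyone who deliberately stores CSV data under an unusual name.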

imMoya commented 2 weeks ago

Hi, I'm also willing to contribute.

When I tried reading the parquet file with the pl.read_csv method, it actually raised this:

File "/Users/ignaciomoyaredondo/opt/anaconda3/envs/polars_env/lib/python3.11/site-packages/polars/io/csv/functions.py", line 649, in _read_csv_impl
    pydf = PyDataFrame.read_csv(
           ^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ComputeError: could not parse `N%�J�FN%�;��>N%���EN%հN%��KN%�?X�%��pN%�DN%lQ�IN%` as dtype `str` at column 'PAR1ȺȺL�*�l�7N%��W�N%.�"�N%���N%N��N%ļ�&N%��cN%�
N%.[�?7N%��`UN%��VtN%0�)8N%lY�' (column number 1)fN%�&N%PN�6N%�O�&N%�Vai"N%蟌�+N%*��;N%�+��=N%�v�g

The current offset in the file is 364 bytes.

You might want to try:
- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
- specifying correct dtype with the `dtypes` argument
- setting `ignore_errors` to `True`,
- adding `N%�J�FN%�;��>N%���EN%հN%��KN%�?X�%��pN%�DN%lQ�IN%` to the `null_values` list.

Original error: ```invalid utf-8 sequence```

Should we try to implement a method which detects whether the file passed is binary or simple text?
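One way such a check could work is to sniff the first bytes of the file. The traceback above shows the data starting with `PAR1`, which is the 4-byte magic number at the start of every Parquet file. The sketch below is a heuristic under stated assumptions: the function names and the 512-byte sample size are hypothetical, and the UTF-8 test would misclassify text in other encodings (UTF-16 CSVs contain NUL bytes, for example).

```python
import codecs

PARQUET_MAGIC = b"PAR1"  # Parquet files start (and end) with these 4 bytes


def looks_like_parquet(head: bytes) -> bool:
    """True if the sample begins with the Parquet magic number."""
    return head.startswith(PARQUET_MAGIC)


def looks_binary(head: bytes) -> bool:
    """Heuristic: NUL bytes or invalid UTF-8 suggest a non-text file."""
    if b"\x00" in head:
        return True
    # An incremental decoder with final=False tolerates a multi-byte
    # character cut off at the end of the sample, but still raises on
    # byte sequences that can never be valid UTF-8.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(head, final=False)
    except UnicodeDecodeError:
        return True
    return False


def sniff(path: str, sample_size: int = 512) -> str:
    """Classify a file as 'parquet', 'binary', or 'text' from its first bytes."""
    with open(path, "rb") as f:
        head = f.read(sample_size)
    if looks_like_parquet(head):
        return "parquet"
    if looks_binary(head):
        return "binary"
    return "text"
```

A reader could call `sniff` before parsing and turn a "parquet" or "binary" result into a short error that never echoes the raw bytes.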