Scuffed error message when importing parquet with pl.read_csv (by accident)

Chuck321123 commented 5 months ago

Checks

[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Import a parquet file with pl.read_csv instead

Log output

No response

Issue description

So I made a couple of changes in my code and ran it. The console history suddenly got deleted, and all I got was a scuffed error message with no indication of which line the error occurred on. Luckily I found my own mistake: I was importing a parquet file with pl.read_csv. However, it would be preferable to get a normal error message that shows which line is at fault.

Expected behavior

That I get a normal error message and the console history doesn't get deleted.

Installed versions

```
--------Version info---------
Polars:               0.20.22
Index type:           UInt32
Platform:             Windows-11-10.0.22631-SP0
Python:               3.12.2 | packaged by Anaconda, Inc. | (main, Feb 27 2024, 17:28:07) [MSC v.1916 64 bit (AMD64)]
----Optional dependencies----
adbc_driver_manager:
cloudpickle:          3.0.0
connectorx:
deltalake:
fastexcel:
fsspec:
gevent:
hvplot:
matplotlib:           3.8.3
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             3.1.2
pandas:               2.2.1
pyarrow:              15.0.2
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:
xlsx2csv:
xlsxwriter:
```

mcrumiller commented 5 months ago

It sounds like your file had a bad character sequence. For example, \033c will reset your terminal:

(.venv_slim) C:\>python
Python 3.11.7 (tags/v3.11.7:fa7a6f2, Dec  4 2023, 19:24:49) [MSC v.1937 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> print("\033c")

leads to:

>>>

with no console history. My guess is there was a specific byte sequence in your bad data that caused this. I'm not really sure what polars can do to prevent this sort of thing from happening, as these "invalid utf-8 byte sequence" messages usually come from std lib. But maybe it can suppress output of invalid sequences?
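To illustrate what "suppressing" such sequences could look like, here is a minimal Python sketch. `escape_for_terminal` is a hypothetical helper, not a Polars function, and the sample bytes are made up for the demo. It decodes leniently and then renders every non-printable character as a visible backslash escape, so an ESC byte can no longer reach the terminal:

```python
def escape_for_terminal(raw: bytes) -> str:
    """Render raw bytes as text that is safe to print (hypothetical helper)."""
    # Decode leniently: invalid UTF-8 becomes U+FFFD instead of raising.
    text = raw.decode("utf-8", errors="replace")
    # Replace each non-printable character (ESC included) with a visible
    # backslash escape such as \x1b; printable characters pass through.
    return "".join(
        ch if ch.isprintable() else ch.encode("unicode_escape").decode("ascii")
        for ch in text
    )


# Made-up sample: parquet-like prefix, the terminal reset sequence, a bad byte.
sample = b"PAR1\x1bc\xff"
print(escape_for_terminal(sample))
```

The printed string contains the literal characters `\x1b` rather than the escape byte itself, so the terminal is left alone.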

ritchie46 commented 5 months ago

That seems like bad luck. So we printed some of the characters and some were special and cleared your terminal output. Seems we should normalize the bytes.

Don't know which characters influence the terminal, but seems like a good first issue.

bjornamr commented 5 months ago

Hi, I would love to contribute. I am very new to the project.

I was not able to reproduce the error above using a parquet file. However, I was able to read a parquet file like suggested above.
- Is reading a parquet file with a csv reader something we want? Since we already have a parquet reader, we might want to throw an error if the file has a parquet extension, with a suggestion to use the parquet reader.

Translation table

Like @mcrumiller pointed out, \033c resets the terminal. One way of solving this and other issues could be to make a translation table excluding all characters in the Unicode category 'Cc'.
One example of this can be read here: Stack example - translation table
You can read more about the Cc category here: Control Codes - wiki

All Categories - unicode org
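A rough Python sketch of the translation-table idea, using only the standard library. The names `CONTROL_TABLE` and `strip_control_chars` are made up for illustration, and keeping tab and newline is my own assumption, since both are in 'Cc' but are usually harmless:

```python
import sys
import unicodedata

# Tab and newline are also category "Cc"; keeping them is an assumption.
KEEP = {"\t", "\n"}

# Map every remaining "Cc" code point to None so str.translate drops it.
CONTROL_TABLE = {
    cp: None
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Cc" and chr(cp) not in KEEP
}


def strip_control_chars(text: str) -> str:
    """Remove control characters (ESC included) before printing."""
    return text.translate(CONTROL_TABLE)


print(strip_control_chars("before\x1bcafter"))
```

Building the table once and reusing it keeps the per-message cost down to a single `translate` call. Dropping the characters loses information, though, so escaping them visibly may be preferable in an error message.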

imMoya commented 2 weeks ago

Hi, I'm also willing to contribute.

When I tried reading the parquet file with the pl.read_csv method, it actually throws this:

File "/Users/ignaciomoyaredondo/opt/anaconda3/envs/polars_env/lib/python3.11/site-packages/polars/io/csv/functions.py", line 649, in _read_csv_impl
    pydf = PyDataFrame.read_csv(
           ^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ComputeError: could not parse `N%�J�FN%�;��>N%���EN%հN%��KN%�?X�%��pN%�DN%lQ�IN%` as dtype `str` at column 'PAR1ȺȺL�*�l�7N%��W�N%.�"�N%���N%N��N%ļ�&N%��cN%�
N%.[�?7N%��`UN%��VtN%0�)8N%lY�' (column number 1)fN%�&N%PN�6N%�O�&N%�Vai"N%蟌�+N%*��;N%�+��=N%�v�g

The current offset in the file is 364 bytes.

You might want to try:
- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
- specifying correct dtype with the `dtypes` argument
- setting `ignore_errors` to `True`,
- adding `N%�J�FN%�;��>N%���EN%հN%��KN%�?X�%��pN%�DN%lQ�IN%` to the `null_values` list.

Original error: ```invalid utf-8 sequence```

Should we try to implement a method which detects whether the file passed is binary or simple text?
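For discussion, a minimal sketch of such a check in Python. Parquet files start with the 4-byte magic `PAR1`, which is also visible at the start of the garbled column name in the error above. The function names and the 1024-byte sample size are assumptions of mine, not existing Polars code, and the binary check is only a heuristic, since a CSV in a non-UTF-8 encoding would also trip it:

```python
import codecs

PARQUET_MAGIC = b"PAR1"  # first four bytes of every parquet file


def looks_like_parquet(path: str) -> bool:
    """Return True if the file starts with the parquet magic bytes."""
    with open(path, "rb") as fh:
        return fh.read(4) == PARQUET_MAGIC


def looks_binary(path: str, sample_size: int = 1024) -> bool:
    """Heuristic: NUL bytes or invalid UTF-8 in the first chunk mean binary."""
    with open(path, "rb") as fh:
        chunk = fh.read(sample_size)
    if b"\x00" in chunk:
        return True
    # An incremental decoder tolerates a multi-byte character cut off at the
    # end of the sample, but still raises on genuinely invalid bytes.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(chunk)
    except UnicodeDecodeError:
        return True
    return False
```

A reader could call `looks_like_parquet` first and raise an error suggesting the parquet reader, falling back to `looks_binary` for a more generic "this does not look like text" message.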
