Open Chuck321123 opened 5 months ago
It sounds like your file had a bad character sequence. For example, \033c will reset your terminal:
(.venv_slim) C:\>python
Python 3.11.7 (tags/v3.11.7:fa7a6f2, Dec 4 2023, 19:24:49) [MSC v.1937 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> print("\033c")
leads to:
>>>
with no console history. My guess is there was a specific byte sequence in your bad data that caused this. I'm not really sure what polars can do to prevent this sort of thing from happening, as these "invalid utf-8 byte sequence" messages usually come from std lib. But maybe it can suppress output of invalid sequences?
That seems like bad luck. So we printed some of the characters and some were special and cleared your terminal output. Seems we should normalize the bytes.
Don't know which characters influence the terminal, but seems like a good first issue.
Hi, I would love to contribute. I am very new to the project.
I was not able to reproduce the error above using a parquet file. However, I was able to read a parquet file as suggested above.
Translation table
As @mcrumiller pointed out, \033c resets the terminal. One way of solving this and other issues could be to make a translation table excluding all characters in the Unicode category 'Cc'.
One example of this can be read here: Stack example - translation table
You can read more about the control-character category Cc here: Control Codes - wiki
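To make the translation-table idea concrete, here is a minimal sketch in plain Python. It is not how polars handles this, and the names `CONTROL_CHARS` and `strip_control_chars` are made up for illustration. It relies on `str.translate` deleting any code point that maps to `None`.

```python
import sys
import unicodedata

# Hypothetical helper table: every code point whose Unicode general
# category is "Cc" (control characters) maps to None, which makes
# str.translate delete it.
CONTROL_CHARS = {
    cp: None
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Cc"
}


def strip_control_chars(text: str) -> str:
    """Return text with all category-Cc characters removed."""
    return text.translate(CONTROL_CHARS)


# ESC (\033) is removed, so the terminal never sees the reset sequence;
# the plain "c" that followed it is kept.
safe = strip_control_chars("before\033cafter")
print(safe)
```

Note that category Cc also contains tab, newline, and carriage return, so a real fix would probably want to keep those, or escape the characters (for example with `repr`) instead of dropping them.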
Hi I'm also willing to contribute,
When I tried reading the parquet file with the pl.read_csv method, it actually throws this:
File "/Users/ignaciomoyaredondo/opt/anaconda3/envs/polars_env/lib/python3.11/site-packages/polars/io/csv/functions.py", line 649, in _read_csv_impl
pydf = PyDataFrame.read_csv(
^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ComputeError: could not parse `N%�J�FN%�;��>N%���EN%հN%��KN%�?X�%��pN%�DN%lQ�IN%` as dtype `str` at column 'PAR1ȺȺL�*�l�7N%��W�N%.�"�N%���N%N��N%ļ�&N%��cN%�
N%.[�?7N%��`UN%��VtN%0�)8N%lY�' (column number 1)fN%�&N%PN�6N%�O�&N%�Vai"N%蟌�+N%*��;N%�+��=N%�v�g
The current offset in the file is 364 bytes.
You might want to try:
- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
- specifying correct dtype with the `dtypes` argument
- setting `ignore_errors` to `True`,
- adding `N%�J�FN%�;��>N%���EN%հN%��KN%�?X�%��pN%�DN%lQ�IN%` to the `null_values` list.
Original error: ```invalid utf-8 sequence```
Should we try to implement a method which detects whether the file passed is binary or simple text?
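One possible shape for such a check, as a rough sketch only: Parquet files begin with the 4-byte magic `PAR1` (it also shows up at the start of the column name in the error above), so the first bytes of the file can be inspected before handing it to a CSV parser. The function names below are hypothetical and not part of polars, and the NUL-byte test is only a common heuristic for "binary", not a guarantee.

```python
PARQUET_MAGIC = b"PAR1"  # Parquet files start (and end) with these 4 bytes


def looks_like_parquet(path) -> bool:
    """Hypothetical check: does the file start with the Parquet magic?"""
    with open(path, "rb") as f:
        return f.read(len(PARQUET_MAGIC)) == PARQUET_MAGIC


def looks_binary(path, sample_size: int = 1024) -> bool:
    """Heuristic: treat the file as binary if the sample has a NUL byte
    or is not valid UTF-8. A multi-byte character cut off at the end of
    the sample could cause a false positive; good enough for a sketch."""
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    if b"\x00" in sample:
        return True
    try:
        sample.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False
```

A CSV reader could call something like `looks_like_parquet` first and raise a short, printable error ("this looks like a Parquet file, did you mean read_parquet?") instead of echoing raw bytes to the terminal.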
Checks
Reproducible example
Import a parquet file with pl.read_csv instead
Log output
No response
Issue description
So I made a couple of changes in my code and ran it. The console history suddenly got deleted and all I got was this: No lines where the error occurred or anything, plus the error deleted my whole console history. Luckily I found my own mistake: I was importing a parquet file with pl.read_csv. However, it would be preferable to get a normal error message that shows which line is wrong.
Expected behavior
That I get a normal error message and the console history doesn't get deleted.
Installed versions