pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Multiple - Reading into a single DataFrame - read_csv - Error when using encoding = latin1 #17703

Open dbi-jotho opened 3 months ago

dbi-jotho commented 3 months ago

Reproducible example

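import polars as pl

separator = ';'  # placeholder: the actual delimiter is not shown in this report

df = pl.read_csv('myfiles*.csv', try_parse_dates=True,
                 separator=separator, encoding='latin1',
                 infer_schema_length=50000)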

Log output

(venv) abc@abc:~/dev/src$ python code.py 
file < 128 rows, no statistics determined
no. of chunks: 1 processed by: 1 threads.
file < 128 rows, no statistics determined
no. of chunks: 1 processed by: 1 threads.
file < 128 rows, no statistics determined
no. of chunks: 1 processed by: 1 threads.
file < 128 rows, no statistics determined
no. of chunks: 1 processed by: 1 threads.
Traceback (most recent call last):
  File "code.py", line 180, in <module>
    main()
  File "code.py", line 152, in main
    df = extract_data(file_list, fs_path, row['source'],
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "code.py", line 67, in extract_data
    df = pl.read_csv(fs_path + source + "/" + file_name + '*.'
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/polars/_utils/deprecation.py", line 91, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/polars/_utils/deprecation.py", line 91, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/polars/_utils/deprecation.py", line 91, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/polars/io/csv/functions.py", line 411, in read_csv
    with prepare_file_arg(
         ^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/polars/io/_utils.py", line 242, in prepare_file_arg
    with Path(file).open(encoding=encoding_str) as f:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/pathlib.py", line 1045, in open
    return io.open(self, mode, buffering, encoding, errors, newline)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'myfiles*.csv'

Issue description

read_csv supports reading multiple files via a glob pattern (https://pola-rs.github.io/polars-book/user-guide/io/multiple/). However, when I pass encoding='latin1' instead of utf8, the glob pattern no longer works. When I run it with encoding='utf8' and ignore_errors=True, it works like a charm.

For example:

df = pl.read_csv("myfiles*.csv", try_parse_dates=True, separator=separator, encoding='latin1')

But I get:

Error: No such file or directory (os error 2)

It would be great to have it working with this encoding too.

Expected behavior

This should load multiple CSVs into one dataframe when using latin1 encoding.

Installed versions

```
--------Version info---------
Polars:      1.1.0
Index type:  UInt32
Platform:    Linux-6.1.0-22-amd64-x86_64-with-glibc2.36
Python:      3.11.2 (main, May 2 2024, 11:59:08) [GCC 12.2.0]
```

ritchie46 commented 2 months ago

When reading multiple files, we go into the Rust engine, which doesn't support latin1 encoding. For a single file, this lack of latin1 support is hidden because Python reads and decodes the buffer into memory.

We should see if we can improve the error message.
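
Until that changes, a possible workaround follows from the explanation above: expand the glob in Python, read each file on its own (the single-file path, where Python does the decoding), and concatenate the results. This is only a minimal sketch, not an official Polars recipe. The helper names `expand_glob` and `read_latin1_glob` are hypothetical, and it assumes all matched files share the same columns and inferred dtypes.

```python
import glob


def expand_glob(pattern: str) -> list[str]:
    """Return the files matching `pattern`, sorted for a stable row order."""
    return sorted(glob.glob(pattern))


def read_latin1_glob(pattern: str, **read_csv_kwargs):
    """Read every file matching `pattern` as latin1 and stack the frames.

    Hypothetical helper: each file goes through the single-file code path
    of pl.read_csv, so the glob is never handed to the Rust engine.
    """
    # Imported here so expand_glob stays usable without polars installed.
    import polars as pl

    paths = expand_glob(pattern)
    if not paths:
        raise FileNotFoundError(f"no files match pattern: {pattern!r}")

    frames = [
        pl.read_csv(path, encoding="latin1", **read_csv_kwargs)
        for path in paths
    ]
    # The default vertical concat requires identical schemas in every frame.
    return pl.concat(frames)
```

Usage would look like `df = read_latin1_glob("myfiles*.csv", try_parse_dates=True, separator=separator)`. If schema inference yields slightly different dtypes across files, passing `how="vertical_relaxed"` to `pl.concat` may help, at the cost of upcasting columns.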

dbi-jotho commented 2 months ago

That makes sense! An improved error message would sure help, but it's not pressing. Thank you!