pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Support reading directly from zipfile.Path objects. #16758

Open daviewales opened 1 month ago

daviewales commented 1 month ago

Description

Polars IO functions such as scan_csv and read_csv accept regular pathlib.Path objects as source. It would be nice if they could also read from zipfile.Path objects.

Sample file: test.zip

I would expect the following to work if this feature is implemented:

import polars as pl
import zipfile

with zipfile.ZipFile("test.zip") as zip_file:
    test_zip = zipfile.Path(zip_file)
    dfs = {}
    for test_file in test_zip.iterdir():
        if test_file.name.endswith('.csv'):
            dfs[test_file.name] = pl.read_csv(test_file)

Currently I get the following traceback:

```python
TypeError                                 Traceback (most recent call last)
Cell In[7], line 7
      5 if test_file.name.endswith('.csv'):
      6     print(test_file.name)
----> 7     dfs[test_file.name] = pl.read_csv(test_file)

File ~\.local\pipx\venvs\ipython\lib\site-packages\polars\io\csv\functions.py:364, in read_csv(source, has_header, columns, new_columns, separator, comment_char, quote_char, skip_rows, dtypes, schema, null_values, missing_utf8_is_empty_string, ignore_errors, try_parse_dates, n_threads, infer_schema_length, batch_size, n_rows, encoding, low_memory, rechunk, use_pyarrow, storage_options, skip_rows_after_header, row_count_name, row_count_offset, sample_size, eol_char, raise_if_empty, truncate_ragged_lines)
    352     dtypes = {
    353         new_to_current.get(column_name, column_name): column_dtype
    354         for column_name, column_dtype in dtypes.items()
    355     }
    357 with _prepare_file_arg(
    358     source,
    359     encoding=encoding,
   (...)
    362     **storage_options,
    363 ) as data:
--> 364     df = pl.DataFrame._read_csv(
    365         data,
    366         has_header=has_header,
    367         columns=columns if columns else projection,
    368         separator=separator,
    369         comment_char=comment_char,
    370         quote_char=quote_char,
    371         skip_rows=skip_rows,
    372         dtypes=dtypes,
    373         schema=schema,
    374         null_values=null_values,
    375         missing_utf8_is_empty_string=missing_utf8_is_empty_string,
    376         ignore_errors=ignore_errors,
    377         try_parse_dates=try_parse_dates,
    378         n_threads=n_threads,
    379         infer_schema_length=infer_schema_length,
    380         batch_size=batch_size,
    381         n_rows=n_rows,
    382         encoding=encoding if encoding == "utf8-lossy" else "utf8",
    383         low_memory=low_memory,
    384         rechunk=rechunk,
    385         skip_rows_after_header=skip_rows_after_header,
    386         row_count_name=row_count_name,
    387         row_count_offset=row_count_offset,
    388         sample_size=sample_size,
    389         eol_char=eol_char,
    390         raise_if_empty=raise_if_empty,
    391         truncate_ragged_lines=truncate_ragged_lines,
    392     )
    394 if new_columns:
    395     return _update_columns(df, new_columns)

File ~\.local\pipx\venvs\ipython\lib\site-packages\polars\dataframe\frame.py:769, in DataFrame._read_csv(cls, source, has_header, columns, separator, comment_char, quote_char, skip_rows, dtypes, schema, null_values, missing_utf8_is_empty_string, ignore_errors, try_parse_dates, n_threads, infer_schema_length, batch_size, n_rows, encoding, low_memory, rechunk, skip_rows_after_header, row_count_name, row_count_offset, sample_size, eol_char, raise_if_empty, truncate_ragged_lines)
    762     raise ValueError(
    763         "cannot use glob patterns and integer based projection as `columns` argument"
    764         "\n\nUse columns: List[str]"
    765     )
    767 projection, columns = handle_projection_columns(columns)
--> 769 self._df = PyDataFrame.read_csv(
    770     source,
    771     infer_schema_length,
    772     batch_size,
    773     has_header,
    774     ignore_errors,
    775     n_rows,
    776     skip_rows,
    777     projection,
    778     separator,
    779     rechunk,
    780     columns,
    781     encoding,
    782     n_threads,
    783     path,
    784     dtype_list,
    785     dtype_slice,
    786     low_memory,
    787     comment_char,
    788     quote_char,
    789     processed_null_values,
    790     missing_utf8_is_empty_string,
    791     try_parse_dates,
    792     skip_rows_after_header,
    793     _prepare_row_count_args(row_count_name, row_count_offset),
    794     sample_size=sample_size,
    795     eol_char=eol_char,
    796     raise_if_empty=raise_if_empty,
    797     truncate_ragged_lines=truncate_ragged_lines,
    798     schema=schema,
    799 )
    800 return self

TypeError: Object does not have a .read() method.
```

I am using the following workaround:

with zipfile.ZipFile("test.zip") as zip_file:
    test_zip = zipfile.Path(zip_file)
    dfs = {}
    for test_file in test_zip.iterdir():
        if test_file.name.endswith('.csv'):
            with test_file.open('rb') as file:
                dfs[test_file.name] = pl.read_csv(file)

This works, but displays the following warning:

Polars found a filename. Ensure you pass a path to the file instead of a python file object when possible for best performance.
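A possible variant of the workaround, sketched here under assumptions: read each member fully into bytes with zipfile.Path.read_bytes and hand the bytes to read_csv, which accepts bytes as a source. The helper name read_csv_members is hypothetical, and whether this avoids the warning above has not been verified here. The sketch itself only uses the standard library and builds a tiny in-memory archive for illustration.

```python
import io
import zipfile


def read_csv_members(zip_source):
    """Map each top-level .csv member name to its raw bytes.

    `zip_source` is anything zipfile.ZipFile accepts: a path or a
    binary file-like object. (Hypothetical helper, not a polars API.)
    """
    contents = {}
    with zipfile.ZipFile(zip_source) as archive:
        for entry in zipfile.Path(archive).iterdir():
            if entry.is_file() and entry.name.endswith(".csv"):
                # read_bytes() loads the whole decompressed member into memory.
                contents[entry.name] = entry.read_bytes()
    return contents


# Illustration with an in-memory archive (stands in for test.zip).
demo_buffer = io.BytesIO()
with zipfile.ZipFile(demo_buffer, "w") as demo_archive:
    demo_archive.writestr("a.csv", "x,y\n1,2\n")
    demo_archive.writestr("notes.txt", "not a csv")

csv_bytes = read_csv_members(demo_buffer)
print(sorted(csv_bytes))
```

Each value in the returned dict could then be passed as pl.read_csv(csv_bytes[name]). Like the file-object workaround, this holds the whole decompressed member in memory.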

Also, the workaround only works for read_csv. It does not work for scan_csv, which raises the following traceback (probably because read_csv supports IO sources, but scan_csv doesn't):

```python
TypeError                                 Traceback (most recent call last)
Cell In[3], line 7
      5 if test_file.name.endswith('.csv'):
      6     with test_file.open('rb') as file:
----> 7         dfs[test_file.name] = pl.scan_csv(file)

File ~\.local\pipx\venvs\ipython\lib\site-packages\polars\io\csv\functions.py:898, in scan_csv(source, has_header, separator, comment_char, quote_char, skip_rows, dtypes, schema, null_values, missing_utf8_is_empty_string, ignore_errors, cache, with_column_names, infer_schema_length, n_rows, encoding, low_memory, rechunk, skip_rows_after_header, row_count_name, row_count_offset, try_parse_dates, eol_char, new_columns, raise_if_empty, truncate_ragged_lines)
    895 if isinstance(source, (str, Path)):
    896     source = normalize_filepath(source)
--> 898 return pl.LazyFrame._scan_csv(
    899     source,
    900     has_header=has_header,
    901     separator=separator,
    902     comment_char=comment_char,
    903     quote_char=quote_char,
    904     skip_rows=skip_rows,
    905     dtypes=dtypes,  # type: ignore[arg-type]
    906     schema=schema,
    907     null_values=null_values,
    908     missing_utf8_is_empty_string=missing_utf8_is_empty_string,
    909     ignore_errors=ignore_errors,
    910     cache=cache,
    911     with_column_names=with_column_names,
    912     infer_schema_length=infer_schema_length,
    913     n_rows=n_rows,
    914     low_memory=low_memory,
    915     rechunk=rechunk,
    916     skip_rows_after_header=skip_rows_after_header,
    917     encoding=encoding,
    918     row_count_name=row_count_name,
    919     row_count_offset=row_count_offset,
    920     try_parse_dates=try_parse_dates,
    921     eol_char=eol_char,
    922     raise_if_empty=raise_if_empty,
    923     truncate_ragged_lines=truncate_ragged_lines,
    924 )

File ~\.local\pipx\venvs\ipython\lib\site-packages\polars\lazyframe\frame.py:359, in LazyFrame._scan_csv(cls, source, has_header, separator, comment_char, quote_char, skip_rows, dtypes, schema, null_values, missing_utf8_is_empty_string, ignore_errors, cache, with_column_names, infer_schema_length, n_rows, encoding, low_memory, rechunk, skip_rows_after_header, row_count_name, row_count_offset, try_parse_dates, eol_char, raise_if_empty, truncate_ragged_lines)
    356 processed_null_values = _process_null_values(null_values)
    358 self = cls.__new__(cls)
--> 359 self._ldf = PyLazyFrame.new_from_csv(
    360     source,
    361     separator,
    362     has_header,
    363     ignore_errors,
    364     skip_rows,
    365     n_rows,
    366     cache,
    367     dtype_list,
    368     low_memory,
    369     comment_char,
    370     quote_char,
    371     processed_null_values,
    372     missing_utf8_is_empty_string,
    373     infer_schema_length,
    374     with_column_names,
    375     rechunk,
    376     skip_rows_after_header,
    377     encoding,
    378     _prepare_row_count_args(row_count_name, row_count_offset),
    379     try_parse_dates,
    380     eol_char=eol_char,
    381     raise_if_empty=raise_if_empty,
    382     truncate_ragged_lines=truncate_ragged_lines,
    383     schema=schema,
    384 )
    385 return self

TypeError: argument 'path': 'ZipExtFile' object cannot be converted to 'PyString'
```

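Since the error above shows scan_csv trying to convert its argument to a path string, one way around it (a sketch, not something polars provides) is to extract the CSV members to a real directory first and scan the extracted files. The helper name extract_csv_members is hypothetical and uses only the standard library.

```python
import zipfile
from pathlib import Path


def extract_csv_members(zip_source, dest_dir):
    """Extract every .csv member of the archive into dest_dir.

    Returns a dict mapping member name to the extracted file's Path.
    (Hypothetical helper, not a polars API.)
    """
    extracted = {}
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            if info.is_dir() or not info.filename.endswith(".csv"):
                continue
            # ZipFile.extract returns the path it actually wrote to.
            extracted[info.filename] = Path(archive.extract(info, dest_dir))
    return extracted
```

Each returned path could then be passed to pl.scan_csv. Because scanning is lazy, the destination directory (for example one created with tempfile.TemporaryDirectory) has to stay on disk until the resulting LazyFrame has been collected.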
cjackal commented 3 weeks ago

While polars supports the standard-library pathlib.Path as a source, what actually happens is that the path string is read from the input Path (in _prepare_file_arg, visible in the traceback) and passed to the Rust backend.
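To illustrate the difference, here is a small standard-library sketch. It uses os.fspath as a stand-in for "can be turned into a filesystem path string"; polars' own check may well differ, and the helper name filesystem_path_or_none is hypothetical. pathlib.Path implements the os.PathLike protocol, while zipfile.Path does not.

```python
import io
import os
import pathlib
import zipfile


def filesystem_path_or_none(obj):
    """Return the filesystem path string for obj, or None if it has none."""
    try:
        return os.fspath(obj)
    except TypeError:
        # Raised when obj is not str/bytes and does not implement __fspath__.
        return None


# Build a small in-memory archive so the example needs no files on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("a.csv", "x,y\n1,2\n")

regular = pathlib.Path("data") / "a.csv"
member = zipfile.Path(zipfile.ZipFile(buffer), "a.csv")

print(filesystem_path_or_none(regular))  # a plain path string
print(filesystem_path_or_none(member))   # None: zipfile.Path is not os.PathLike
```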

polars does not yet support arbitrary Python path-like objects such as zipfile.Path. Frankly, a "path object" is not a well-defined concept in Python, so polars cannot reliably check whether something is one. I think your workaround is indeed the best solution for now. In any case, the current internal mechanism of polars requires reading the file content into memory on the Python side, so the workaround carries no performance penalty relative to what built-in zipfile.Path support could offer.

daviewales commented 3 weeks ago

Thanks. I guess any future improvement would depend on the Rust backend gaining support for reading or scanning files inside zip archives? A special case could then be added to the Python library to extract the required information from zipfile.Path and pass it to the Rust backend?
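As a rough sketch of what "extracting the required information" might look like on the Python side: a zipfile.Path instance keeps the ZipFile it was built from and the member name as the attributes root and at. These mirror the constructor arguments but are not documented as a stable interface, so relying on them is an assumption. The names ZipMemberLocation and locate are hypothetical, and nothing here is an existing polars or Rust-side API.

```python
import zipfile
from typing import NamedTuple, Optional


class ZipMemberLocation(NamedTuple):
    archive: Optional[str]  # filesystem path of the .zip, None if in-memory
    member: str             # name of the entry inside the archive


def locate(zip_path_obj: zipfile.Path) -> ZipMemberLocation:
    """Split a zipfile.Path into (archive path, member name).

    Assumption: `root` (the ZipFile) and `at` (the member name) are
    available as instance attributes, as in CPython's implementation.
    """
    return ZipMemberLocation(zip_path_obj.root.filename, zip_path_obj.at)
```

A backend that could open zip archives itself would need roughly this pair of values. An archive opened from an in-memory buffer has no filename, so that case would still have to fall back to reading bytes in Python.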