pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Scan zipped files #9601

Open gab23r opened 1 year ago

gab23r commented 1 year ago

Problem description

I wish I could use Polars to scan zipped CSV (and possibly other) files.

This example works with read_csv but fails with scan_csv:

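import os
import shutil
import zipfile

import pandas as pd
import polars as pl

df = pd.DataFrame({'col': [126.3263, 45.23874]})

# create a zip archive containing tmp.csv
os.makedirs('tmp', exist_ok=True)
df.to_csv('./tmp/tmp.csv')
shutil.make_archive('myzip', 'zip', 'tmp')

# try to read the zipped file
with zipfile.ZipFile('myzip.zip') as zipFile:
    # pl.read_csv accepts these bytes; pl.scan_csv fails here
    df = pl.scan_csv(zipFile.read('tmp.csv'))

sm-Fifteen commented 4 months ago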

scan_csv needs to receive a path, whereas reading from a zip archive means handing Polars a file handle (or bytes) for a member inside the archive, because a zip can contain more than one file. Even for files that do have an unambiguous path, such as csv.gz and csv.xz, scan_csv refuses to read them and asks you to use read_csv instead (see https://github.com/pola-rs/polars/issues/7287).
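
To illustrate the ambiguity point, here is a minimal sketch using only the standard library: it opens the archive, picks one member explicitly, and returns its bytes. The helper name `read_zip_member` and the archive contents are hypothetical and chosen only for illustration; this is not something Polars provides.

```python
import io
import zipfile


def read_zip_member(archive, member=None):
    """Return the raw bytes of one member of a zip archive.

    `archive` is a path or a binary file-like object. If `member` is None,
    the archive must contain exactly one file; otherwise the choice would
    be ambiguous and a ValueError is raised.
    """
    with zipfile.ZipFile(archive) as zf:
        # Skip directory entries; only real files are candidates.
        names = [info.filename for info in zf.infolist() if not info.is_dir()]
        if member is None:
            if len(names) != 1:
                raise ValueError(f"expected exactly one file, found {names}")
            member = names[0]
        return zf.read(member)


# Build a small in-memory archive to demonstrate (hypothetical contents).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("tmp.csv", "col\n126.3263\n45.23874\n")
buffer.seek(0)

csv_bytes = read_zip_member(buffer)
```

The resulting bytes can be passed to `pl.read_csv(csv_bytes)`, and `.lazy()` can be called on the result if a LazyFrame is needed. This still reads the whole file eagerly, so it is a workaround rather than a true scan.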

neverlink commented 3 weeks ago

read_csv can read single compressed files just fine. But when globbing, scan_csv gets called, which causes it to give up. Not sure why this doesn't work in the current implementation.
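
One possible workaround for the globbing case, sketched with only the standard library, is to decompress the matching files into a scratch directory and glob over the plain CSVs instead. The function name `decompress_gz_glob` and all paths below are hypothetical; the sketch assumes gzip-compressed inputs and enough disk space for the decompressed copies.

```python
import glob
import gzip
import shutil
import tempfile
from pathlib import Path


def decompress_gz_glob(pattern, out_dir):
    """Decompress every file matching `pattern` (e.g. 'data/*.csv.gz')
    into `out_dir`, dropping the '.gz' suffix. Returns the new paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(glob.glob(pattern)):
        dst = out_dir / Path(src).with_suffix("").name  # 'a.csv.gz' -> 'a.csv'
        with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
            shutil.copyfileobj(fin, fout)  # stream in chunks, not all at once
        written.append(dst)
    return written


# Demonstration with two tiny gzip files in a temporary directory.
work = Path(tempfile.mkdtemp())
for name in ("a", "b"):
    with gzip.open(work / f"{name}.csv.gz", "wb") as f:
        f.write(b"col\n1\n")

plain = decompress_gz_glob(str(work / "*.csv.gz"), work / "plain")
```

After this, `pl.scan_csv(str(work / "plain" / "*.csv"))` can glob over the decompressed copies lazily. The trade-off is extra disk usage and a one-off decompression pass.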