pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Cannot properly read `csv.gz` file in Google Storage bucket #9475

Open sndpgm opened 1 year ago

sndpgm commented 1 year ago

Issue description

A csv.gz file in a Google Storage (GS) bucket cannot be read properly with pl.read_csv. The compressed bytes appear to be parsed as text, so the result is garbled:

shape: (0, 1)
┌──────────────────────────────┐
│ �6�d�test_a1.csvJ�I�I�J�1ԩ… │
│ ---                          │
│ str                          │
╞══════════════════════════════╡
└──────────────────────────────┘

Reproducible example

>>> import polars as pl
>>> df_ng = pl.read_csv("gs://my_bucket/path/to/test_a1.csv.gz")
>>> df_ng
shape: (0, 1)
┌──────────────────────────────┐
│ �6�d�test_a1.csvJ�I�I�J�1ԩ… │
│ ---                          │
│ str                          │
╞══════════════════════════════╡
└──────────────────────────────┘

Expected behavior

The expected result is the same as when reading the file from a local PC (the file is linked below):

# reading the same csv.gz from the local PC works without problems
>>> df_local = pl.read_csv("/my_pc/path/to/test_a1.csv.gz")
>>> df_local
shape: (2, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ a   ┆ 1   ┆ x   │
│ a   ┆ 1   ┆ y   │
└─────┴─────┴─────┘

test_a1.csv.gz

Installed versions

```
--------Version info---------
Polars: 0.18.3
Index type: UInt32
Platform: macOS-13.4-arm64-arm-64bit
Python: 3.10.11 (main, May 17 2023, 14:30:36) [Clang 14.0.6 ]
----Optional dependencies----
numpy: 1.25.0
pandas: 2.0.2
pyarrow: 12.0.1
connectorx:
deltalake:
fsspec: 2023.6.0
matplotlib:
xlsx2csv:
xlsxwriter:
```
sndpgm commented 1 year ago

A similar result occurs with AWS S3.

>>> import polars as pl
>>> df_ng = pl.read_csv("s3://my_bucket/path/to/test_a1.csv.gz")
>>> df_ng
shape: (0, 1)
┌─────────────────────────────┐
│ <�d�test_a1.csvJ�I�I�J�1ԩ… │
│ ---                         │
│ str                         │
╞═════════════════════════════╡
└─────────────────────────────┘
ritchie46 commented 1 year ago

First decompress the file.
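A minimal sketch of that workaround, assuming the object is fetched with fsspec (which needs gcsfs for gs:// URLs or s3fs for s3:// URLs) and is small enough to hold in memory. The helper names `maybe_gunzip` and `read_remote_csv_gz` are hypothetical, not part of Polars:

```python
import gzip
import io

# The first two bytes of every gzip stream (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def maybe_gunzip(raw: bytes) -> bytes:
    """Return decompressed bytes if `raw` looks like gzip, else `raw` unchanged."""
    if raw[:2] == GZIP_MAGIC:
        return gzip.decompress(raw)
    return raw


def read_remote_csv_gz(url: str):
    """Hypothetical helper: download, decompress, then parse with Polars."""
    # Imported lazily so the decompression helper works without these packages.
    import fsspec
    import polars as pl

    # Read the raw (still compressed) bytes from the bucket.
    with fsspec.open(url, mode="rb") as f:
        raw = f.read()

    # Hand Polars plain CSV bytes rather than the gzip stream.
    return pl.read_csv(io.BytesIO(maybe_gunzip(raw)))
```

With this in place, `read_remote_csv_gz("gs://my_bucket/path/to/test_a1.csv.gz")` would be expected to return the same frame as the local read, provided credentials for the bucket are configured.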

sndpgm commented 1 year ago

Does pl.read_csv not support compressed files?

SridharCR commented 1 year ago

It looks like pl.read_csv doesn't support gz files read from cloud storage. It is probably an issue with the metadata.

I'm happy to look into it and work on this item.

29antonioac commented 5 months ago

I'm actually able to read csv.gz files both locally and from AWS. Has this issue been resolved?