pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Polars does not allow reading a dataframe from piped csv output of another process #17927

Closed tanhevg closed 1 month ago

tanhevg commented 1 month ago

Checks

Reproducible example

import polars as pl
import subprocess

cmd_str = "echo 'foo,bar,baz\n1,2,3\n4,5,6'"
proc = subprocess.Popen(cmd_str, shell=True,
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE, stdin=subprocess.PIPE,
                        text=True, encoding='utf-8')
df = pl.read_csv(proc.stdout)

Log output

/var/folders/_k/jsm1dn4d1fbfqvz8tx7gq5gw0000gq/T/ipykernel_40236/848236565.py:6: UserWarning: Polars found a filename. Ensure you pass a path to the file instead of a python file object when possible for best performance.
  df = pl.read_csv(proc.stdout)

UnsupportedOperation: underlying stream is not seekable
....

Issue description

Polars does not allow reading a dataframe from the piped CSV output of another process; an `UnsupportedOperation: underlying stream is not seekable` error is thrown. The issue came up when migrating a bioinformatics pipeline from pandas, while processing TSV output from an alignment search tool. For now the workaround is to write to a temporary file, but reading from the pipe directly, without the temporary file, would be much less cumbersome.

Expected behavior

I would expect a dataframe to be created, e.g. with pandas:

    foo bar baz
0   1   2   3
1   4   5   6

Installed versions

Also tested on MacOS, with the same results.

```
--------Version info---------
Polars:              1.3.0
Index type:          UInt32
Platform:            Linux-4.18.0-348.23.1.el8_5.x86_64-x86_64-with-glibc2.28
Python:              3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0]
----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fastexcel:
fsspec:
gevent:
great_tables:
hvplot:
matplotlib:
nest_asyncio:
numpy:
openpyxl:
pandas:
pyarrow:
pydantic:
pyiceberg:
sqlalchemy:
torch:
xlsx2csv:
xlsxwriter:
```
ritchie46 commented 1 month ago

Not really a bug. We never supported this. I think we could support it by collecting the stream first.

deanm0000 commented 1 month ago

If you read the stream and wrap the result in `StringIO`, it works:

import subprocess
from io import StringIO

import polars as pl

cmd_str = "echo 'foo,bar,baz\n1,2,3\n4,5,6'"
proc = subprocess.Popen(cmd_str, shell=True,
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE, stdin=subprocess.PIPE,
                        text=True, encoding='utf-8')
# Read the whole stream into memory; StringIO is seekable, so read_csv accepts it.
df = pl.read_csv(StringIO(proc.stdout.read()))