pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Support BytesIO, StringIO etc. in scan_csv() #12617

Closed Wainberg closed 9 months ago

Wainberg commented 12 months ago

Description

pl.read_csv(StringIO('a\n1')) works
pl.read_csv(BytesIO(b'a\n1')) works

But:

  1. pl.scan_csv(StringIO('a\n1')).collect() gives:
    polars.exceptions.ComputeError: error while reading a
    : No such file or directory (os error 2): a
  2. pl.scan_csv(BytesIO(b'a\n1')).collect() gives:
    TypeError: argument 'paths': 'bytes' object cannot be converted to 'PyString'
  3. pl.scan_csv([StringIO('a\n1'), StringIO('a\n2')]).collect() gives:
    TypeError: expected str, bytes or os.PathLike object, not StringIO
  4. pl.scan_csv([BytesIO(b'a\n1'), BytesIO(b'a\n2')]).collect() gives:
    TypeError: expected str, bytes or os.PathLike object, not BytesIO

deanm0000 commented 12 months ago

I don't think this is possible, or at least it's far from trivial. The scanner on the Rust side needs access to the file. BytesIO objects aren't portable, so Python can't simply hand one over to the Rust engine. Scanning would need round trips between Rust and Python. Maybe I'm mistaken about the architecture, but that's my take.

That said, I'm curious what the use case is.

If you've got the memory to hold the CSV in a BytesIO, why not just read it in? If you want it to be lazy, you can do read_csv().lazy(). If you're going to be streaming, Polars silently writes temp files anyway, so you might as well save the in-memory CSV to a temp file. To that end, I think the better-performing option is read_csv().write_parquet() followed by scan_parquet().
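A minimal sketch of the two workarounds described above, assuming Polars is installed. The helper names (`spill_to_temp_csv`, `lazy_from_buffer`, `scan_from_buffer`) are hypothetical and not part of Polars. Only `pl.read_csv()`, `.lazy()` and `pl.scan_csv()` are real Polars calls.

```python
import io
import os
import tempfile
from pathlib import Path
from typing import Union


def spill_to_temp_csv(buf: Union[io.BytesIO, io.StringIO]) -> Path:
    """Write an in-memory CSV buffer to a temporary file and return its path.

    Hypothetical helper, not a Polars API. The caller must delete the file.
    """
    data = buf.getvalue()
    if isinstance(data, str):
        # StringIO holds text; encode it so the file is written as bytes.
        data = data.encode("utf-8")
    fd, name = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(data)
    return Path(name)


def lazy_from_buffer(buf: Union[io.BytesIO, io.StringIO]):
    """Option 1: read eagerly, then switch to the lazy API."""
    import polars as pl  # local import keeps the helper above stdlib-only

    return pl.read_csv(buf).lazy()


def scan_from_buffer(buf: Union[io.BytesIO, io.StringIO]):
    """Option 2: spill to disk so scan_csv() receives a real file path."""
    import polars as pl

    return pl.scan_csv(str(spill_to_temp_csv(buf)))
```

Option 1 keeps everything in memory but parses the whole CSV up front. Option 2 lets the scanner read from disk, at the cost of writing the file and cleaning it up afterwards.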

All in all, I suspect this will be closed as not planned, but that's not my call.

Wainberg commented 12 months ago

I think it's more about unifying the behavior of read_csv() and scan_csv() as much as possible, more than any specific use case.

At a minimum, 1 and 2 should give comprehensible error messages.
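As an illustration of what a comprehensible error could look like, here is a rough sketch of an up-front type check. `check_scan_source` is a hypothetical function, not something Polars provides, and the message wording is only a suggestion.

```python
import io
import os


def check_scan_source(source):
    """Reject in-memory buffers with a readable message (hypothetical pre-check).

    Accepts a single path or a list/tuple of paths and returns it unchanged.
    """
    items = source if isinstance(source, (list, tuple)) else [source]
    for item in items:
        if isinstance(item, (io.StringIO, io.BytesIO)):
            raise TypeError(
                f"scan_csv() expects a file path, got {type(item).__name__}; "
                "for in-memory buffers use read_csv(...).lazy()"
            )
        if not isinstance(item, (str, os.PathLike)):
            raise TypeError(
                f"scan_csv() expects a file path, got {type(item).__name__}"
            )
    return source
```

A wrapper could call `check_scan_source(source)` before handing `source` to `pl.scan_csv()`, so that all four cases in the description fail with the same message.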

Wainberg commented 12 months ago

Actually, this has been suggested before, with an actual use case: https://github.com/pola-rs/polars/issues/4950

stinodego commented 9 months ago

Duplicate of https://github.com/pola-rs/polars/issues/4950