Wainberg closed this 9 months ago
I don't think this is possible, or at least it's very non-trivial. The scanner on the Rust side needs access to the file. BytesIO objects aren't portable, so Python can't simply hand one over to the Rust engine; scanning would need a round trip between Rust and Python. Maybe I'm mistaken about the architecture, but that's my take.
That said, I'm curious what the use case is.
If you've got the memory to hold the CSV in a BytesIO, why not just read it in? If you want it to be lazy, you can do read_csv().lazy(). If you're going to be streaming, Polars silently writes temp files anyway, so you might as well save the in-memory CSV to a temp file. To that end, I think the better-performing option is read_csv().write_parquet() followed by scan_parquet().
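For what it's worth, here is a rough sketch of the temp-file workaround described above. The helper name spill_to_temp_csv is made up for illustration and is not part of Polars; the Polars calls are guarded so the stdlib part runs even when Polars isn't installed, and the Parquet step assumes the CSV fits in memory.

```python
import io
import shutil
import tempfile
from pathlib import Path


def spill_to_temp_csv(buf):
    """Copy an in-memory CSV buffer (BytesIO or StringIO) to a temp file.

    Hypothetical helper, not a Polars API. The caller is responsible for
    deleting the returned path when done.
    """
    buf.seek(0)
    if isinstance(buf, io.StringIO):
        # Text buffer: write as UTF-8 and keep newlines untouched.
        tmp = tempfile.NamedTemporaryFile(
            "w", suffix=".csv", delete=False, encoding="utf-8", newline=""
        )
    else:
        # Bytes buffer: copy the raw bytes.
        tmp = tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False)
    with tmp:
        shutil.copyfileobj(buf, tmp)
    return Path(tmp.name)


if __name__ == "__main__":
    csv_path = spill_to_temp_csv(io.BytesIO(b"a\n1"))
    try:
        import polars as pl
    except ImportError:
        pl = None  # Polars not available; only the temp file is created.

    if pl is not None:
        # Option 1: scan the temp file lazily.
        print(pl.scan_csv(csv_path).collect())

        # Option 2: convert once to Parquet, then scan that instead.
        parquet_path = csv_path.with_suffix(".parquet")
        pl.read_csv(csv_path).write_parquet(parquet_path)
        print(pl.scan_parquet(parquet_path).collect())
        parquet_path.unlink()

    csv_path.unlink()
```

If the data is small, pl.read_csv(buf).lazy() avoids the temp file entirely; the sketch only matters when you specifically want a file-backed scan.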
All in all, I suspect this is going to be closed as not planned, but that's not my call.
I think it's about unifying the behavior of read_csv() and scan_csv() as much as possible, rather than any specific use case.
At a minimum, 1 and 2 should give comprehensible error messages.
Actually, this has been suggested before, with an actual use case: https://github.com/pola-rs/polars/issues/4950
Duplicate of https://github.com/pola-rs/polars/issues/4950
Description
pl.read_csv(StringIO('a\n1')) works
pl.read_csv(BytesIO(b'a\n1')) works
But:
pl.scan_csv(StringIO('a\n1')).collect() gives:
pl.scan_csv(BytesIO(b'a\n1')).collect() gives:
pl.scan_csv([StringIO('a\n1'), StringIO('a\n2')]).collect() gives:
pl.scan_csv([BytesIO(b'a\n1'), BytesIO(b'a\n2')]).collect() gives: