pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Polars read_ipc does not support all buffer protocol file-like objects #16194

Open broper2 opened 6 months ago

broper2 commented 6 months ago

Reproducible example

import io
import polars as pl

class ByteArrayBufferedReader:
  def __init__(self, reader: io.BufferedReader):
    self._reader = reader

  @staticmethod
  def from_buffered_reader(reader: io.BufferedReader):
    return ByteArrayBufferedReader(reader)

  def read(self, *args, **kwargs):
    _bytes = self._reader.read(*args, **kwargs)
    return bytearray(_bytes)

  def __getattr__(self, attr):
    return getattr(self._reader, attr)

with open("/tmp/data.ipc", "rb") as f:
  bytearray_reader = ByteArrayBufferedReader.from_buffered_reader(f)
  pl.read_ipc(bytearray_reader)

Log output

thread '<unnamed>' panicked at py-polars/src/file.rs:108:18:
Expecting to be able to downcast into bytes from read result.: DowncastError { from: bytearray(b'\xfc\x00\x00\x00ARROW1'), to: "PyBytes" }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
...
PanicException: Expecting to be able to downcast into bytes from read result.: DowncastError { from: bytearray(b'\xfc\x00\x00\x00ARROW1'), to: "PyBytes" }

Issue description

I am trying to pass a custom implementation of Python's file-like object to Polars' read_ipc. Its read method returns bytearray, not bytes. Polars does not seem to be compatible with this return type, as it immediately tries to downcast the result to pyo3::types::PyBytes. However, it seems to me that returning bytearray from read is in line with Python's buffer protocol (see https://docs.python.org/3/glossary.html#term-binary-file and https://docs.python.org/3/glossary.html#term-bytes-like-object).
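The distinction behind the panic can be illustrated in plain Python, independent of Polars: a bytearray is a bytes-like object (it exports the buffer protocol), but it is not an instance of bytes, so a strict downcast to bytes rejects it. A minimal sketch, with variable names chosen only for illustration:

```python
data_bytes = b"ARROW1"
data_bytearray = bytearray(data_bytes)

# Both types export the buffer protocol, so memoryview accepts either one.
view_from_bytes = memoryview(data_bytes)
view_from_bytearray = memoryview(data_bytearray)
same_content = view_from_bytes.tobytes() == view_from_bytearray.tobytes()

# bytearray is not a subclass of bytes, so a strict type check fails.
is_bytes_instance = isinstance(data_bytearray, bytes)

print(same_content)       # True
print(is_bytes_instance)  # False
```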

Expected behavior

I would expect Polars to fully support Python's file-like buffer protocol.

Installed versions

```
--------Version info---------
Polars:      0.20.23
Index type:  UInt32
Platform:    Linux-3.10.0-1127.19.1.el7.x86_64-x86_64-with-glibc2.2.5
Python:      3.8.13 (default, Aug 2 2022, 18:34:23) [Clang 14.0.3 ]
```

broper2 commented 6 months ago

Hi, any update here? Appreciate any insight on this, thanks

cjackal commented 5 months ago

Not a positive comment, but as the type hint for read_ipc says (source: IO[bytes]), read_ipc explicitly expects the return value of .read() to be immutable bytes rather than a mutable bytearray. Because of the safety issues (unsafe would have to be introduced in various code blocks, ...), supporting this usually takes nontrivial work and design decisions in Rust (compared to a duck-typing-oriented language like Python), I think.
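Given that read_ipc expects .read() to return bytes, one possible caller-side workaround is to wrap the reader so that every read result is copied into immutable bytes. The sketch below is an assumption, not something confirmed in this thread: `BytesCoercingReader` and `BytearrayReader` are hypothetical names, and the demo uses an in-memory stream rather than a real IPC file.

```python
import io


class BytesCoercingReader:
    """Hypothetical wrapper: delegates to an inner file-like object but
    makes read() return immutable bytes, whatever the inner reader returns."""

    def __init__(self, inner):
        self._inner = inner

    def read(self, *args, **kwargs):
        # bytes(...) copies any buffer-protocol object (bytearray, memoryview).
        return bytes(self._inner.read(*args, **kwargs))

    def __getattr__(self, attr):
        # Delegate seek, tell, close, etc. to the wrapped reader.
        return getattr(self._inner, attr)


class BytearrayReader:
    """Stand-in for the reader in the report: read() returns bytearray."""

    def __init__(self, inner):
        self._inner = inner

    def read(self, *args, **kwargs):
        return bytearray(self._inner.read(*args, **kwargs))

    def __getattr__(self, attr):
        return getattr(self._inner, attr)


raw = io.BytesIO(b"ARROW1\x00\x00payload")
safe_reader = BytesCoercingReader(BytearrayReader(raw))

chunk = safe_reader.read(4)
print(type(chunk).__name__)
```

Passing such a wrapper to pl.read_ipc in place of the original reader would be expected to satisfy the downcast to PyBytes, at the cost of one extra copy per read; this has not been verified against Polars 0.20.23.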