jorgecarleitao / arrow2

Transmute-free Rust library to work with the Arrow format
Apache License 2.0

Allowing null buffers #1524

Closed anjakefala closed 11 months ago

anjakefala commented 1 year ago

Hello! =)

The spec for the C Data Interface allows buffer pointers to be null in two situations, and Arrow C++ does create such null buffers:

- for the null bitmap buffer, if ArrowArray.null_count is 0;
- for any buffer, if the size in bytes of the corresponding buffer would be 0.

The latter case has been bumped into by a couple of people independently:

Writing an empty, numerically typed table to a Feather v1 file and reading it back with memory mapping on:

import polars as pl
import pyarrow as pa
import pyarrow.feather  # explicit submodule import so that pa.feather resolves

t = pa.table([pa.array([], type=pa.uint64())], names=['a'])

pa.feather.write_feather(t, 'out.ipc', version=1)

tbl = pa.feather.read_table('out.ipc', memory_map=True)
print(tbl['a'].chunks[0].buffers())
#  [None, <pyarrow.Buffer address=0x0 size=0 is_cpu=True is_mutable=False>]  <---- Note the 0x0 address

Writing and reading a table through IPC:

import pyarrow as pa

def pass_table_through_ipc_and_back(tbl: pa.Table) -> pa.Table:
    sink = pa.BufferOutputStream()

    with pa.ipc.new_file(sink, tbl.schema) as writer:
        writer.write_table(tbl)

    buf = sink.getvalue()

    with pa.ipc.open_file(buf) as reader:
        tbl_out = reader.read_all()

    return tbl_out

tbl = pa.Table.from_pylist([{'a':[]}], schema=pa.schema([pa.field('a', pa.large_list(pa.int64()))]))
tbl_out = pass_table_through_ipc_and_back(tbl)

print(tbl['a'].chunks[0].buffers())
#  [
#    None, 
#    <pyarrow.Buffer address=0x54868020080 size=16 is_cpu=True is_mutable=True>, 
#    None, 
#    <pyarrow.Buffer address=0x1040cd100 size=0 is_cpu=True is_mutable=True>
#  ]
print(tbl_out['a'].chunks[0].buffers())
#  [
#    None, 
#    <pyarrow.Buffer address=0x54868050178 size=16 is_cpu=True is_mutable=False>, 
#    None, 
#    <pyarrow.Buffer address=0x0 size=0 is_cpu=True is_mutable=False>  <---- Note the 0x0 address
#  ]
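To make the connection to the spec explicit, here is a small sketch of the two situations quoted above written as a predicate. The function name and parameters are hypothetical, chosen only for illustration. This is not code from pyarrow or arrow2.

```python
def null_pointer_allowed(is_null_bitmap: bool, null_count: int, size_in_bytes: int) -> bool:
    """Return True if the quoted spec wording permits a null pointer for this buffer.

    Hypothetical helper: it mirrors the two situations from the spec text only.
    """
    # Situation 1: the null bitmap buffer may be null when there are no nulls.
    # Only an exact count of 0 qualifies; an unknown count (-1) does not.
    if is_null_bitmap and null_count == 0:
        return True
    # Situation 2: any buffer may be null when it would hold 0 bytes.
    return size_in_bytes == 0


# The empty uint64 column: the data buffer has size 0, so address 0x0 is allowed.
print(null_pointer_allowed(is_null_bitmap=False, null_count=0, size_in_bytes=0))
# The 16-byte offsets buffer of the large_list column must not be null.
print(null_pointer_allowed(is_null_bitmap=False, null_count=0, size_in_bytes=16))
```

Both buffers flagged in the outputs above (size=0 with address=0x0) fall under the second situation.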

If you load such a table with polars, which depends on arrow2, you get an OutOfSpec error.

From my perspective, considering that null buffers are allowed in Arrow C++, I would propose removing arrow2's requirement that buffers be non-null. My argument is that it is too strict a requirement: people run into situations like this and have to find workarounds for writing and reading empty tables when using software that depends on arrow2.
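As a rough illustration of what a relaxed check could look like on the consuming side, here is a Python sketch using ctypes. It is an assumption about the general shape of such a check, not arrow2's actual Rust code, and `read_buffer` is a hypothetical name. The idea is that a zero-length buffer is never dereferenced, so its address does not matter.

```python
import ctypes


def read_buffer(address: int, size_in_bytes: int) -> bytes:
    """Copy size_in_bytes bytes starting at address (hypothetical helper).

    A null address is accepted only when the buffer is empty.
    """
    if size_in_bytes == 0:
        # Nothing to read: return an empty buffer without touching the pointer.
        return b""
    if address == 0:
        # A non-empty buffer with a null pointer is still an error.
        raise ValueError("null pointer for a non-empty buffer")
    return ctypes.string_at(address, size_in_bytes)


# An empty buffer at address 0x0, as in the outputs above, reads as empty bytes.
empty = read_buffer(0, 0)

# A real 4-byte buffer is copied as usual.
raw = (ctypes.c_ubyte * 4)(1, 2, 3, 4)
data = read_buffer(ctypes.addressof(raw), 4)
```

With this shape, the strictness is kept where it protects against invalid reads (non-empty buffers) and dropped where no read happens at all.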

Why was the requirement added? Does requiring non-null buffers serve an important purpose (i.e. would something non-ideal happen further down the line without it)? Is it possible to remove it?

ritchie46 commented 12 months ago

Hi @anjakefala. It may be that we are too strict here. I will have to take a closer look later to see what we must or can do in such a case.

anjakefala commented 12 months ago

Thanks so much for looking into it @ritchie46!

ritchie46 commented 12 months ago

#1528 allows null buffers. In the example above, though, it is not yet very useful, as we don't support Feather v1.

anjakefala commented 11 months ago

Hi @ritchie46!

I just did a local installation of polars using https://github.com/jorgecarleitao/arrow2/commit/92050ec64877fe1348116e0f5dc6e06b949c0519 as the arrow2 revision, and that indeed resolved this issue for me. =) I appreciate it!

Do you happen to have any timelines for when this fix will formally appear in polars?

anjakefala commented 11 months ago

I just saw that the main branch of polars has had its arrow2 dependency updated. :blush: Thank you!