Closed bkmartinjr closed 5 months ago
@johnkerl - I think this is also a C++ issue (not just Python) as a full-fledged fix is going to require the C++ (write path) code to natively process chunked arrays in a zero-copy manner.
OK @bkmartinjr since the write path is currently not C++, we'll probably need two issues -- one near-term bugfix for the status quo implementation, and a longer-term reminder/tracking task to make sure the C++ code gets that fix
I don't think a Python fix is actually possible (unless you consider a "warning" to be a fix). The upstream user has to change the type of the affected arrays (`string` -> `large_string`) until such time as the underlying TileDB write can handle chunked arrays natively.
Dup of #2462
If `DataFrame.write()` receives an Arrow Table containing a chunked array, it will attempt to combine all chunks into a single contiguous Arrow array using a per-column `ChunkedArray.combine_chunks()` call. The `combine_chunks` call can fail in cases that are confusing and unexpected, specifically where a Table column is of type `string` (or `binary`) and contains multiple chunks whose total size is too large for `string` (e.g., in excess of 2**31-1 bytes).

In this case, the user is forced to change the table column type to `large_string` (or `large_binary`) to work around the `tiledbsoma` forced flattening of the multi-chunk column. This has two undesirable effects:

The following demonstrates what looks like reasonable code, but breaks when the pandas dataframe gets too large, resulting in a Table column with >1 chunk and a total (all-chunk) size that is too large:
Running this results in:
Desired behavior: `DataFrame.write` should process Arrow arrays without flattening them, allowing the memory-efficient use of `string` or `binary` and a multi-chunk column.

Short term, this is likely not possible due to the dependency on TileDB-Py for the write path. Longer term, it seems likely that the C++ tiledbsoma code can simply process the native Arrow ChunkedArray.
Short term, it may be worth detecting this case (easily done in `DataFrame.write()`) and emitting a warning or error with more useful information than the Arrow error.