single-cell-data / TileDB-SOMA

Python and R SOMA APIs using TileDB’s cloud-native format. Ideal for single-cell data at any scale.
https://tiledbsoma.readthedocs.io
MIT License

[python] `DataFrame.write` generates error handling chunked array #2120

Closed bkmartinjr closed 5 months ago

bkmartinjr commented 8 months ago

If DataFrame.write() receives an Arrow Table containing a chunked array, it attempts to combine all chunks into a single contiguous Arrow array using a per-column ChunkedArray.combine_chunks() call. That call can fail in ways that are confusing and unexpected -- specifically, when a Table column is of type string (or binary) and contains multiple chunks whose total size is too large for string (i.e., in excess of 2**31-1 bytes).

In this case, the user is forced to change the table column type to large_string (or large_binary) to work around the tiledbsoma forced flattening of the multi-chunk column. This has two undesirable effects: the caller must widen their schema to a type they did not otherwise need, and the flattening itself copies the entire column into one contiguous buffer, losing the memory efficiency of chunked arrays.
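
For illustration, here is a minimal sketch of that workaround, assuming a hypothetical to_large_types() helper (not tiledbsoma API) that upcasts every string/binary column before the write:

import pyarrow as pa

def to_large_types(tbl: pa.Table) -> pa.Table:
    # Hypothetical helper: upcast 32-bit-offset string/binary columns to their
    # 64-bit-offset variants so a later combine_chunks() cannot overflow.
    fields = [
        f.with_type(pa.large_string()) if pa.types.is_string(f.type)
        else f.with_type(pa.large_binary()) if pa.types.is_binary(f.type)
        else f
        for f in tbl.schema
    ]
    return tbl.cast(pa.schema(fields))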

The following demonstrates what looks like reasonable code, but breaks once the pandas dataframe grows large enough to produce a Table column with more than one chunk and a total (all-chunk) size that exceeds the string offset limit:

import sys
import numpy as np
import tiledbsoma as soma
import pandas as pd
import pyarrow as pa

def main():
    L = 128  # length (bytes) of each string value

    # works if N == int((2**31 - 1) // L); at N == 2**31 // L the column's
    # total string size reaches 2**31 bytes, one past the 32-bit offset limit
    N = int(2**31 // L)

    df = pd.DataFrame.from_records({"soma_joinid": np.arange(0, N, dtype=np.int64)})
    df["A"] = "A" * L  # every row holds the same L-byte string
    print(df)

    # from_pandas splits the oversized string column into multiple chunks
    tbl = pa.Table.from_pandas(df, preserve_index=False)
    print(tbl)

    # write() fails while combining those chunks back into one contiguous array
    with soma.DataFrame.create("test_dataframe", schema=tbl.schema) as A:
        A.write(tbl)

if __name__ == "__main__":
    sys.exit(main())

Running this results in:

$ python tmp/chunk_write.py 
          soma_joinid                                                  A
0                   0  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA...
1                   1  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA...
2                   2  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA...
3                   3  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA...
4                   4  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA...
...               ...                                                ...
16777211     16777211  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA...
16777212     16777212  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA...
16777213     16777213  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA...
16777214     16777214  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA...
16777215     16777215  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA...

[16777216 rows x 2 columns]
pyarrow.Table
soma_joinid: int64
A: string
----
soma_joinid: [[0,1,2,3,4,...,16777211,16777212,16777213,16777214,16777215]]
A: [["AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA","AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA","AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA","AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA","AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA",...,"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA","AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA","AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA","AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA","AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"],["AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"]]
Traceback (most recent call last):
  File "/home/bruce/cellxgene-census/tmp/chunk_write.py", line 27, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/bruce/cellxgene-census/tmp/chunk_write.py", line 23, in main
    A.write(tbl)
  File "/home/bruce/cellxgene-census/venv311/lib/python3.11/site-packages/tiledbsoma/_dataframe.py", line 408, in write
    col = values.column(name).combine_chunks()
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/table.pxi", line 746, in pyarrow.lib.ChunkedArray.combine_chunks
  File "pyarrow/array.pxi", line 3776, in pyarrow.lib.concat_arrays
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays

Desired behavior: DataFrame.write() should process Arrow chunked arrays without flattening them, allowing the memory-efficient use of string or binary with a multi-chunk column.
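
A user-level approximation of that behavior, sketched under the assumption that repeated write() calls within one open write context append rows (distinct soma_joinid values), is to write one record batch at a time, so each call sees single-chunk columns below the offset limit:

with soma.DataFrame.create("test_dataframe", schema=tbl.schema) as A:
    # each batch becomes a single-chunk Table, so combine_chunks() is a no-op
    for batch in tbl.to_batches():
        A.write(pa.Table.from_batches([batch]))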

Short term, this is likely not possible due to the dependency on TileDB-Py for the write path. Longer term, it seems likely that the C++ tiledbsoma code can simply process the native Arrow ChunkedArray.

Short term, it may be worth detecting this case (easily done in DataFrame.write()) and emitting a warning or error with more useful information than the Arrow one.
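
A minimal sketch of such a check, assuming a hypothetical check_combinable() guard invoked at the top of DataFrame.write() (illustrative only; nbytes over-counts slightly because it includes offset buffers):

import pyarrow as pa

_MAX_OFFSET = 2**31 - 1  # largest value a 32-bit string/binary offset can hold

def check_combinable(tbl: pa.Table) -> None:
    # Hypothetical guard: fail early with a clear message, before Arrow
    # raises "offset overflow while concatenating arrays".
    for name in tbl.column_names:
        col = tbl.column(name)
        if col.num_chunks <= 1:
            continue
        if pa.types.is_string(col.type) or pa.types.is_binary(col.type):
            total = sum(chunk.nbytes for chunk in col.chunks)
            if total > _MAX_OFFSET:
                raise ValueError(
                    f"column {name!r}: {col.num_chunks} chunks totaling {total} "
                    f"bytes cannot be combined into a single {col.type} array; "
                    "cast the column to large_string/large_binary."
                )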

bkmartinjr commented 8 months ago

@johnkerl - I think this is also a C++ issue (not just Python), as a full-fledged fix will require the C++ (write path) code to natively process chunked arrays in a zero-copy manner.

johnkerl commented 8 months ago

OK @bkmartinjr -- since the write path is currently not C++, we'll probably need two issues: one near-term bugfix for the status quo implementation, and a longer-term reminder/tracking task to make sure the C++ code gets that fix.

bkmartinjr commented 8 months ago

I don't think a Python fix is actually possible (unless you consider a warning to be a fix). The upstream user has to change the type of the affected arrays (string -> large_string) until the underlying TileDB write path can handle chunked arrays natively.

johnkerl commented 5 months ago

Dup of #2462