single-cell-data / TileDB-SOMA

Python and R SOMA APIs using TileDB’s cloud-native format. Ideal for single-cell data at any scale.
https://tiledbsoma.readthedocs.io
MIT License

[python] `DataFrame.write` generates error handling chunked array #2120

Closed bkmartinjr closed 5 months ago

bkmartinjr commented 8 months ago

If DataFrame.write() receives an Arrow Table containing a chunked array, it attempts to combine all chunks into a single contiguous Arrow array using a per-column ChunkedArray.combine_chunks() call. That call can fail in ways that are confusing and unexpected -- specifically, when a Table column is of type string (or binary) and contains multiple chunks whose total size is too large for string (i.e., in excess of 2**31-1 bytes).

In this case, the user is forced to change the table column type to large_string (or large_binary) to work around the tiledbsoma forced flattening of the multi-chunk column. This has two undesirable effects: the caller must widen their schema to a type they did not otherwise need, and the flattening itself copies the entire column into one contiguous buffer, losing the memory efficiency of chunked arrays.
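
For illustration, here is a minimal sketch of that workaround, assuming a hypothetical to_large_types() helper (not tiledbsoma API) that upcasts every string/binary column before the write:

import pyarrow as pa

def to_large_types(tbl: pa.Table) -> pa.Table:
    # Hypothetical helper: upcast 32-bit-offset string/binary columns to their
    # 64-bit-offset variants so a later combine_chunks() cannot overflow.
    fields = [
        f.with_type(pa.large_string()) if pa.types.is_string(f.type)
        else f.with_type(pa.large_binary()) if pa.types.is_binary(f.type)
        else f
        for f in tbl.schema
    ]
    return tbl.cast(pa.schema(fields))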

The following demonstrates what looks like reasonable code, but breaks once the pandas dataframe grows large enough to produce a Table column with more than one chunk and a total (all-chunk) size that exceeds the string offset limit:

import sys
import numpy as np
import tiledbsoma as soma
import pandas as pd
import pyarrow as pa

def main():
    L = 128  # length (bytes) of each string value

    # works if N == int((2**31 - 1) // L); at N == 2**31 // L the column's
    # total string size reaches 2**31 bytes, one past the 32-bit offset limit
    N = int(2**31 // L)

    df = pd.DataFrame.from_records({"soma_joinid": np.arange(0, N, dtype=np.int64)})
    df["A"] = "A" * L  # every row holds the same L-byte string
    print(df)

    # from_pandas splits the oversized string column into multiple chunks
    tbl = pa.Table.from_pandas(df, preserve_index=False)
    print(tbl)

    # write() fails while combining those chunks back into one contiguous array
    with soma.DataFrame.create("test_dataframe", schema=tbl.schema) as A:
        A.write(tbl)

if __name__ == "__main__":
    sys.exit(main())

Running this results in:

$ python tmp/chunk_write.py 
          soma_joinid                                                  A
0                   0  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA...
1                   1  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA...
2                   2  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA...
3                   3  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA...
4                   4  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA...
...               ...                                                ...
16777211     16777211  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA...
16777212     16777212  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA...
16777213     16777213  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA...
16777214     16777214  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA...
16777215     16777215  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA...

[16777216 rows x 2 columns]
pyarrow.Table
soma_joinid: int64
A: string
----
soma_joinid: [[0,1,2,3,4,...,16777211,16777212,16777213,16777214,16777215]]
A: [["AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA","AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA","AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA","AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA","AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA",...,"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA","AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA","AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA","AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA","AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"],["AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"]]
Traceback (most recent call last):
  File "/home/bruce/cellxgene-census/tmp/chunk_write.py", line 27, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/bruce/cellxgene-census/tmp/chunk_write.py", line 23, in main
    A.write(tbl)
  File "/home/bruce/cellxgene-census/venv311/lib/python3.11/site-packages/tiledbsoma/_dataframe.py", line 408, in write
    col = values.column(name).combine_chunks()
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/table.pxi", line 746, in pyarrow.lib.ChunkedArray.combine_chunks
  File "pyarrow/array.pxi", line 3776, in pyarrow.lib.concat_arrays
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays

Desired behavior: DataFrame.write() should process Arrow chunked arrays without flattening them, allowing the memory-efficient use of string or binary with a multi-chunk column.
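
A user-level approximation of that behavior, sketched under the assumption that repeated write() calls within one open write context append rows (distinct soma_joinid values), is to write one record batch at a time, so each call sees single-chunk columns below the offset limit:

with soma.DataFrame.create("test_dataframe", schema=tbl.schema) as A:
    # each batch becomes a single-chunk Table, so combine_chunks() is a no-op
    for batch in tbl.to_batches():
        A.write(pa.Table.from_batches([batch]))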

Short term, this is likely not possible due to the dependency on TileDB-Py for the write path. Longer term, it seems likely that the C++ tiledbsoma code can simply process the native Arrow ChunkedArray.

Short term, it may be worth detecting this case (easily done in DataFrame.write()) and emitting a warning or error with more useful information than the Arrow one.
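
A minimal sketch of such a check, assuming a hypothetical check_combinable() guard invoked at the top of DataFrame.write() (illustrative only; nbytes over-counts slightly because it includes offset buffers):

import pyarrow as pa

_MAX_OFFSET = 2**31 - 1  # largest value a 32-bit string/binary offset can hold

def check_combinable(tbl: pa.Table) -> None:
    # Hypothetical guard: fail early with a clear message, before Arrow
    # raises "offset overflow while concatenating arrays".
    for name in tbl.column_names:
        col = tbl.column(name)
        if col.num_chunks <= 1:
            continue
        if pa.types.is_string(col.type) or pa.types.is_binary(col.type):
            total = sum(chunk.nbytes for chunk in col.chunks)
            if total > _MAX_OFFSET:
                raise ValueError(
                    f"column {name!r}: {col.num_chunks} chunks totaling {total} "
                    f"bytes cannot be combined into a single {col.type} array; "
                    "cast the column to large_string/large_binary."
                )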

bkmartinjr commented 8 months ago

@johnkerl - I think this is also a C++ issue (not just Python), as a full-fledged fix will require the C++ (write path) code to natively process chunked arrays in a zero-copy manner.

johnkerl commented 8 months ago

OK @bkmartinjr -- since the write path is currently not C++, we'll probably need two issues: one near-term bugfix for the status quo implementation, and a longer-term reminder/tracking task to make sure the C++ code gets that fix.

bkmartinjr commented 8 months ago

I don't think a Python fix is actually possible (unless you consider a warning to be a fix). The upstream user has to change the type of the affected arrays (string -> large_string) until the underlying TileDB write path can handle chunked arrays natively.

johnkerl commented 5 months ago

Dup of #2462