single-cell-data / TileDB-SOMA

Python and R SOMA APIs using TileDB’s cloud-native format. Ideal for single-cell data at any scale.
https://tiledbsoma.readthedocs.io
MIT License

[c++/python] Memory leak in Python sparse-array write #2928

Closed · johnkerl closed 4 weeks ago

johnkerl commented 1 month ago

Problem

There is a memory leak when writing SOMA sparse arrays (possibly other array types as well). One hypothesis is that the Arrow table being written is leaked. Test code below.

Note:

The leak is large enough that it prevents copying any large-ish array. For example, an attempt to read and rewrite a Census X layer (with a new schema) OOMs quickly on a 128 GiB host.

Reported by @bkmartinjr.

[sc-53264]
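
A smaller, S3-free probe can make the per-write growth visible before running the full Census repro below. This is a sketch, not part of the original report: it assumes psutil is installed for RSS measurement, and the URI, shape, and chunk size are arbitrary. If RSS climbs steadily even though each table is dropped and collected, the growth is per write.

import gc

import psutil  # third-party; assumed available for RSS measurement
import pyarrow as pa

import tiledbsoma as soma

URI = "./leak_probe_array"  # arbitrary local scratch URI
N = 1 << 20  # one million nonzeros per write

def rss_mib() -> float:
    # Resident set size of the current process, in MiB.
    return psutil.Process().memory_info().rss / 2**20

def main() -> None:
    soma.SparseNDArray.create(URI, type=pa.float32(), shape=(N, N)).close()
    with soma.SparseNDArray.open(URI, mode="w") as arr:
        for i in range(20):
            # Fresh COO chunk each iteration; if nothing is leaked per
            # write, steady-state RSS should stay roughly flat.
            tbl = pa.table({
                "soma_dim_0": pa.array(range(N), type=pa.int64()),
                "soma_dim_1": pa.array([i] * N, type=pa.int64()),
                "soma_data": pa.array([1.0] * N, type=pa.float32()),
            })
            arr.write(tbl)
            del tbl
            gc.collect()
            print(f"write {i}: rss {rss_mib():.0f} MiB")

if __name__ == "__main__":
    main()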

Repro

Test code (will create a test_array in the current working directory):

from __future__ import annotations

import gc
import sys

import tiledbsoma as soma

def copy(from_uri, to_uri, context):
    with soma.SparseNDArray.open(from_uri, mode="r", context=context) as X_from:
        with soma.SparseNDArray.open(to_uri, mode="w", context=context) as X_to:
            for i, tbl in enumerate(X_from.read().tables()):
                print(f"Read {len(tbl)}")
                X_to.write(tbl)

                # Dropping the table and forcing a collection does not
                # stop the memory growth.
                del tbl
                gc.collect()

                if i == 10:  # OOMs w/o this break
                    break

def create(src_uri, dst_uri, context):
    with soma.open(src_uri, mode="r", context=context) as X:
        n_obs, n_var = X.shape
        value_type = X.schema.field("soma_data").type  # avoid shadowing the builtin `type`

    a = soma.SparseNDArray.create(
        dst_uri,
        type=value_type,
        shape=(n_obs, n_var),
        platform_config={
            "tiledb": {
                "create": {
                    "capacity": 2**16,
                    "dims": {
                        "soma_dim_0": {
                            "tile": 64,
                            "filters": [{"_type": "ZstdFilter", "level": 9}],
                        },
                        "soma_dim_1": {
                            "tile": 2048,
                            "filters": [
                                "ByteShuffleFilter",
                                {"_type": "ZstdFilter", "level": 9},
                            ],
                        },
                    },
                    "attrs": {
                        "soma_data": {
                            "filters": [
                                "ByteShuffleFilter",
                                {"_type": "ZstdFilter", "level": 9},
                            ]
                        }
                    },
                    "cell_order": "row-major",
                    "tile_order": "row-major",
                    "allows_duplicates": True,
                },
            }
        },
        context=context,
    )
    a.close()
    print(f"Array created at {dst_uri}")

def main():
    src_uri = "s3://cellxgene-census-public-us-west-2/cell-census/2024-07-01/soma/census_data/homo_sapiens/ms/RNA/X/raw/"
    dst_uri = "./test_array/"

    context = soma.SOMATileDBContext(
        tiledb_config={
            "vfs.s3.region": "us-west-2",
            "soma.init_buffer_bytes": 1 * 1024**3,
        }
    )

    create(src_uri, dst_uri, context=context)
    copy(src_uri, dst_uri, context=context)

if __name__ == "__main__":
    sys.exit(main())
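
One way to localize the growth is to instrument the copy loop above so that Python-heap usage (stdlib tracemalloc) and process RSS are reported side by side: if RSS keeps climbing while the tracemalloc figure stays flat, the leaked memory is being allocated natively (C++/Arrow) rather than on the Python heap, which would be consistent with the leaked-Arrow-table hypothesis. A sketch, again assuming psutil; copy_instrumented is a hypothetical variant, not part of the original repro:

import tracemalloc

import psutil  # third-party; assumed available
import tiledbsoma as soma

def copy_instrumented(from_uri, to_uri, context):
    # Same loop as copy() above, with per-iteration memory reporting.
    tracemalloc.start()
    proc = psutil.Process()
    with soma.SparseNDArray.open(from_uri, mode="r", context=context) as X_from:
        with soma.SparseNDArray.open(to_uri, mode="w", context=context) as X_to:
            for i, tbl in enumerate(X_from.read().tables()):
                X_to.write(tbl)
                py_mib = tracemalloc.get_traced_memory()[0] / 2**20
                rss_mib = proc.memory_info().rss / 2**20
                # A flat python-heap number alongside growing RSS points at
                # a native (C++/Arrow) leak rather than retained Python objects.
                print(f"iter {i}: python-heap {py_mib:.0f} MiB, rss {rss_mib:.0f} MiB")
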
eddelbuettel commented 1 month ago

I had similar issues in both tiledb-r and tiledbsoma-r, which I was able to identify and address by measuring with valgrind and making (much) more extensive use of nanoarrow. In principle it is very easy to ensure that each Arrow-allocated object has a finalizer; in practice I found that not all of them ended up being freed. Of course, 'code being code', this leak may also be caused by something completely different, but valgrind is still gold.
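
In Python terms, the pattern described above amounts to registering a finalizer at allocation time, so the native memory is released even when explicit cleanup never runs. A minimal, self-contained sketch (not from the thread): the "native" allocator here is simulated with a set of live handles standing in for Arrow's C-level allocations.

import gc
import itertools
import weakref

_next_handle = itertools.count()
_live = set()  # simulated native heap: handles of live allocations

def allocate_native(nbytes: int) -> int:
    # nbytes is ignored by this simulation.
    handle = next(_next_handle)
    _live.add(handle)
    return handle

def free_native(handle: int) -> None:
    _live.discard(handle)

class NativeChunk:
    """Wraps a simulated native allocation with a guaranteed finalizer."""

    def __init__(self, nbytes: int):
        self._handle = allocate_native(nbytes)
        # weakref.finalize runs free_native(handle) exactly once, either
        # when close() is called or when the wrapper is garbage-collected.
        self._finalizer = weakref.finalize(self, free_native, self._handle)

    def close(self) -> None:
        self._finalizer()  # idempotent: subsequent calls are no-ops

chunk = NativeChunk(1 << 20)
del chunk
gc.collect()
assert not _live  # the finalizer fired, so nothing is "leaked"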

(I had noticed that the nightlies for tiledb-r changed from '216 bytes definitely lost' to '224 bytes', but as I just verified, both are due to the same initialization within the AWS SDK that we can do nothing about, and for which CRAN does not point fingers at us.)