pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev

Uncontrolled memory growth in custom backend #9576

Open sappjw opened 1 month ago

sappjw commented 1 month ago

What happened?

I wrote a custom backend. I'm using it to open a file, operate on the data, drop most of it from the Dataset using .isel, open the next file, concatenate, and repeat. I noticed that the memory used by the process grew significantly over time even though the size of the Dataset stayed approximately the same. I was able to reproduce the problem without most of this complexity.
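
For context, the full workflow looks roughly like this (a minimal sketch; "my_engine" and the file names are hypothetical stand-ins for my real backend and data):

# Sketch of the real workload, not the MVCE below; the engine name and paths are placeholders
import xarray as xr

combined = None
for path in ["file0.bin", "file1.bin", "file2.bin"]:
    ds = xr.open_dataset(path, engine="my_engine")
    ds = ds.isel(index=slice(0, 100))  # keep only a small slice
    combined = ds if combined is None else xr.concat([combined, ds], dim="index")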

I repeatedly created a dummy Dataset with 25 Variables and used objgraph to count live objects after each iteration. The number of Variable instances keeps increasing even though I del the Dataset after creating it, which I think suggests that something in xarray is not releasing the Dataset.
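
For reference, the tables below come from objgraph.show_growth(), which prints the per-type increase in live object counts since its previous call. A tiny standalone example of how it is used:

import objgraph

objgraph.show_growth()   # first call establishes the baseline counts
data = [object() for _ in range(1000)]
objgraph.show_growth()   # prints only the types whose counts grew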

Variable               75       +25
ReferenceType        8511        +4
lock                   46        +4
dict                21283        +2
KeyedRef               22        +2
SerializableLock        9        +2
list                23256        +1
set                  1873        +1
method               1893        +1
SeedSequence            4        +1
---
Variable              100       +25
ReferenceType        8515        +4
lock                   50        +4
dict                21285        +2
KeyedRef               24        +2
SerializableLock       11        +2
list                23257        +1
set                  1874        +1
method               1894        +1
SeedSequence            5        +1

I picked a random Variable that was not released and plotted its backreference chain:

import random
import objgraph

chain = objgraph.find_backref_chain(
    random.choice(objgraph.by_type('Variable')),
    objgraph.is_proper_module,
)
objgraph.show_chain(chain, filename='graph.png')

[graph.png: backreference chain graph for the leaked Variable]

What did you expect to happen?

I expected the memory used for the Dataset to be released and garbage-collected. I expected the memory in use to plateau instead of grow.

Minimal Complete Verifiable Example

import os
# objgraph is used below to count live objects; install it into the environment
os.system("mamba install --yes objgraph")

import io

import numpy as np
import xarray as xr
from xarray.backends import BackendArray, BackendEntrypoint
from xarray.backends.common import AbstractDataStore
from xarray.backends.file_manager import CachingFileManager
from xarray.backends.locks import SerializableLock, ensure_lock
from xarray.backends.store import StoreBackendEntrypoint
from xarray.core.indexing import (
    IndexingSupport,
    explicit_indexing_adapter,
)
from xarray.core.utils import Frozen, close_on_error

# This backend creates one random 500MB buffer and exposes it as 25 variables in a Dataset
class memtest_DataStore(AbstractDataStore):
    """Store for binary data"""

    def __init__(
        self,
        filename,
        lock=None,
        num_bytes=500_000_000,
        **kwargs,
    ):
        if lock is None:
            lock = SerializableLock()
        self.lock = ensure_lock(lock)
        # Create a lock for the file manager
        self._fm_lock = SerializableLock()
        self._manager = CachingFileManager(
            self._open_file,
            filename,
            lock=self._fm_lock,
            kwargs=kwargs,
        )

        self.num_bytes = num_bytes
        self._num_vars = 25
        self._rng = np.random.default_rng()
        self._xr_obj = None

    def _open_file(self, filename, **kwargs):
        # Generate num_bytes of random data; a numpy array is bytes-like,
        # so BytesIO accepts it directly (and copies it)
        b = self._rng.random(self.num_bytes // 8, dtype=np.float64)
        file_obj = io.BytesIO(b, **kwargs)

        return file_obj

    def _create_dataset(self):
        with self._manager.acquire_context(needs_lock=True) as file_obj:
            file_obj.seek(0)
            fltarr = np.frombuffer(file_obj.getvalue(), np.float64)

        xr_obj = xr.Dataset()
        # Create coordinates
        xr_obj["index"] = (
            ("index",),
            range(len(fltarr)),
        )
        # Create several variables so we can see the change
        for i in range(self._num_vars):
            xr_obj[f"var{i}"] = (
                ("index",),
                fltarr,
            )

        return xr_obj

    @property
    def ds(self):
        # Lazily build and cache the Dataset on first access
        if self._xr_obj is None:
            self._xr_obj = self._create_dataset()

        return self._xr_obj

    def open_store_variable(self, name, var):
        return self.ds[name].variable

    def get_variables(self):
        return Frozen(self.ds.variables)

    def get_attrs(self):
        return Frozen(self.ds.attrs)

    def get_dimensions(self):
        return Frozen(self.ds.dims)

    def close(self):
        self._manager.close()

class memtest_backend(BackendEntrypoint):
    description = "Memory test backend"

    def open_dataset(
        self,
        filename_or_obj,
        *,
        drop_variables=None,
        lock=None,
        num_bytes=500_000_000,
    ):
        store = memtest_DataStore(
            filename_or_obj,
            lock=lock,
            num_bytes=num_bytes,
        )

        store_entrypoint = StoreBackendEntrypoint()
        with close_on_error(store):
            ds = store_entrypoint.open_dataset(
                store,
                drop_variables=drop_variables,
            )

        return ds

# Loop to test memory growth
# This results in about 10GB of additional memory used by the end of the process
import objgraph

objgraph.show_growth(limit=1)
for i in range(10):
    # The "filename" is only used as a cache key, so a bare integer works
    ds = xr.open_dataset(i, engine=memtest_backend)
    # Just to test that something's still holding on to the data
    del ds
    objgraph.show_growth()
    print('---')

MVCE confirmation

Relevant log output

function    18610    +18610
list                 15403    +12798
function             25617     +7007
dict                 13365     +6005
tuple                13360     +5284
CompositeUnit         2954     +2954
PrefixUnit            1526     +1526
ReferenceType         4284     +1181
cell                  4966     +1112
type                  2397      +792
getset_descriptor     3374      +628
---
Variable                         50       +25
ReferenceType                  4298       +14
builtin_function_or_method     2537       +10
lock                             34        +4
dict                          13367        +2
KeyedRef                         19        +2
SerializableLock                  7        +2
list                          15404        +1
set                             952        +1
method                          369        +1
---
Variable               75       +25
ReferenceType        4302        +4
lock                   38        +4
dict                13369        +2
KeyedRef               21        +2
SerializableLock        9        +2
list                15405        +1
set                   953        +1
method                370        +1
SeedSequence            4        +1
---
Variable              100       +25
ReferenceType        4306        +4
lock                   42        +4
dict                13371        +2
KeyedRef               23        +2
SerializableLock       11        +2
list                15406        +1
set                   954        +1
method                371        +1
SeedSequence            5        +1
---
Variable              125       +25
ReferenceType        4310        +4
lock                   46        +4
dict                13373        +2
KeyedRef               25        +2
SerializableLock       13        +2
list                15407        +1
set                   955        +1
method                372        +1
SeedSequence            6        +1
---
Variable              150       +25
ReferenceType        4314        +4
lock                   50        +4
dict                13375        +2
KeyedRef               27        +2
SerializableLock       15        +2
list                15408        +1
set                   956        +1
method                373        +1
SeedSequence            7        +1
---
Variable              175       +25
ReferenceType        4318        +4
lock                   54        +4
dict                13377        +2
KeyedRef               29        +2
SerializableLock       17        +2
list                15409        +1
set                   957        +1
method                374        +1
SeedSequence            8        +1
---
Variable              200       +25
ReferenceType        4322        +4
lock                   58        +4
dict                13379        +2
KeyedRef               31        +2
SerializableLock       19        +2
list                15410        +1
set                   958        +1
method                375        +1
SeedSequence            9        +1
---
Variable              225       +25
ReferenceType        4326        +4
lock                   62        +4
dict                13381        +2
KeyedRef               33        +2
SerializableLock       21        +2
list                15411        +1
set                   959        +1
method                376        +1
SeedSequence           10        +1
---
Variable              250       +25
ReferenceType        4330        +4
lock                   66        +4
dict                13383        +2
KeyedRef               35        +2
SerializableLock       23        +2
list                15412        +1
set                   960        +1
method                377        +1
SeedSequence           11        +1
---

Anything else we need to know?

This crashes the Binder notebook instance since it uses so much memory.

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.12.6 | packaged by conda-forge | (main, Sep 30 2024, 18:08:52) [GCC 13.3.0]
python-bits: 64
OS: Linux
OS-release: 5.14.0-505.el9.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.3
libnetcdf: 4.9.2

xarray: 2024.9.0
pandas: 2.2.3
numpy: 1.26.4
scipy: None
netCDF4: 1.7.1
pydap: None
h5netcdf: 1.2.0
h5py: 3.11.0
zarr: None
cftime: 1.6.4
nc_time_axis: None
iris: None
bottleneck: None
dask: 2023.9.3
distributed: 2023.9.3
matplotlib: 3.9.1
cartopy: None
seaborn: None
numbagg: None
fsspec: 2023.9.2
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 68.2.2
pip: 23.2.1
conda: None
pytest: None
mypy: None
IPython: 8.16.1
sphinx: None
max-sixty commented 1 month ago

That does look like a well-formed issue, thanks @sappjw.

I had a 10-minute look through the code, since I'm not familiar with it. It looks like we maintain a REF_COUNTS dict which increments on open and which, I think, is supposed to decrement when the file is closed.

To the extent you want to explore more: it would be interesting to see whether that count is incrementing without decrementing as you open the file on each loop, and then why it's not decrementing. Does the file need to have .close called rather than just going out of scope?
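
Something along these lines would show it (REF_COUNTS is an internal detail of xarray.backends.file_manager, not public API, so this is only a debugging sketch):

# Debugging sketch: REF_COUNTS is private and may change between releases
from xarray.backends.file_manager import REF_COUNTS

for i in range(10):
    ds = xr.open_dataset(i, engine=memtest_backend)
    ds.close()
    del ds
    print(dict(REF_COUNTS))  # should shrink back toward empty if references are released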

sappjw commented 1 month ago

Thanks. Inserting a ds.close() before the del ds doesn't change the results (I did have to update the code above slightly to read file_obj.getvalue() instead of file_obj.getbuffer()).

max-sixty commented 1 month ago

OK. Would be interesting to see REF_COUNTS / feel free to explore more on what might be holding the reference...

sappjw commented 1 month ago

If I put a del self._manager in the close() method of memtest_DataStore, the memory growth goes away. Do you think this is a good way to handle the issue? I modeled this on the scipy backend, which doesn't free _manager either.
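
That is, roughly:

    def close(self):
        self._manager.close()
        # Dropping our reference to the CachingFileManager appears to let the
        # cached Dataset be garbage-collected; without this, memory grows
        del self._manager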

max-sixty commented 1 month ago

Very possibly, but I'm unfortunately a long way from the expert on this code, hence my rather basic debugging so far.

Others will hopefully know more. If that suggestion helps and you can put a small PR + test together, that will very likely get some good traction.
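
A regression test for such a PR could look something like this rough sketch (assuming the MVCE definitions above are in scope; the Variable-counting heuristic is only illustrative). It should fail today and pass once the store releases its references:

import gc

def test_dataset_released_after_close():
    ds = xr.open_dataset(0, engine=memtest_backend, num_bytes=8_000)
    ds.close()
    del ds
    gc.collect()
    leaked = [o for o in gc.get_objects() if type(o).__name__ == "Variable"]
    assert not leaked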