What happened?
I am using xr.open_dataarray() with chunks and performing a simple computation. Afterwards about 800 MB of RAM remains in use, no matter whether I close the file explicitly, delete the xarray objects or invoke the Python garbage collector.
What seems to work: not using the threading Dask scheduler. The issue does not seem to occur with the single-threaded or processes scheduler. Setting MALLOC_MMAPMAX=40960 also seems to solve the issue, as suggested above (disclaimer: I don't fully understand the details here).
If I understand things correctly, this indicates that the issue is a consequence of dask/dask#3530. I am not sure whether there is anything to fix on the xarray side or what the best workaround would be. I will try the processes scheduler. See also #2186, which was closed without a fix, and my comment there.
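For illustration, here is a minimal sketch of the scheduler workaround, reusing the tempdata.nc file created by the example below (it only selects a different Dask scheduler and does not address the underlying problem):

import dask
import xarray as xr

# A non-threaded scheduler avoids the leak in my tests:
# 'processes' runs the chunked computation in worker processes,
# 'single-threaded' bypasses the threaded scheduler entirely.
with dask.config.set(scheduler='processes'):
    data = xr.open_dataarray('tempdata.nc', chunks=1_000_000, cache=False)
    print("Result", float(data.sum()))
    data.close()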
What did you expect to happen?
Not consuming significantly more memory than before opening the NetCDF file.
Minimal Complete Verifiable Example
import gc
import dask
import psutil
import os.path
import numpy as np
import xarray as xr

# a value of 1_000_000 would make much more sense here, but there seems to be a larger memory leak
# with small chunk size for some reason
CHUNK_SIZE = 1_000_000


def print_used_mem():
    process = psutil.Process()
    print("Used RAM in GB:", process.memory_info().rss / 1024**3)


def read_test_data():
    print("Opening DataArray...")
    print_used_mem()
    data = xr.open_dataarray('tempdata.nc', chunks=CHUNK_SIZE, cache=False)
    print_used_mem()

    print("Compute sum...")
    result = data.sum()
    print_used_mem()

    print("Print result...")
    print("Result", float(result))
    print_used_mem()

    data.close()
    del result
    del data
    print_used_mem()


def main():
    # preparation:
    # create about 7.5GB of data (8 * 10**9 / 1024**3)
    if not os.path.exists('tempdata.nc'):
        print("Creating 7.5GB file tempdata.nc...")
        data = xr.DataArray(np.zeros(10**9))
        data.to_netcdf('tempdata.nc')
        print("Test file created!")

    with dask.config.set(scheduler='threading'):
        print("Starting read test...")
        print_used_mem()
        read_test_data()

        print("not inside any function any longer")
        print_used_mem()

        print("Garbage collect:", gc.collect())
        print_used_mem()


if __name__ == '__main__':
    print("Used memory before test:")
    print_used_mem()
    print("")

    main()

    print("\nUsed memory after test:")
    print_used_mem()
MVCE confirmation
[X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
[X] Complete example — the example is self-contained, including all data and the text of any traceback.
[X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
[X] New issue — a search of GitHub Issues suggests this is not a duplicate.
[X] Recent environment — the issue occurs with the latest version of xarray and its dependencies.
Relevant log output
# output from running the minimal example above:
Used memory before test:
Used RAM in GB: 0.11433029174804688
Creating 7.5GB file tempdata.nc...
Test file created!
Starting read test...
Used RAM in GB: 0.13946533203125
Opening DataArray...
Used RAM in GB: 0.13946533203125
Used RAM in GB: 0.14622879028320312
Compute sum...
Used RAM in GB: 0.14670944213867188
Print result...
Result 0.0
Used RAM in GB: 0.6771659851074219
Used RAM in GB: 0.6771659851074219
not inside any function any longer
Used RAM in GB: 0.6771659851074219
Garbage collect: 1113
Used RAM in GB: 0.6744232177734375
Used memory after test:
Used RAM in GB: 0.6744194030761719
Anything else we need to know?
No response
Environment
I did the tests in a new conda environment, installing only the relevant packages:
micromamba install -c conda-forge xarray dask netcdf4
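The version report referenced below was produced with xarray's built-in helper; a minimal sketch of generating it in the same environment:

import xarray as xr

# Prints the versions of xarray, dask, netCDF4 and related libraries
# for the Environment section.
xr.show_versions()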
xr.show_versions():

conda list: