pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Memory leak - xr.open_dataset() not releasing memory. #7404

Open deepgabani8 opened 1 year ago

deepgabani8 commented 1 year ago

What happened?

Let's take this sample netcdf file.

Observe that the memory is not released even after deleting `ds`.

Code

import os
import psutil
import xarray as xr
from memory_profiler import profile

@profile
def main():
    path = 'ECMWF_ERA-40_subset.nc'
    print(f"Before opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    ds = xr.open_dataset(path)
    del ds
    print(f"After opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")

if __name__ == '__main__':
    print(f"Start: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    main()
    print(f"End: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")

Console logs

Start: 186.5859375 MiB
Before opening file: 187.25 MiB
After opening file: 308.09375 MiB
Filename: temp.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     6    187.2 MiB    187.2 MiB           1   @profile
     7                                         def main():
     8    187.2 MiB      0.0 MiB           1       path = 'ECMWF_ERA-40_subset.nc'
     9    187.2 MiB      0.0 MiB           1       print(f"Before opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    10    308.1 MiB    120.8 MiB           1       ds = xr.open_dataset(path)
    11    308.1 MiB      0.0 MiB           1       del ds
    12    308.1 MiB      0.0 MiB           1       print(f"After opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")

End: 308.09375 MiB

I am using xarray==0.20.2 and gdal==3.5.1. Sister issue: https://github.com/ecmwf/cfgrib/issues/325#issuecomment-1363011917

What did you expect to happen?

Ideally, memory consumed by the xarray dataset should be released when the dataset is closed/deleted.

Minimal Complete Verifiable Example

No response

MVCE confirmation

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:53) [GCC 9.4.0]
python-bits: 64
OS: Linux
OS-release: 4.19.0-22-cloud-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.8.1
xarray: 0.20.2
pandas: 1.3.5
numpy: 1.19.5
scipy: 1.7.3
netCDF4: 1.6.0
pydap: None
h5netcdf: 1.0.2
h5py: 3.7.0
Nio: None
zarr: 2.12.0
cftime: 1.6.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.2.10
cfgrib: 0.9.10.1
iris: None
bottleneck: None
dask: 2022.02.0
distributed: 2022.02.0
matplotlib: 3.5.2
cartopy: 0.20.3
seaborn: 0.11.2
numbagg: None
fsspec: 2022.7.1
cupy: None
pint: None
sparse: None
setuptools: 59.8.0
pip: 22.2.2
conda: 22.9.0
pytest: None
IPython: 7.33.0
sphinx: None
keewis commented 1 year ago

I'm not sure how memory_profiler calculates memory usage, but I suspect this happens because Python's garbage collector does not have to run immediately after the `del`.

Can you try manually triggering the garbage collector?

import gc
import os
import psutil
import xarray as xr
from memory_profiler import profile

@profile
def main():
    path = 'ECMWF_ERA-40_subset.nc'
    gc.collect()
    print(f"Before opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    ds = xr.open_dataset(path)
    del ds
    gc.collect()
    print(f"After opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")

if __name__ == '__main__':
    print(f"Start: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    main()
    print(f"End: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
deepgabani8 commented 1 year ago

It still shows similar memory consumption.

Start: 185.6015625 MiB
Before opening file: 186.24609375 MiB
After opening file: 307.1328125 MiB
Filename: temp.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     7    186.0 MiB    186.0 MiB           1   @profile
     8                                         def main():
     9    186.0 MiB      0.0 MiB           1       path = 'ECMWF_ERA-40_subset.nc'
    10    186.0 MiB      0.0 MiB           1       gc.collect()
    11    186.2 MiB      0.2 MiB           1       print(f"Before opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    12    307.1 MiB    120.9 MiB           1       ds = xr.open_dataset(path)
    13    307.1 MiB      0.0 MiB           1       del ds
    14    307.1 MiB      0.0 MiB           1       gc.collect()
    15    307.1 MiB      0.0 MiB           1       print(f"After opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")

End: 307.1328125 MiB
shoyer commented 1 year ago

If you care about memory usage, you should explicitly close files after you use them, e.g., by calling ds.close() or by using a context manager. Does that work for you?
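
For reference, here is a minimal sketch of both suggestions, reusing the sample file from this thread; `ds.close()` and the `with` block are the standard xarray APIs for releasing a file handle.

```python
import xarray as xr

path = "ECMWF_ERA-40_subset.nc"  # same sample file as above

# Option 1: close the dataset explicitly once you are done with it.
ds = xr.open_dataset(path)
# ... work with ds ...
ds.close()

# Option 2: let a context manager close the file on exit, even on errors.
with xr.open_dataset(path) as ds:
    ...  # work with ds
```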

deepgabani8 commented 1 year ago

Thanks @shoyer, but closing the dataset explicitly also doesn't seem to release the memory.

Start: 185.5078125 MiB
Before opening file: 186.28515625 MiB
After opening file: 307.75390625 MiB
Filename: temp.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     7    186.1 MiB    186.1 MiB           1   @profile
     8                                         def main():
     9    186.1 MiB      0.0 MiB           1       path = 'ECMWF_ERA-40_subset.nc'
    10    186.1 MiB      0.0 MiB           1       gc.collect()
    11    186.3 MiB      0.2 MiB           1       print(f"Before opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    12    307.8 MiB    121.5 MiB           1       ds = xr.open_dataset(path)
    13    307.8 MiB      0.0 MiB           1       ds.close()
    14    307.8 MiB      0.0 MiB           1       del ds
    15    307.8 MiB      0.0 MiB           1       gc.collect()
    16    307.8 MiB      0.0 MiB           1       print(f"After opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")

End: 307.75390625 MiB

I also tried a context manager, but memory consumption is the same.

Start: 185.5625 MiB
Before opening file: 186.36328125 MiB
After opening file: 307.265625 MiB
Filename: temp.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     7    186.2 MiB    186.2 MiB           1   @profile
     8                                         def main():
     9    186.2 MiB      0.0 MiB           1       path = 'ECMWF_ERA-40_subset.nc'
    10    186.2 MiB      0.0 MiB           1       gc.collect()
    11    186.4 MiB      0.2 MiB           1       print(f"Before opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    12    307.3 MiB    120.9 MiB           1       with xr.open_dataset(path) as ds:
    13    307.3 MiB      0.0 MiB           1           ds.close()
    14    307.3 MiB      0.0 MiB           1           del ds
    15    307.3 MiB      0.0 MiB           1       gc.collect()
    16    307.3 MiB      0.0 MiB           1       print(f"After opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")

End: 307.265625 MiB
DocOtak commented 1 year ago

I've personally seen a lot of what looks like memory reuse in numpy and related libraries. I don't think any of this happens explicitly, but I have never investigated. If memory were not being released as expected, I would expect opening and closing the dataset in a loop to keep increasing memory usage; it didn't on the recent library versions I have.
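
A runnable reconstruction of that loop test, pieced together from the profiler output below (imports assumed to match the earlier examples in this thread):

```python
import os
import psutil
import xarray as xr
from memory_profiler import profile

@profile
def main():
    path = 'ECMWF_ERA-40_subset.nc'
    print(f"Before opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    # Open and close the same file many times; if memory leaked on every open,
    # RSS should grow with the number of iterations.
    for i in range(1000):
        with xr.open_dataset(path) as ds:
            ...
    print(f"After opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")

if __name__ == '__main__':
    print(f"Start: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    main()
    print(f"End: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
```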

Start: 89.71875 MiB
Before opening file: 90.203125 MiB
After opening file: 96.6875 MiB
Filename: test.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     6     90.2 MiB     90.2 MiB           1   @profile
     7                                         def main():
     8     90.2 MiB      0.0 MiB           1       path = 'ECMWF_ERA-40_subset.nc'
     9     90.2 MiB      0.0 MiB           1       print(f"Before opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    10     96.7 MiB     -0.1 MiB        1001       for i in range(1000):
    11     96.7 MiB      6.4 MiB        1000           with xr.open_dataset(path) as ds:
    12     96.7 MiB     -0.1 MiB        1000             ...
    13     96.7 MiB      0.0 MiB           1       print(f"After opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")

End: 96.6875 MiB
Show Versions

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.8.13 (default, Jul 23 2022, 17:00:57) [Clang 13.1.6 (clang-1316.0.21.2.5)]
python-bits: 64
OS: Darwin
OS-release: 22.1.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.9.0
xarray: 2022.11.0
pandas: 1.4.3
numpy: 1.23.5
scipy: None
netCDF4: 1.6.0
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.6.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: 3.5.3
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 56.0.0
pip: 22.0.4
conda: None
pytest: 6.2.5
IPython: 8.4.0
sphinx: 5.1.1
```
deepgabani8 commented 1 year ago

Thanks @DocOtak for the observation.

This holds only when iterating over the same file; there I observe the same behavior. Here is the memory usage plotted against the iterations: [plot: memory usage vs. iterations, same file]

When I tried to validate this by iterating over different files, however, memory usage gradually increased. Here is the memory usage: [plot: memory usage vs. iterations, different files]
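
For illustration, a hedged sketch of the kind of loop that exercises this case: each iteration opens a different file and RSS is recorded afterwards (the `data/*.nc` file list is hypothetical).

```python
import glob
import os
import psutil
import xarray as xr

def rss_mib():
    # Resident set size of the current process, in MiB.
    return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2

# Hypothetical directory containing many distinct netCDF files.
for i, path in enumerate(sorted(glob.glob("data/*.nc"))):
    with xr.open_dataset(path) as ds:
        pass  # the dataset is opened and immediately closed
    print(f"after file {i}: {rss_mib():.1f} MiB")
```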

rachtsingh commented 1 year ago

I can confirm a similar issue, where opening a large number of files in a row causes memory usage to increase linearly (in my case, while watching, from 17 GB to 27 GB). This means I can't write long-running jobs, because the growing memory usage eventually causes a system failure.

I'm actually uncertain why the job doesn't get OOM-killed before the memory becomes a problem (that's on me to fix with ulimits or cgroups). We're accessing GRIB files using cfgrib (all of which have an index) on secondary SSD storage.
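
For context, a hedged sketch of the access pattern described here: many GRIB files opened one after another through cfgrib (`engine="cfgrib"` is the usual way to select that backend; the paths below are hypothetical), with RSS printed periodically so any growth is visible.

```python
import glob
import os
import psutil
import xarray as xr

def rss_mib():
    return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2

# Hypothetical location of the GRIB files on secondary SSD storage.
for i, path in enumerate(sorted(glob.glob("/mnt/ssd/grib/*.grib"))):
    with xr.open_dataset(path, engine="cfgrib") as ds:
        pass  # process the dataset here
    if i % 100 == 0:
        print(f"after {i} files: {rss_mib():.1f} MiB")
```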