deepgabani8 opened this issue 1 year ago (status: Open)
I'm not sure how `memory_profiler` calculates the memory usage, but I suspect this happens because Python's garbage collector is not required to run immediately after the `del`. Can you try manually triggering the garbage collector?
```python
import gc
import os
import psutil
import xarray as xr
from memory_profiler import profile

@profile
def main():
    path = 'ECMWF_ERA-40_subset.nc'
    gc.collect()  # collect before taking the baseline measurement
    print(f"Before opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    ds = xr.open_dataset(path)
    del ds
    gc.collect()  # force a collection after dropping the only reference
    print(f"After opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")

if __name__ == '__main__':
    print(f"Start: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    main()
    print(f"End: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
```
It still shows similar memory consumption.
```
Start: 185.6015625 MiB
Before opening file: 186.24609375 MiB
After opening file: 307.1328125 MiB
Filename: temp.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     7    186.0 MiB    186.0 MiB            1   @profile
     8                                          def main():
     9    186.0 MiB      0.0 MiB            1       path = 'ECMWF_ERA-40_subset.nc'
    10    186.0 MiB      0.0 MiB            1       gc.collect()
    11    186.2 MiB      0.2 MiB            1       print(f"Before opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    12    307.1 MiB    120.9 MiB            1       ds = xr.open_dataset(path)
    13    307.1 MiB      0.0 MiB            1       del ds
    14    307.1 MiB      0.0 MiB            1       gc.collect()
    15    307.1 MiB      0.0 MiB            1       print(f"After opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")

End: 307.1328125 MiB
```
If you care about memory usage, you should explicitly close files after you use them, e.g., by calling ds.close()
or by using a context manager. Does that work for you?
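For reference, a minimal sketch of the two patterns being suggested, reusing the file name from the snippet above:

```python
import xarray as xr

path = 'ECMWF_ERA-40_subset.nc'

# Option 1: close the dataset explicitly once you are done with it.
ds = xr.open_dataset(path)
# ... work with ds ...
ds.close()

# Option 2: a context manager closes the underlying file handle on exit.
with xr.open_dataset(path) as ds:
    ...  # work with ds
```

Both release the file handle; whether the resident set size shrinks afterwards is a separate question, as the numbers below show.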
Thanks @shoyer, but closing the dataset explicitly also doesn't seem to release the memory.
```
Start: 185.5078125 MiB
Before opening file: 186.28515625 MiB
After opening file: 307.75390625 MiB
Filename: temp.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     7    186.1 MiB    186.1 MiB            1   @profile
     8                                          def main():
     9    186.1 MiB      0.0 MiB            1       path = 'ECMWF_ERA-40_subset.nc'
    10    186.1 MiB      0.0 MiB            1       gc.collect()
    11    186.3 MiB      0.2 MiB            1       print(f"Before opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    12    307.8 MiB    121.5 MiB            1       ds = xr.open_dataset(path)
    13    307.8 MiB      0.0 MiB            1       ds.close()
    14    307.8 MiB      0.0 MiB            1       del ds
    15    307.8 MiB      0.0 MiB            1       gc.collect()
    16    307.8 MiB      0.0 MiB            1       print(f"After opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")

End: 307.75390625 MiB
```
I also tried the context manager, but I see the same memory consumption.
```
Start: 185.5625 MiB
Before opening file: 186.36328125 MiB
After opening file: 307.265625 MiB
Filename: temp.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     7    186.2 MiB    186.2 MiB            1   @profile
     8                                          def main():
     9    186.2 MiB      0.0 MiB            1       path = 'ECMWF_ERA-40_subset.nc'
    10    186.2 MiB      0.0 MiB            1       gc.collect()
    11    186.4 MiB      0.2 MiB            1       print(f"Before opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    12    307.3 MiB    120.9 MiB            1       with xr.open_dataset(path) as ds:
    13    307.3 MiB      0.0 MiB            1           ds.close()
    14    307.3 MiB      0.0 MiB            1           del ds
    15    307.3 MiB      0.0 MiB            1           gc.collect()
    16    307.3 MiB      0.0 MiB            1           print(f"After opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")

End: 307.265625 MiB
```
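One way to tell whether Python objects are genuinely still alive, as opposed to the allocator simply not returning freed pages to the operating system, is to compare `tracemalloc`'s count of live Python-level allocations with the RSS figures above. A minimal sketch, assuming the same sample file:

```python
import os
import tracemalloc

import psutil
import xarray as xr

proc = psutil.Process(os.getpid())
tracemalloc.start()

ds = xr.open_dataset('ECMWF_ERA-40_subset.nc')
ds.close()
del ds

# Bytes currently live / peak, as seen by Python's allocator hooks.
current, peak = tracemalloc.get_traced_memory()
print(f"Live tracked allocations: {current / 1024 ** 2:.1f} MiB (peak {peak / 1024 ** 2:.1f} MiB)")
print(f"Process RSS:              {proc.memory_info().rss / 1024 ** 2:.1f} MiB")
tracemalloc.stop()

# Caveat: buffers allocated inside the netCDF/HDF5 C libraries are not tracked
# by tracemalloc, so a small "live tracked" number next to a large RSS points
# at C-level caches or allocator behaviour rather than leaked Python objects.
```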
I've personally seen a lot of what looks like memory reuse in numpy and related libraries. I don't think any of this happens explicitly, but I have never investigated. My expectation is that if memory were not being released as expected, opening and closing the dataset in a loop would increase memory usage; it didn't on the recent library versions I have.
```
Start: 89.71875 MiB
Before opening file: 90.203125 MiB
After opening file: 96.6875 MiB
Filename: test.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     6     90.2 MiB     90.2 MiB            1   @profile
     7                                          def main():
     8     90.2 MiB      0.0 MiB            1       path = 'ECMWF_ERA-40_subset.nc'
     9     90.2 MiB      0.0 MiB            1       print(f"Before opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    10     96.7 MiB     -0.1 MiB         1001       for i in range(1000):
    11     96.7 MiB      6.4 MiB         1000           with xr.open_dataset(path) as ds:
    12     96.7 MiB     -0.1 MiB         1000               ...
    13     96.7 MiB      0.0 MiB            1       print(f"After opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")

End: 96.6875 MiB
```
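A sketch of the script profiled above, reconstructed from the profiler's line contents (the `__main__` boilerplate is assumed to match the earlier snippets):

```python
import os
import psutil
import xarray as xr
from memory_profiler import profile

@profile
def main():
    path = 'ECMWF_ERA-40_subset.nc'
    print(f"Before opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    # Repeatedly open and close the same file; RSS stays roughly flat.
    for i in range(1000):
        with xr.open_dataset(path) as ds:
            ...
    print(f"After opening file: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")

if __name__ == '__main__':
    print(f"Start: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
    main()
    print(f"End: {psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2} MiB")
```

With the same file opened 1000 times, the RSS stays essentially flat, matching the profile above.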
Thanks @DocOtak for the observation.
This holds only when iterating over the same file; there I observe the same behaviour, with memory usage staying flat across iterations.
When I tried to validate this by iterating over different files, however, the memory gradually increases with the number of files opened.
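A rough sketch of that second experiment; the glob pattern is a placeholder for any collection of distinct netCDF files:

```python
import glob
import os

import psutil
import xarray as xr

proc = psutil.Process(os.getpid())

# Placeholder: a directory containing many distinct netCDF files.
paths = sorted(glob.glob('data/*.nc'))

for i, path in enumerate(paths):
    with xr.open_dataset(path) as ds:
        ...  # no data is loaded or kept
    if i % 100 == 0:
        print(f"{i}: {proc.memory_info().rss / 1024 ** 2:.1f} MiB")
```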
I can confirm a similar issue: opening a large number of files in a row causes memory usage to increase linearly (in my case, while watching, from 17 GB to 27 GB). This means I can't write long-running jobs, because the growth eventually causes a system failure from memory exhaustion.
I'm actually uncertain why the job doesn't get OOM-killed before the memory becomes a problem (that part is mine to fix with ulimits or cgroups). We're accessing GRIB files using cfgrib (all of which have an index) on secondary SSD storage.
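As a stopgap for the long-running case, a minimal sketch of capping the job's address space via the `resource` module so it fails fast instead of dragging the whole machine down (Linux only; the 32 GiB figure is arbitrary):

```python
import resource

# Cap this process's virtual address space. Allocations beyond the limit fail,
# which typically surfaces as MemoryError in Python rather than pushing the
# system into swap or triggering the OOM killer for other processes.
limit_bytes = 32 * 1024 ** 3
resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))
```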
What happened?
Let's take this sample netCDF file (ECMWF_ERA-40_subset.nc, the file used in the snippets above).
Observe that the memory is not released even after deleting the `ds`.
Code
Console logs
I am using xarray==0.20.2 and gdal==3.5.1. Sister issue: https://github.com/ecmwf/cfgrib/issues/325#issuecomment-1363011917
What did you expect to happen?
Ideally, memory consumed by the xarray dataset should be released when the dataset is closed/deleted.
Environment