Closed: gerritholl closed this issue 5 years ago.
Note that if I close every file neatly, there is no segmentation fault.
In our code, this problem gets triggered because of xarray's lazy handling. If we have
with xr.open_dataset('file.nc') as ds:
    val = ds["field"]
    return val
then when a caller tries to use val, xarray reopens the dataset and does not close it again. This means the context manager is actually useless: we're using the context manager to close the file as soon as we have accessed the value, but later the file gets opened again anyway. This is against the intention of the code.
We can avoid this by calling val.load() from within the context manager, as the linked satpy PR above does. But what is the intention of xarray's design here? Should lazy reading close the file after opening and reading the value? I would say it probably should do something like
if file_was_not_open:
    open file
    get value
    close file  # this step currently omitted
    return value
else:
    get value
    return value
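For comparison, here is a minimal sketch of the .load() workaround mentioned above (the file path and variable name are placeholders): loading inside the context manager reads the data eagerly, so nothing has to reopen the file later.

import xarray as xr

def get_field(path):
    with xr.open_dataset(path) as ds:
        # .load() pulls the values into memory while the file is still open,
        # so the returned DataArray no longer needs the file after exiting the block
        return ds["field"].load()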
Is not closing the file after it has been opened for retrieving a "lazy" value by design, or might this be considered a wart/bug?
Looking through the code for open_dataset(), it appears that we have a bug: by default we don't use file locks! (We do use these by default for open_mfdataset().) This should really be fixed; I will try to make a pull request shortly.
Nevermind, I think we do properly use the right locks. But perhaps there is an issue with re-using open files when using netCDF4/HDF5 groups.
Does this same issue appear if you use engine='h5netcdf'? That would be an interesting data point.
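For reference, a minimal sketch of that check (file path and group name are placeholders), simply switching the backend:

import xarray as xr

# same kind of open call as in the report, but routed through the h5netcdf backend
ds = xr.open_dataset("file.nc", group="/group1", engine="h5netcdf", chunks=1024)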
Is not closing the file after it has been opened for retrieving a "lazy" value by design, or might this be considered a wart/bug?
You can achieve this behavior (nearly) by setting xarray.set_options(file_cache_maxsize=1).
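A small sketch of what that looks like in practice (the file path is a placeholder); set_options can also be used as a context manager so the reduced cache size only applies to a limited block:

import xarray as xr

# global setting: keep at most one file open in xarray's file cache
xr.set_options(file_cache_maxsize=1)

# or scope the setting to a block
with xr.set_options(file_cache_maxsize=1):
    ds = xr.open_dataset("file.nc")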
Note that the default for file_cache_maxsize is 128, which is suspiciously similar to the number of files/groups at which you encounter issues. In theory we use appropriate locks for automatically closing files when the cache size is exceeded, but this may not be working properly. If you can make a test case with synthetic data (e.g., including a script to make files), I can see if I can reproduce/fix this.
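In case it helps, here is a sketch of a script that could generate such synthetic data (sizes, file names, and group names are made up): it writes a number of netCDF-4 files, each containing many groups.

import numpy as np
import xarray as xr

n_files = 20
n_groups = 20

for i in range(n_files):
    fname = f"synthetic_{i:02d}.nc"
    for j in range(n_groups):
        ds = xr.Dataset({"field": (("y", "x"), np.random.rand(100, 100))})
        # mode="w" creates the file for the first group, "a" appends the rest
        ds.to_netcdf(fname, group=f"group_{j:02d}",
                     mode="w" if j == 0 else "a", engine="netcdf4")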
But to clarify the intent here: we don't close files around every access to data because doing so can cause a severe loss in performance, e.g., if you're using dask to read a bunch of chunks out of the same file.
I agree that it's unintuitive how we ignore the explicit context manager. Would it be better if we raised an error in these cases, when you later try to access data from a file that was explicitly closed? It's not immediately obvious to me how to refactor the code to achieve this, but this does seem like it would make for a better user experience.
Would it be better if we raised an error in these cases, when you later try to access data from a file that was explicitly closed?
I would prefer if it stayed the way it is. I can use the context manager to access specific variables but still hold on to the DataArray objects with dask arrays underneath and use them later. In the non-dask case, I'm not sure.
This can also be triggered by a .persist(...) call, although I don't yet understand the precise circumstances.
This triggers a segmentation fault (in the .persist() call) on my system, which may be related:
import xarray
import os
import subprocess
xarray.set_options(file_cache_maxsize=1)
f = "/path/to/netcdf/file.nc"
# open two different groups from the same file, chunked with dask
ds1 = xarray.open_dataset(f, group="/group1", chunks=1024)
ds2 = xarray.open_dataset(f, group="/group2", chunks=1024)
ds_cat = xarray.concat([ds1, ds2])
ds_cat.persist()  # the segmentation fault happens here
# list any .nc files this process still holds open
subprocess.run(fr"lsof | grep {os.getpid():d} | grep '\.nc$'", shell=True)
But there's something going on with the specific netCDF file, because when I create artificial groups, it does not segfault.
Fatal Python error: Segmentation fault
Thread 0x00007f542bfff700 (most recent call first):
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/multiprocessing/pool.py", line 470 in _handle_results
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 865 in run
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 917 in _bootstrap_inner
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 885 in _bootstrap
Thread 0x00007f5448ff9700 (most recent call first):
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/multiprocessing/pool.py", line 422 in _handle_tasks
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 865 in run
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 917 in _bootstrap_inner
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 885 in _bootstrap
Thread 0x00007f54497fa700 (most recent call first):
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/multiprocessing/pool.py", line 413 in _handle_workers
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 865 in run
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 917 in _bootstrap_inner
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 885 in _bootstrap
Thread 0x00007f5449ffb700 (most recent call first):
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/multiprocessing/pool.py", line 110 in worker
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 865 in run
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 917 in _bootstrap_inner
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 885 in _bootstrap
Thread 0x00007f544a7fc700 (most recent call first):
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/multiprocessing/pool.py", line 110 in worker
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 865 in run
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 917 in _bootstrap_inner
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 885 in _bootstrap
Thread 0x00007f544affd700 (most recent call first):
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/multiprocessing/pool.py", line 110 in worker
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 865 in run
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 917 in _bootstrap_inner
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 885 in _bootstrap
Thread 0x00007f544b7fe700 (most recent call first):
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/multiprocessing/pool.py", line 110 in worker
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 865 in run
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 917 in _bootstrap_inner
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 885 in _bootstrap
Thread 0x00007f544bfff700 (most recent call first):
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/multiprocessing/pool.py", line 110 in worker
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 865 in run
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 917 in _bootstrap_inner
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 885 in _bootstrap
Thread 0x00007f5458a75700 (most recent call first):
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/multiprocessing/pool.py", line 110 in worker
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 865 in run
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 917 in _bootstrap_inner
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 885 in _bootstrap
Thread 0x00007f5459276700 (most recent call first):
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/multiprocessing/pool.py", line 110 in worker
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 865 in run
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 917 in _bootstrap_inner
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 885 in _bootstrap
Thread 0x00007f5459a77700 (most recent call first):
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/multiprocessing/pool.py", line 110 in worker
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 865 in run
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 917 in _bootstrap_inner
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 885 in _bootstrap
Current thread 0x00007f54731236c0 (most recent call first):
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/netCDF4_.py", line 244 in _open_netcdf4_group
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/file_manager.py", line 173 in acquire
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/netCDF4_.py", line 56 in get_array
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/netCDF4_.py", line 74 in _getitem
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/core/indexing.py", line 778 in explicit_indexing_adapter
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/netCDF4_.py", line 64 in __getitem__
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/core/indexing.py", line 510 in __array__
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/numpy/core/numeric.py", line 538 in asarray
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/core/indexing.py", line 604 in __array__
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/numpy/core/numeric.py", line 538 in asarray
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/core/variable.py", line 213 in _as_array_or_item
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/core/variable.py", line 392 in values
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/core/variable.py", line 297 in data
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/core/variable.py", line 1204 in set_dims
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/core/combine.py", line 298 in ensure_common_dims
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/core/variable.py", line 2085 in concat
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/core/combine.py", line 305 in _dataset_concat
File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/core/combine.py", line 120 in concat
File "mwe13.py", line 19 in <module>
Segmentation fault (core dumped)
But there's something going on with the specific netCDF file, because when I create artificial groups, it does not segfault.
Can you share a netCDF file that causes this issue?
Thinking about this a little more, I suspect the issue might be related to how xarray opens a file multiple times to read different groups. It is very likely that libraries like netCDF-C don't handle this properly. Instead, we should probably open files once, and reuse them for reading from different groups.
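To illustrate the idea (this is only a sketch, not the actual change in the pull request; the file path and group names are placeholders): open the file once and walk down to each group on the already-open handle, instead of opening the file again for every group.

import netCDF4

def open_group(root, group_path):
    # walk down to a (possibly nested) group on an already-open Dataset
    ds = root
    for part in group_path.strip("/").split("/"):
        if part:
            ds = ds.groups[part]
    return ds

root = netCDF4.Dataset("file.nc", mode="r")  # one underlying file handle
grp1 = open_group(root, "/group1")
grp2 = open_group(root, "/group2")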
OK, I have a tentative fix up in https://github.com/pydata/xarray/pull/3082.
@gerritholl I have not been able to directly reproduce this issue, so it would be great if you could test my pull request before we merge it to verify whether or not the fix works.
There are some files triggering the problem at ftp://ftp.eumetsat.int/pub/OPS/out/test-data/Test-data-for-External-Users/MTG_FCI_Test-Data/FCI_L1C_24hr_Test_Data_for_Users/1.0/UNCOMPRESSED/. I will test the PR later (on Monday at the latest).
@shoyer I checked out your branch and the latter test example runs successfully - no segmentation fault and no files left open.
I will test the former test example now.
And I can confirm that the problem I reported originally on May 10 is also gone with #3082.
This is probably the wrong place to report it, but I haven't been able to reproduce this without using xarray. Repeatedly opening NetCDF4/HDF5 files and reading a group from them triggers a segmentation fault after about 130–150 openings. See details below.
Code Sample, a copy-pastable example if possible
Problem description
I have 70 NetCDF-4 files with 70 groups each. When I cycle through the files and read one group from each at a time, after about 130–150 times the next opening fails with a segmentation fault. Reading one group from one file at a time would require a total of 70*70 = 4900 openings. If I limit it to 20 groups from 20 files in total, it would require 400 openings. In either case, it fails after about 130–150 times. I'm using the Python xarray interface, but the error occurs in the HDF5 library. The message below includes the traceback in Python:
More usually, however, it fails with a segmentation fault and no further information.
The failure might happen in any file.
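For illustration, a rough sketch of the access pattern (file, group, and variable names are hypothetical); with 20 files and 20 groups this amounts to 400 openings, and the crash appears after roughly 130–150 of them.

import xarray as xr

files = [f"data_{i:02d}.nc" for i in range(20)]
groups = [f"group_{j:02d}" for j in range(20)]

fields = []
for group in groups:
    for path in files:
        ds = xr.open_dataset(path, group=group)  # every iteration opens the file again
        fields.append(ds["some_field"])          # lazy access; the file is not explicitly closed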
The full output of my script might end with:
prior to the segmentation fault. When running with -X faulthandler and a segmentation fault happens:
Expected Output
I expect no segmentation fault.
Output of xr.show_versions()