pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Segmentation fault reading many groups from many files #2954

Closed gerritholl closed 5 years ago

gerritholl commented 5 years ago

This is probably the wrong place to report it, but I haven't been able to reproduce this without using xarray. Repeatedly opening NetCDF4/HDF5 files and reading a group from them triggers a segmentation fault after about 130–150 openings. See details below.

Code Sample, a copy-pastable example if possible

from itertools import product
import glob
import netCDF4
import xarray

files = sorted(glob.glob("/media/nas/x21308/2019_05_Testdata/MTG/FCI/FDHSI/uncompressed/20170410_RC70/*BODY*.nc"))

# recursively collect the paths of all groups in a file
def get_groups(ds, pre=""):
    for g in ds.groups.keys():
        nm = pre + "/" + g
        yield from get_groups(ds[g], nm)
        yield nm

with netCDF4.Dataset(files[0]) as ds:
    groups = sorted(get_groups(ds))
print("total groups", len(groups), "total files", len(files))

ds_all = []
ng = 20  # number of groups to use
nf = 20  # number of files to use
print("using groups", ng, "using files", nf)
for i, (g, f) in enumerate(product(groups[:ng], files[:nf])):
    print("attempting", i, "group", g, "from", f)
    ds = xarray.open_dataset(f, group=g, decode_cf=False)
    ds_all.append(ds)

Problem description

I have 70 NetCDF-4 files with 70 groups each. When I cycle through the files and read one group at a time, after about 130–150 openings the next one fails with a segmentation fault. Reading every group from every file would require a total of 70*70 = 4900 openings; limiting to 20 groups from 20 files requires 400. In either case, it fails after about 130–150 openings. I'm using the Python xarray interface, but the error occurs in the HDF5 library. The message below includes the Python traceback:

HDF5-DIAG: Error detected in HDF5 (1.10.4) thread 140107218855616:
  #000: H5D.c line 485 in H5Dget_create_plist(): Can't get creation plist
    major: Dataset
    minor: Can't get value
  #001: H5Dint.c line 3159 in H5D__get_create_plist(): can't get dataset's creation property list
    major: Dataset
    minor: Can't get value
  #002: H5Dint.c line 3296 in H5D_get_create_plist(): datatype conversion failed
    major: Dataset                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
    minor: Can't convert datatypes
  #003: H5T.c line 5025 in H5T_convert(): datatype conversion failed
    major: Datatype
    minor: Can't convert datatypes
  #004: H5Tconv.c line 3227 in H5T__conv_vlen(): can't read VL data
    major: Datatype
    minor: Read failed
  #005: H5Tvlen.c line 853 in H5T_vlen_disk_read(): Unable to read VL information
    major: Datatype
    minor: Read failed
  #006: H5HG.c line 611 in H5HG_read(): unable to protect global heap
    major: Heap
    minor: Unable to protect metadata
  #007: H5HG.c line 264 in H5HG__protect(): unable to protect global heap
    major: Heap
    minor: Unable to protect metadata
  #008: H5AC.c line 1591 in H5AC_protect(): unable to get logging status
    major: Object cache
    minor: Internal error detected
  #009: H5Clog.c line 313 in H5C_get_logging_status(): cache magic value incorrect
    major: Invalid arguments to routine
    minor: Bad value
HDF5-DIAG: Error detected in HDF5 (1.10.4) thread 140107218855616:
  #000: H5L.c line 1138 in H5Literate(): link iteration failed
    major: Links
    minor: Iteration failed
  #001: H5L.c line 3440 in H5L__iterate(): link iteration failed
    major: Links
    minor: Iteration failed
  #002: H5Gint.c line 893 in H5G_iterate(): error iterating over links
    major: Symbol table
    minor: Iteration failed
  #003: H5Gobj.c line 683 in H5G__obj_iterate(): can't iterate over dense links
    major: Symbol table
    minor: Iteration failed
  #004: H5Gdense.c line 1054 in H5G__dense_iterate(): iteration operator failed
    major: Symbol table
    minor: Can't move to next iterator location
  #005: H5Glink.c line 493 in H5G__link_iterate_table(): iteration operator failed
    major: Symbol table
    minor: Can't move to next iterator location
Traceback (most recent call last):
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/file_manager.py", line 167, in acquire
    file = self._cache[self._key]
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/lru_cache.py", line 41, in __getitem__
    value = self._cache[key]
KeyError: [<function _open_netcdf4_group at 0x7f6d27b0f7b8>, ('/media/nas/x21308/2019_05_Testdata/MTG/FCI/FDHSI/uncompressed/20170410_RC70/W_XX-EUMETSAT-Darmstadt,IMG+SAT,MTI1+FCI-1C-RRAD-FDHSI-FD--CHK-BODY--L2P-NC4E_C_EUMT_20170410114417_GTT_DEV_20170410113908_20170410113917_N__C_0070_0065.nc', CombinedLock([<SerializableLock: 30e581d6-154c-486b-8b6a-b9a6c347f4e4>, <SerializableLock: bb132fc5-db57-499d-bc1f-661bc0025616>])), 'r', (('clobber', True), ('diskless', False), ('format', 'NETCDF4'), ('group', '/data/vis_04/measured'), ('persist', False))]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/mwe9.py", line 24, in <module>
    f, group=g, decode_cf=False)
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/api.py", line 363, in open_dataset
    filename_or_obj, group=group, lock=lock, **backend_kwargs)
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/netCDF4_.py", line 352, in open
    return cls(manager, lock=lock, autoclose=autoclose)
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/netCDF4_.py", line 311, in __init__
    self.format = self.ds.data_model
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/netCDF4_.py", line 356, in ds
    return self._manager.acquire().value
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/file_manager.py", line 173, in acquire
    file = self._opener(*self._args, **kwargs)
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/netCDF4_.py", line 244, in _open_netcdf4_group
    ds = nc4.Dataset(filename, mode=mode, **kwargs)
  File "netCDF4/_netCDF4.pyx", line 2291, in netCDF4._netCDF4.Dataset.__init__
  File "netCDF4/_netCDF4.pyx", line 1855, in netCDF4._netCDF4._ensure_nc_success
OSError: [Errno -101] NetCDF: HDF error: b'/media/nas/x21308/2019_05_Testdata/MTG/FCI/FDHSI/uncompressed/20170410_RC70/W_XX-EUMETSAT-Darmstadt,IMG+SAT,MTI1+FCI-1C-RRAD-FDHSI-FD--CHK-BODY--L2P-NC4E_C_EUMT_20170410114417_GTT_DEV_20170410113908_20170410113917_N__C_0070_0065.nc'

More usually, however, it fails with a segmentation fault and no further information.

The failure might happen in any file.

The full output of my script might end with:

attempting 137 group /data/ir_123/measured from /media/nas/x21308/2019_05_Testdata/MTG/FCI/FDHSI/uncompressed/20170410_RC70/W_XX-EUMETSAT-Darmstadt,IMG+SAT,MTI1+FCI-1C-RRAD-FDHSI-FD--CHK-BODY--L2P-NC4E_C_EUMT_20170410113734_GTT_DEV_20170410113225_20170410113234_N__C_0070_0018.nc
attempting 138 group /data/ir_123/measured from /media/nas/x21308/2019_05_Testdata/MTG/FCI/FDHSI/uncompressed/20170410_RC70/W_XX-EUMETSAT-Darmstadt,IMG+SAT,MTI1+FCI-1C-RRAD-FDHSI-FD--CHK-BODY--L2P-NC4E_C_EUMT_20170410113742_GTT_DEV_20170410113234_20170410113242_N__C_0070_0019.nc
attempting 139 group /data/ir_123/measured from /media/nas/x21308/2019_05_Testdata/MTG/FCI/FDHSI/uncompressed/20170410_RC70/W_XX-EUMETSAT-Darmstadt,IMG+SAT,MTI1+FCI-1C-RRAD-FDHSI-FD--CHK-BODY--L2P-NC4E_C_EUMT_20170410113751_GTT_DEV_20170410113242_20170410113251_N__C_0070_0020.nc
attempting 140 group /data/ir_123/quality_channel from /media/nas/x21308/2019_05_Testdata/MTG/FCI/FDHSI/uncompressed/20170410_RC70/W_XX-EUMETSAT-Darmstadt,IMG+SAT,MTI1+FCI-1C-RRAD-FDHSI-FD--CHK-BODY--L2P-NC4E_C_EUMT_20170410113508_GTT_DEV_20170410113000_20170410113008_N__C_0070_0001.nc
Fatal Python error: Segmentation fault

When running with `-X faulthandler`, a segmentation fault produces:

Fatal Python error: Segmentation fault

Current thread 0x00007ff6ab89d6c0 (most recent call first):
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/netCDF4_.py", line 244 in _open_netcdf4_group
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/file_manager.py", line 173 in acquire
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/netCDF4_.py", line 356 in ds
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/netCDF4_.py", line 311 in __init__
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/netCDF4_.py", line 352 in open
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/api.py", line 363 in open_dataset
  File "/tmp/mwe9.py", line 24 in <module>
Segmentation fault (core dumped)

Expected Output

I expect no segmentation fault.

Output of xr.show_versions()

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.7.1 | packaged by conda-forge | (default, Feb 18 2019, 01:42:00) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 4.12.14-lp150.12.58-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.6.2

xarray: 0.12.0
pandas: 0.24.2
numpy: 1.16.2
scipy: 1.2.1
netCDF4: 1.5.0.1
pydap: None
h5netcdf: 0.7.1
h5py: 2.9.0
Nio: None
zarr: None
cftime: 1.0.3.4
nc_time_axis: None
PseudonetCDF: None
rasterio: 1.0.22
cfgrib: None
iris: None
bottleneck: None
dask: 1.1.5
distributed: 1.26.1
matplotlib: 3.0.3
cartopy: 0.17.0
seaborn: None
setuptools: 40.8.0
pip: 19.0.3
conda: None
pytest: None
IPython: 7.4.0
sphinx: 2.0.0
```

The machine is running openSUSE 15.0 with `Linux oflws222 4.12.14-lp150.12.58-default #1 SMP Mon Apr 1 15:20:46 UTC 2019 (58fcc15) x86_64 x86_64 x86_64 GNU/Linux`. The problem has also been reported on other machines, such as one running CentOS Linux release 7.6.1810 (Core) with `Linux oflks333.dwd.de 3.10.0-957.5.1.el7.x86_64 #1 SMP Fri Feb 1 14:54:57 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux`. The HDF5 installation on my machine is from the SuSE package.

From `strings /usr/lib64/libhdf5.so`, I get:

```
SUMMARY OF THE HDF5 CONFIGURATION
=================================

General Information:
-------------------
HDF5 Version: 1.10.1
Host system: x86_64-suse-linux-gnu
Byte sex: little-endian
Installation point: /usr

Compiling Options:
------------------
Build Mode: production
Debugging Symbols: no
Asserts: no
Profiling: no
Optimization Level: high

Linking Options:
----------------
Libraries: static, shared
Statically Linked Executables:
LDFLAGS:
H5_LDFLAGS:
AM_LDFLAGS:
Extra libraries: -lpthread -lz -ldl -lm
Archiver: ar
Ranlib: ranlib

Languages:
----------
C: yes
C Compiler: /usr/bin/gcc
CPPFLAGS:
H5_CPPFLAGS: -D_GNU_SOURCE -D_POSIX_C_SOURCE=200112L -DNDEBUG -UH5_DEBUG_API
AM_CPPFLAGS:
C Flags: -fmessage-length=0 -grecord-gcc-switches -O2 -Wall -D_FORTIFY_SOURCE=2 -fstack-protector-strong -funwind-tables -fasynchronous-unwind-tables -fstack-clash-protection -g
H5 C Flags: -std=c99 -pedantic -Wall -W -Wundef -Wshadow -Wpointer-arith -Wbad-function-cast -Wcast-qual -Wcast-align -Wwrite-strings -Wconversion -Wstrict-prototypes -Wmissing-prototypes -Wmissing-declarations -Wredundant-decls -Wnested-externs -finline-functions -s -Wno-inline -Wno-aggregate-return -O
AM C Flags:
Shared C Library: yes
Static C Library: yes

Fortran: yes
Fortran Compiler: /usr/bin/gfortran
Fortran Flags:
H5 Fortran Flags: -pedantic -Wall -Wextra -Wunderflow -Wimplicit-interface -Wsurprising -Wno-c-binding-type -s -O2
AM Fortran Flags:
Shared Fortran Library: yes
Static Fortran Library: yes

C++: yes
C++ Compiler: /usr/bin/g++
C++ Flags: -fmessage-length=0 -grecord-gcc-switches -O2 -Wall -D_FORTIFY_SOURCE=2 -fstack-protector-strong -funwind-tables -fasynchronous-unwind-tables -fstack-clash-protection -g
H5 C++ Flags: -pedantic -Wall -W -Wundef -Wshadow -Wpointer-arith -Wcast-qual -Wcast-align -Wwrite-strings -Wconversion -Wredundant-decls -Winline -Wsign-promo -Woverloaded-virtual -Wold-style-cast -Weffc++ -Wreorder -Wnon-virtual-dtor -Wctor-dtor-privacy -Wabi -finline-functions -s -O
AM C++ Flags:
Shared C++ Library: yes
Static C++ Library: yes

Java: no

Features:
---------
Parallel HDF5: no
High-level library: yes
Threadsafety: yes
Default API mapping: v110
With deprecated public symbols: yes
I/O filters (external): deflate(zlib)
MPE: no
Direct VFD: no
dmalloc: no
Packages w/ extra debug output: none
API tracing: no
Using memory checker: no
Memory allocation sanity checks: no
Metadata trace file: no
Function stack tracing: no
Strict file format checks: no
Optimization instrumentation: no
```
gerritholl commented 5 years ago

Note that if I close every file neatly, there is no segmentation fault.

gerritholl commented 5 years ago

In our code, this problem gets triggered by xarray's lazy file handling. If we have

import xarray as xr

def get_field():
    with xr.open_dataset('file.nc') as ds:
        val = ds["field"]
    return val

then when a caller tries to use val, xarray reopens the dataset and does not close it again. This means the context manager is actually useless: we're using the context manager to close the file as soon as we have accessed the value, but later the file gets opened again anyway. This is against the intention of the code.

We can avoid this by calling val.load() from within the context manager, as the linked satpy PR above does. But what is the intention of xarray's design here? Should lazy reading close the file after opening and reading the value? I would say it should probably do something like

if file_was_not_open:
    open file
    get value
    close file # this step currently omitted
    return value
else:
    get value
    return value

Is not closing the file after it has been opened to retrieve a "lazy" value by design, or might this be considered a wart/bug?
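The val.load() workaround described above can be sketched as follows (a minimal illustration with synthetic data; the real code reads fields from the FCI files):

```python
import os
import tempfile

import numpy as np
import xarray as xr

def read_field(path, name):
    # .load() pulls the values into memory while the file is still
    # open, so using the returned array later never reopens the file.
    with xr.open_dataset(path) as ds:
        return ds[name].load()

# demo: write a tiny synthetic file, then read it back eagerly
path = os.path.join(tempfile.mkdtemp(), "demo.nc")
xr.Dataset({"field": ("x", np.arange(4.0))}).to_netcdf(path)
val = read_field(path, "field")
```

After read_field returns, val is a plain in-memory DataArray; accessing val.values does not touch the file again.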

shoyer commented 5 years ago

Looking through the code for open_dataset(), it appears that we have a bug: by default we don't use file locks! (We do use them by default for open_mfdataset().) This should really be fixed; I will try to make a pull request shortly.

shoyer commented 5 years ago

Nevermind, I think we do properly use the right locks. But perhaps there is an issue with re-using open files when using netCDF4/HDF5 groups.

Does this same issue appear if you use engine='h5netcdf'? That would be an interesting data point.

shoyer commented 5 years ago

Is not closing the file after it has been opened to retrieve a "lazy" value by design, or might this be considered a wart/bug?

You can achieve this behavior (nearly) by setting xarray.set_options(file_cache_maxsize=1).

Note that the default for file_cache_maxsize is 128, which is suspiciously similar to the number of files/groups at which you encounter issues. In theory we use appropriate locks for automatically closing files when the cache size is exceeded, but this may not be working properly. If you can make a test case with synthetic data (e.g., including a script to make files) I can see if I can reproduce/fix this.
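The mechanics being described — an LRU cache of open file handles keyed on the open arguments, which evicts and closes the least recently used handle once the size limit is exceeded — can be sketched like this (illustrative only, not xarray's actual implementation):

```python
from collections import OrderedDict

class LRUFileCache:
    """Minimal sketch of an LRU cache of open file handles."""

    def __init__(self, maxsize=128):
        self.maxsize = maxsize
        self._cache = OrderedDict()

    def acquire(self, key, opener):
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            return self._cache[key]
        handle = opener()
        self._cache[key] = handle
        if len(self._cache) > self.maxsize:
            _, evicted = self._cache.popitem(last=False)
            evicted.close()  # closing a handle still in use elsewhere can crash
        return handle

class FakeFile:
    def __init__(self):
        self.closed = False
    def close(self):
        self.closed = True

cache = LRUFileCache(maxsize=2)
a = cache.acquire("a", FakeFile)
b = cache.acquire("b", FakeFile)
c = cache.acquire("c", FakeFile)  # exceeds maxsize, evicts and closes "a"
```

With maxsize=128 and each file/group combination counting as a distinct key, eviction starts at the 129th opening, which lines up with the ~130–150 failure point.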

But to clarify the intent here: we don't close files around every access to data because doing so can cause a severe loss in performance, e.g., if you're using dask to read many chunks out of the same file.

I agree that it's unintuitive how we ignore the explicit context manager. Would it be better if we raised an error in these cases, when you later try to access data from a file that was explicitly closed? It's not immediately obvious to me how to refactor the code to achieve this, but this does seem like it would make for a better user experience.

djhoese commented 5 years ago

Would it be better if we raised an error in these cases, when you later try to access data from a file that was explicitly closed?

I would prefer if it stayed the way it is. I can use the context manager to access specific variables but still hold on to the DataArray objects with dask arrays underneath and use them later. In the non-dask case, I'm not sure.

gerritholl commented 5 years ago

This can also be triggered by a .persist(...) call, although I don't yet understand the precise circumstances.

gerritholl commented 5 years ago

This triggers a segmentation fault (in the .persist() call) on my system, which may be related:

import os
import subprocess
import xarray

xarray.set_options(file_cache_maxsize=1)
f = "/path/to/netcdf/file.nc"
ds1 = xarray.open_dataset(f, group="/group1", chunks=1024)
ds2 = xarray.open_dataset(f, group="/group2", chunks=1024)
ds_cat = xarray.concat([ds1, ds2])
ds_cat.persist()
# list any .nc files this process still holds open
subprocess.run(fr"lsof | grep {os.getpid():d} | grep '\.nc$'", shell=True)

But there's something going on with the specific netCDF file, because when I create artificial groups, it does not segfault.

Fatal Python error: Segmentation fault

Thread 0x00007f542bfff700 (most recent call first):
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/multiprocessing/pool.py", line 470 in _handle_results
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 865 in run
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 917 in _bootstrap_inner
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 885 in _bootstrap

Thread 0x00007f5448ff9700 (most recent call first):
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/multiprocessing/pool.py", line 422 in _handle_tasks
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 865 in run
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 917 in _bootstrap_inner
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 885 in _bootstrap

Thread 0x00007f54497fa700 (most recent call first):
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/multiprocessing/pool.py", line 413 in _handle_workers
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 865 in run
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 917 in _bootstrap_inner
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 885 in _bootstrap

Thread 0x00007f5449ffb700 (most recent call first):
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/multiprocessing/pool.py", line 110 in worker
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 865 in run
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 917 in _bootstrap_inner
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 885 in _bootstrap

Thread 0x00007f544a7fc700 (most recent call first):
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/multiprocessing/pool.py", line 110 in worker
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 865 in run
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 917 in _bootstrap_inner
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 885 in _bootstrap

Thread 0x00007f544affd700 (most recent call first):
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/multiprocessing/pool.py", line 110 in worker
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 865 in run
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 917 in _bootstrap_inner
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 885 in _bootstrap

Thread 0x00007f544b7fe700 (most recent call first):
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/multiprocessing/pool.py", line 110 in worker
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 865 in run
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 917 in _bootstrap_inner
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 885 in _bootstrap

Thread 0x00007f544bfff700 (most recent call first):
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/multiprocessing/pool.py", line 110 in worker
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 865 in run
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 917 in _bootstrap_inner
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 885 in _bootstrap

Thread 0x00007f5458a75700 (most recent call first):
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/multiprocessing/pool.py", line 110 in worker
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 865 in run
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 917 in _bootstrap_inner
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 885 in _bootstrap

Thread 0x00007f5459276700 (most recent call first):
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/multiprocessing/pool.py", line 110 in worker
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 865 in run
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 917 in _bootstrap_inner
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 885 in _bootstrap

Thread 0x00007f5459a77700 (most recent call first):
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/multiprocessing/pool.py", line 110 in worker
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 865 in run
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 917 in _bootstrap_inner
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/threading.py", line 885 in _bootstrap

Current thread 0x00007f54731236c0 (most recent call first):
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/netCDF4_.py", line 244 in _open_netcdf4_group
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/file_manager.py", line 173 in acquire
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/netCDF4_.py", line 56 in get_array
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/netCDF4_.py", line 74 in _getitem
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/core/indexing.py", line 778 in explicit_indexing_adapter
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/netCDF4_.py", line 64 in __getitem__
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/core/indexing.py", line 510 in __array__
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/numpy/core/numeric.py", line 538 in asarray
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/core/indexing.py", line 604 in __array__
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/numpy/core/numeric.py", line 538 in asarray
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/core/variable.py", line 213 in _as_array_or_item
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/core/variable.py", line 392 in values
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/core/variable.py", line 297 in data
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/core/variable.py", line 1204 in set_dims
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/core/combine.py", line 298 in ensure_common_dims
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/core/variable.py", line 2085 in concat
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/core/combine.py", line 305 in _dataset_concat
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/core/combine.py", line 120 in concat
  File "mwe13.py", line 19 in <module>
Segmentation fault (core dumped)

shoyer commented 5 years ago

But there's something with the specific netcdf file going on, for when I create artificial groups, it does not segfault.

Can you share a netCDF file that causes this issue?

shoyer commented 5 years ago

Thinking about this a little more, I suspect the issue might be related to how xarray opens a file multiple times to read different groups. It is very likely that libraries like netCDF-C don't handle this properly. Instead, we should probably open files once, and reuse them for reading from different groups.

shoyer commented 5 years ago

OK, I have a tentative fix up in https://github.com/pydata/xarray/pull/3082.

@gerritholl I have not been able to directly reproduce this issue, so it would be great if you could test my pull request before we merge it to verify whether or not the fix works.

gerritholl commented 5 years ago

There are some files triggering the problem at ftp://ftp.eumetsat.int/pub/OPS/out/test-data/Test-data-for-External-Users/MTG_FCI_Test-Data/FCI_L1C_24hr_Test_Data_for_Users/1.0/UNCOMPRESSED/. I will test the PR later (on Monday at the latest).

gerritholl commented 5 years ago

@shoyer I checked out your branch and the latter test example runs successfully - no segmentation fault and no files left open.

I will test the former test example now.

gerritholl commented 5 years ago

And I can confirm that the problem I reported originally on May 10 is also gone with #3082.