pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

open_mfdataset crashes with segfault #2554

Closed: shoyer closed this issue 5 years ago

shoyer commented 5 years ago

Copied from the report on the xarray mailing list:


This crashes with SIGSEGV:

# foo.py

import xarray as xr
ds = xr.open_mfdataset('/tmp/nam/bufr.701940/bufr*201012011*.nc', data_vars='minimal', parallel=True)
print(ds)

GDB session:

[gtrojan@asok precip]$ gdb python3 
GNU gdb (GDB) Fedora 8.1.1-3.fc28
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python3...done.
(gdb) r
Starting program: /mnt/sdc1/local/Python-3.6.5/bin/python3 foo.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7fffe6dfb700 (LWP 11176)]
[New Thread 0x7fffe4dfa700 (LWP 11177)]
[New Thread 0x7fffdedf9700 (LWP 11178)]
[New Thread 0x7fffdadf8700 (LWP 11179)]
[New Thread 0x7fffd6df7700 (LWP 11180)]
[New Thread 0x7fffd2df6700 (LWP 11181)]
[New Thread 0x7fffcedf5700 (LWP 11182)]
warning: Loadable section ".note.gnu.property" outside of ELF segments
[Thread 0x7fffdadf8700 (LWP 11179) exited]
[Thread 0x7fffd2df6700 (LWP 11181) exited]
[Thread 0x7fffcedf5700 (LWP 11182) exited]
[Thread 0x7fffd6df7700 (LWP 11180) exited]
[Thread 0x7fffdedf9700 (LWP 11178) exited]
[Thread 0x7fffe4dfa700 (LWP 11177) exited]
[Thread 0x7fffe6dfb700 (LWP 11176) exited]
Detaching after fork from child process 11183.
[New Thread 0x7fffcedf5700 (LWP 11184)]
[New Thread 0x7fffe56f1700 (LWP 11185)]
[New Thread 0x7fffdedf9700 (LWP 11186)]
[New Thread 0x7fffdadf8700 (LWP 11187)]
[New Thread 0x7fffd6df7700 (LWP 11188)]
[New Thread 0x7fffd2df6700 (LWP 11189)]
[New Thread 0x7fffa7fff700 (LWP 11190)]
[New Thread 0x7fff9bfff700 (LWP 11191)]
[New Thread 0x7fff93fff700 (LWP 11192)]
[New Thread 0x7fff8bfff700 (LWP 11193)]
[New Thread 0x7fff83fff700 (LWP 11194)]
warning: Loadable section ".note.gnu.property" outside of ELF segments
warning: Loadable section ".note.gnu.property" outside of ELF segments

Thread 9 "python3" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffcedf5700 (LWP 11184)]
0x00007fffbd95cca9 in H5SL_insert_common () from /usr/lib64/libhdf5.so.10

This happens with the most recent dask and xarray:

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.18.14-200.fc28.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_CA.UTF-8
LOCALE: en_CA.UTF-8

xarray: 0.11.0
pandas: 0.23.0
numpy: 1.15.2
scipy: 1.1.0
netCDF4: 1.4.0
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.0.0b1
PseudonetCDF: None
rasterio: None
iris: None
bottleneck: 1.3.0.dev0
cyordereddict: None
dask: 0.20.1
distributed: 1.22.1
matplotlib: 3.0.0
cartopy: None
seaborn: 0.9.0
setuptools: 39.0.1
pip: 18.1
conda: None
pytest: 3.6.3
IPython: 6.3.1
sphinx: 1.8.1

When I change the code in open_mfdataset to use the 'processes' (multiprocessing) scheduler, the code runs as expected.

Line 619 in api.py:

#datasets, file_objs = dask.compute(datasets, file_objs)
datasets, file_objs = dask.compute(datasets, file_objs, scheduler='processes')

The files are about 300 kB each; my example reads only 2 of them.
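
For reference, the same scheduler override can be applied from user code without patching xarray (a minimal sketch, assuming a dask version new enough to have the dask.config API, i.e. >= 0.18):

import dask
import xarray as xr

# Select dask's multiprocessing scheduler so libhdf5, which is
# typically not built thread-safe, is never entered concurrently
# from multiple threads of one process.
with dask.config.set(scheduler='processes'):
    ds = xr.open_mfdataset('/tmp/nam/bufr.701940/bufr*201012011*.nc',
                           data_vars='minimal', parallel=True)
    print(ds)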

shoyer commented 5 years ago

It would be good to know if this occurs with parallel=False.

yt87 commented 5 years ago

No, with parallel=False it works fine.

yt87 commented 5 years ago

Another puzzle; I don't know whether it is related to the crashes.

Trying to localize the issue, I added a line after the else on line 453 in netCDF4_.py:

print('=======', name, encoding.get('chunksizes'))

then ran:

ds0 = xr.open_dataset('/tmp/nam/bufr.701940/bufr.701940.2010123112.nc')
ds0.to_netcdf('/tmp/d0.nc')

This prints:

======= hlcy (1, 85)
======= cdbp (1, 85)
======= hovi (1, 85)
======= itim (1024,)

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-5-aeb92962e874> in <module>()
      1 ds0 = xr.open_dataset('/tmp/nam/bufr.701940/bufr.701940.2010123112.nc')
----> 2 ds0.to_netcdf('/tmp/d0.nc')

/usr/local/Python-3.6.5/lib/python3.6/site-packages/xarray/core/dataset.py in to_netcdf(self, path, mode, format, group, engine, encoding, unlimited_dims, compute)
   1220                          engine=engine, encoding=encoding,
   1221                          unlimited_dims=unlimited_dims,
-> 1222                          compute=compute)
   1223 
   1224     def to_zarr(self, store=None, mode='w-', synchronizer=None, group=None,

/usr/local/Python-3.6.5/lib/python3.6/site-packages/xarray/backends/api.py in to_netcdf(dataset, path_or_file, mode, format, group, engine, encoding, unlimited_dims, compute, multifile)
    718         # to be parallelized with dask
    719         dump_to_store(dataset, store, writer, encoding=encoding,
--> 720                       unlimited_dims=unlimited_dims)
    721         if autoclose:
    722             store.close()

/usr/local/Python-3.6.5/lib/python3.6/site-packages/xarray/backends/api.py in dump_to_store(dataset, store, writer, encoder, encoding, unlimited_dims)
    761 
    762     store.store(variables, attrs, check_encoding, writer,
--> 763                 unlimited_dims=unlimited_dims)
    764 
    765 

/usr/local/Python-3.6.5/lib/python3.6/site-packages/xarray/backends/common.py in store(self, variables, attributes, check_encoding_set, writer, unlimited_dims)
    264         self.set_dimensions(variables, unlimited_dims=unlimited_dims)
    265         self.set_variables(variables, check_encoding_set, writer,
--> 266                            unlimited_dims=unlimited_dims)
    267 
    268     def set_attributes(self, attributes):

/usr/local/Python-3.6.5/lib/python3.6/site-packages/xarray/backends/common.py in set_variables(self, variables, check_encoding_set, writer, unlimited_dims)
    302             check = vn in check_encoding_set
    303             target, source = self.prepare_variable(
--> 304                 name, v, check, unlimited_dims=unlimited_dims)
    305 
    306             writer.add(source, target)

/usr/local/Python-3.6.5/lib/python3.6/site-packages/xarray/backends/netCDF4_.py in prepare_variable(self, name, variable, check_encoding, unlimited_dims)
    466                 least_significant_digit=encoding.get(
    467                     'least_significant_digit'),
--> 468                 fill_value=fill_value)
    469             _disable_auto_decode_variable(nc4_var)
    470 

netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Dataset.createVariable()

netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Variable.__init__()

netCDF4/_netCDF4.pyx in netCDF4._netCDF4._ensure_nc_success()

RuntimeError: NetCDF: Bad chunk sizes.

The dataset is:

<xarray.Dataset>
Dimensions:  (dim_1: 1, dim_prof: 60, dim_slyr: 4, ftim: 85, itim: 1)
Coordinates:
  * ftim     (ftim) timedelta64[ns] 00:00:00 01:00:00 ... 3 days 12:00:00
  * itim     (itim) datetime64[ns] 2010-12-31T12:00:00
Dimensions without coordinates: dim_1, dim_prof, dim_slyr
Data variables:
    stnm     (dim_1) float64 ...
    rpid     (dim_1) object ...
    clat     (dim_1) float32 ...
    clon     (dim_1) float32 ...
    gelv     (dim_1) float32 ...
    clss     (itim, ftim) float32 ...
    pres     (itim, ftim, dim_prof) float32 ...
    tmdb     (itim, ftim, dim_prof) float32 ...
    uwnd     (itim, ftim, dim_prof) float32 ...
    vwnd     (itim, ftim, dim_prof) float32 ...
    spfh     (itim, ftim, dim_prof) float32 ...
    omeg     (itim, ftim, dim_prof) float32 ...
    cwtr     (itim, ftim, dim_prof) float32 ...
    dtcp     (itim, ftim, dim_prof) float32 ...
    dtgp     (itim, ftim, dim_prof) float32 ...
    dtsw     (itim, ftim, dim_prof) float32 ...
    dtlw     (itim, ftim, dim_prof) float32 ...
    cfrl     (itim, ftim, dim_prof) float32 ...
    tkel     (itim, ftim, dim_prof) float32 ...
    imxr     (itim, ftim, dim_prof) float32 ...
    pmsl     (itim, ftim) float32 ...
    prss     (itim, ftim) float32 ...
    tmsk     (itim, ftim) float32 ...
    tmin     (itim, ftim) float32 ...
    tmax     (itim, ftim) float32 ...
    wtns     (itim, ftim) float32 ...
    tp01     (itim, ftim) float32 ...
    c01m     (itim, ftim) float32 ...
    srlm     (itim, ftim) float32 ...
    u10m     (itim, ftim) float32 ...
    v10m     (itim, ftim) float32 ...
    th10     (itim, ftim) float32 ...
    q10m     (itim, ftim) float32 ...
    t2ms     (itim, ftim) float32 ...
    q2ms     (itim, ftim) float32 ...
    sfex     (itim, ftim) float32 ...
    vegf     (itim, ftim) float32 ...
    cnpw     (itim, ftim) float32 ...
    fxlh     (itim, ftim) float32 ...
    fxlp     (itim, ftim) float32 ...
    fxsh     (itim, ftim) float32 ...
    fxss     (itim, ftim) float32 ...
    fxsn     (itim, ftim) float32 ...
    swrd     (itim, ftim) float32 ...
    swru     (itim, ftim) float32 ...
    lwrd     (itim, ftim) float32 ...
    lwru     (itim, ftim) float32 ...
    lwrt     (itim, ftim) float32 ...
    swrt     (itim, ftim) float32 ...
    snfl     (itim, ftim) float32 ...
    smoi     (itim, ftim) float32 ...
    swem     (itim, ftim) float32 ...
    n01m     (itim, ftim) float32 ...
    r01m     (itim, ftim) float32 ...
    bfgr     (itim, ftim) float32 ...
    sltb     (itim, ftim) float32 ...
    smc1     (itim, ftim, dim_slyr) float32 ...
    stc1     (itim, ftim, dim_slyr) float32 ...
    lsql     (itim, ftim) float32 ...
    lcld     (itim, ftim) float32 ...
    mcld     (itim, ftim) float32 ...
    hcld     (itim, ftim) float32 ...
    snra     (itim, ftim) float32 ...
    wxts     (itim, ftim) float32 ...
    wxtp     (itim, ftim) float32 ...
    wxtz     (itim, ftim) float32 ...
    wxtr     (itim, ftim) float32 ...
    ustm     (itim, ftim) float32 ...
    vstm     (itim, ftim) float32 ...
    hlcy     (itim, ftim) float32 ...
    cdbp     (itim, ftim) float32 ...
    hovi     (itim, ftim) float32 ...
Attributes:
    model:    Unknown
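
For what it's worth, the chunk sizes printed above can also be inspected without patching netCDF4_.py, through the encoding dict that xarray preserves from the source file (a minimal sketch using that public attribute):

import xarray as xr

ds0 = xr.open_dataset('/tmp/nam/bufr.701940/bufr.701940.2010123112.nc')
for name, var in ds0.variables.items():
    # 'chunksizes' is only present for variables that were chunked on disk
    chunks = var.encoding.get('chunksizes')
    if chunks is not None:
        print(name, chunks)
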
shoyer commented 5 years ago

@yt87 How much data is necessary to reproduce this? Is it feasible to share copies of the problematic files?

yt87 commented 5 years ago

About 600 kB for 2 files. I could spend some time trying to trim that down, but if there is a way to upload the whole set, that would be easier for me.

shoyer commented 5 years ago

600 KB? You should be able to attach that to a comment on GitHub -- you'll just need to combine the files into a .zip or .gz archive first.

yt87 commented 5 years ago

soundings.zip

I did some further tests; the crash occurs somewhat randomly.

yt87 commented 5 years ago

I meant at random points during execution. The script crashed every time.

yt87 commented 5 years ago

The error RuntimeError: NetCDF: Bad chunk sizes. is unrelated to the original problem with the SIGSEGV crashes. It is caused by a bug in the netCDF-C library, which is fixed in the latest version, 4.6.1. As of yesterday, the newest netcdf4-python manylinux wheel still bundles an older version, so the solution is to build netcdf4-python from source.
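
To confirm which library versions a given netcdf4-python install is linked against, the module exposes standard version attributes (a quick sketch using those):

import netCDF4

print(netCDF4.__version__)            # version of the Python binding itself
print(netCDF4.__netcdf4libversion__)  # version of the linked netCDF-C library
print(netCDF4.__hdf5libversion__)     # version of the linked HDF5 library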

The SIGSEGV crashes occur with other datasets as well. An example test set I used:

    import numpy as np
    import pandas as pd
    import xarray as xr

    year = 2010  # the original snippet presumably ran inside a loop over years
    file = '/tmp/dx{:d}.nc'.format(year)
    #times = pd.date_range('{:d}-01-01'.format(year), '{:d}-12-31'.format(year), name='time')
    times = pd.RangeIndex(year, year + 300, name='time')
    v = np.array([np.random.random((32, 32)) for i in range(times.size)])
    dx = xr.Dataset({'v': (('time', 'y', 'x'), v)}, {'time': times})
    dx.to_netcdf(file, format='NETCDF4', encoding={'time': {'chunksizes': (1024,)}},
                 unlimited_dims='time')
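
For completeness, a sketch of reading such a test set back, mirroring the call from the original report (hypothetical: the glob below simply matches the files the snippet above writes):

import xarray as xr

# With the default threaded scheduler this read crashed intermittently
# inside libhdf5; with scheduler='processes' it ran as expected.
ds = xr.open_mfdataset('/tmp/dx*.nc', parallel=True)
print(ds)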

A simple workaround is to change the scheduler, as I did in my original post.

yt87 commented 5 years ago

After upgrading to Anaconda Python 3.7, the code works without crashes. I think this issue can be closed.