pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Segmentation fault with open_datatree(engine='netCDF4') or xr.open_dataset(engine='netCDF4') with data variables that are a string type #9093

Closed eni-awowale closed 3 months ago

eni-awowale commented 3 months ago

What happened?

Hi everyone, excited to report my first bug 🐛! I have been creating some grouped netCDF4 test files for unit testing our internal repository, and I started getting segmentation faults when I added variables of a string datatype. This only happens with engine='netCDF4'; when I change the engine to 'h5netcdf' there are no segmentation faults. We think this has something to do with the netcdf-c library. However, I have only been able to replicate the issue with open_datatree() using the default netCDF4 engine, and not with nc4.Dataset() or xr.open_dataset(). My colleague @lsterzinger has been getting segmentation faults with all three of these methods and will elaborate on this thread.

We've been able to narrow this down to a problem with data variables with a non-numerical datatype, by creating netCDF4 files with variables of a string datatype, np.dtype('<U4'). open_datatree() seg faults after the fourth call (see example below). I have not been able to replicate segmentation faults for netCDF4 files without string data variables, even with a thousand calls to open_datatree() or with the engine set to 'h5netcdf' for datasets with string variables.
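For readers without access to the OCO-2 granule, here is a minimal sketch of how a small grouped file with a `<U4` string variable could be written with xarray. The helper name `write_grouped_test_file`, the file layout, and the variable values are illustrative assumptions loosely modelled on the tree shown further down; they are not the reporters' actual test files, and it has not been confirmed that a file this small triggers the crash.

```python
def write_grouped_test_file(path):
    """Write a tiny netCDF4 file: numeric root group plus one group holding strings.

    Hypothetical helper for illustration only.
    """
    # Imported lazily so the helper can be defined even where the I/O stack is missing.
    import numpy as np
    import xarray as xr

    # Root group: numeric data only.
    root = xr.Dataset({"SZA": ("sounding_dim", np.arange(3, dtype="float32"))})
    root.to_netcdf(path, mode="w", engine="netcdf4")

    # Child group with a 1-D string variable of dtype '<U4', as described above.
    sequences = xr.Dataset(
        {"SequencesName": ("sequences_dim", np.array(["seq1", "seq2"], dtype="<U4"))}
    )
    # mode="a" appends the group to the file created above instead of overwriting it.
    sequences.to_netcdf(path, mode="a", group="Sequences", engine="netcdf4")
    return path
```

Note that xarray registers the netCDF4-python backend under the lowercase name "netcdf4", which is why the sketch spells the engine that way.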

To replicate this error:

In [9]: max_retry = 0
   ...: while max_retry < 15:
   ...:     oco2_tree = open_datatree(
   ...:         "./downloads/OCO2_L2_Lite_SIF.11r/oco2_LtSIF_220101_B11012Ar_220627180315s.nc4" )
   ...:     max_retry += 1
   ...:     print(max_retry)
1
2
3
4
Segmentation fault
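Because the crash kills the whole interpreter, it can be easier to run the loop in a child process and inspect how it exited. The following is a rough sketch using only the standard library; `run_isolated`, `died_from_segfault`, `last_printed_count` and `REPRO` are hypothetical names, and the signal check assumes a POSIX system such as the Linux and macOS machines described here.

```python
import signal
import subprocess
import sys

# The loop from the session above, as source text for a child interpreter.
REPRO = """
from datatree import open_datatree
for i in range(1, 16):
    open_datatree("./downloads/OCO2_L2_Lite_SIF.11r/oco2_LtSIF_220101_B11012Ar_220627180315s.nc4")
    print(i, flush=True)
"""


def run_isolated(snippet, timeout=300.0):
    """Run `snippet` in a fresh interpreter so a crash cannot kill this session."""
    return subprocess.run(
        [sys.executable, "-c", snippet],
        capture_output=True,
        text=True,
        timeout=timeout,
    )


def died_from_segfault(result):
    """On POSIX, a child killed by a signal reports returncode == -signum."""
    return result.returncode == -signal.SIGSEGV


def last_printed_count(result):
    """Return the last integer the child printed, or 0 if it printed none."""
    counts = [line for line in result.stdout.splitlines() if line.strip().isdigit()]
    return int(counts[-1]) if counts else 0
```

Calling `run_isolated(REPRO)` and then `died_from_segfault(...)` and `last_printed_count(...)` on the result would show whether the child segfaulted and how many opens completed first, which makes it practical to repeat the experiment across engines or library versions.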

In a Docker container we are running netcdf-c library version 4.8.1 and building the netCDF4 Python library from source. On my local machine I am running netcdf-c library version 4.9.3-development. I get the segmentation faults on both machines.

Data source

Here is the granule data download link from our online archive. It has non-numerical datatypes, specifically string and datetime types.

Granule tree structure:

In [4]: open_datatree(
   ...:     "./downloads/OCO2_L2_Lite_SIF.11r/oco2_LtSIF_220101_B11012Ar_220627180315s.nc4"
   ...: )
Out[4]: 
DataTree('None', parent=None)
│   Dimensions:                (sounding_dim: 188677, vertex_dim: 4)
│   Dimensions without coordinates: sounding_dim, vertex_dim
│   Data variables: (12/15)
│       Delta_Time             (sounding_dim) float64 2MB ...
│       SZA                    (sounding_dim) float32 755kB ...
│       VZA                    (sounding_dim) float32 755kB ...
│       SAz                    (sounding_dim) float32 755kB ...
│       VAz                    (sounding_dim) float32 755kB ...
│       Longitude              (sounding_dim) float32 755kB ...
│       ...                     ...
│       SIF_740nm              (sounding_dim) float32 755kB ...
│       SIF_Uncertainty_740nm  (sounding_dim) float32 755kB ...
│       Daily_SIF_740nm        (sounding_dim) float32 755kB ...
│       Daily_SIF_757nm        (sounding_dim) float32 755kB ...
│       Daily_SIF_771nm        (sounding_dim) float32 755kB ...
│       Quality_Flag           (sounding_dim) float64 2MB ...
│   Attributes: (12/32)
│       References:                        ['Sun, Y. et al., Remote Sensing of En...
│       conventions:                       CF-1.6
│       product_version:                   B11012Ar
│       summary:                           Fraunhofer-line based SIF retrievals
│       keywords:                          ISS, OCO-2, Solar Induced Fluorescence...
│       keywords_vocabulary:               NASA Global Change Master Directory (G...
│       ...                                ...
│       InputBuildId:                      B11.0.06
│       InputPointers:                     oco2_L2MetGL_39883a_211231_B11006r_220...
│       CoordSysBuilder:                   ucar.nc2.dataset.conv.CF1Convention
│       identifier_product_doi_authority:  http://dx.doi.org/
│       gesdisc_collection:                11r
│       identifier_product_doi:            10.5067/OTRE7KQS8AU8
├── DataTree('Cloud')
│       Dimensions:             (sounding_dim: 188677)
│       Dimensions without coordinates: sounding_dim
│       Data variables:
│           surface_albedo_abp  (sounding_dim) float32 755kB ...
│           cloud_flag_abp      (sounding_dim) float64 2MB ...
│           delta_pressure_abp  (sounding_dim) float32 755kB ...
│           co2_ratio           (sounding_dim) float32 755kB ...
│           o2_ratio            (sounding_dim) float32 755kB ...
├── DataTree('Geolocation')
│       Dimensions:                       (sounding_dim: 188677, vertex_dim: 4)
│       Dimensions without coordinates: sounding_dim, vertex_dim
│       Data variables:
│           time_tai93                    (sounding_dim) datetime64[ns] 2MB ...
│           solar_zenith_angle            (sounding_dim) float32 755kB ...
│           solar_azimuth_angle           (sounding_dim) float32 755kB ...
│           sensor_zenith_angle           (sounding_dim) float32 755kB ...
│           sensor_azimuth_angle          (sounding_dim) float32 755kB ...
│           altitude                      (sounding_dim) float32 755kB ...
│           longitude                     (sounding_dim) float32 755kB ...
│           latitude                      (sounding_dim) float32 755kB ...
│           footprint_longitude_vertices  (sounding_dim, vertex_dim) float32 3MB ...
│           footprint_latitude_vertices   (sounding_dim, vertex_dim) float32 3MB ...
├── DataTree('Metadata')
│       Dimensions:          (sounding_dim: 188677)
│       Dimensions without coordinates: sounding_dim
│       Data variables:
│           CollectionLabel  <U17 68B ...
│           BuildId          <U8 32B ...
│           OrbitId          (sounding_dim) float64 2MB ...
│           SoundingId       (sounding_dim) float64 2MB ...
│           FootprintId      (sounding_dim) float64 2MB ...
│           MeasurementMode  (sounding_dim) float64 2MB ...
├── DataTree('Meteo')
│       Dimensions:                 (sounding_dim: 188677)
│       Dimensions without coordinates: sounding_dim
│       Data variables:
│           surface_pressure        (sounding_dim) float32 755kB ...
│           specific_humidity       (sounding_dim) float32 755kB ...
│           vapor_pressure_deficit  (sounding_dim) float32 755kB ...
│           temperature_skin        (sounding_dim) float32 755kB ...
│           temperature_two_meter   (sounding_dim) float32 755kB ...
│           wind_speed              (sounding_dim) float32 755kB ...
├── DataTree('Offset')
│       Dimensions:                    (signalbin_dim: 227, footprint_dim: 8,
│                                       statistics_dim: 2)
│       Dimensions without coordinates: signalbin_dim, footprint_dim, statistics_dim
│       Data variables: (12/13)
│           signal_histogram_bins      (signalbin_dim) float32 908B ...
│           signal_histogram_757nm     (signalbin_dim, footprint_dim) float64 15kB ...
│           signal_histogram_771nm     (signalbin_dim, footprint_dim) float64 15kB ...
│           SIF_Relative_Mean_757nm    (signalbin_dim, footprint_dim, statistics_dim) float32 15kB ...
│           SIF_Mean_757nm             (signalbin_dim, footprint_dim, statistics_dim) float32 15kB ...
│           SIF_Relative_Median_757nm  (signalbin_dim, footprint_dim, statistics_dim) float32 15kB ...
│           ...                         ...
│           SIF_Relative_SDev_757nm    (signalbin_dim, footprint_dim, statistics_dim) float32 15kB ...
│           SIF_Relative_Mean_771nm    (signalbin_dim, footprint_dim, statistics_dim) float32 15kB ...
│           SIF_Mean_771nm             (signalbin_dim, footprint_dim, statistics_dim) float32 15kB ...
│           SIF_Relative_Median_771nm  (signalbin_dim, footprint_dim, statistics_dim) float32 15kB ...
│           SIF_Median_771nm           (signalbin_dim, footprint_dim, statistics_dim) float32 15kB ...
│           SIF_Relative_SDev_771nm    (signalbin_dim, footprint_dim, statistics_dim) float32 15kB ...
├── DataTree('Science')
│       Dimensions:                        (sounding_dim: 188677)
│       Dimensions without coordinates: sounding_dim
│       Data variables: (12/16)
│           sounding_qual_flag             (sounding_dim) float64 2MB ...
│           IGBP_index                     (sounding_dim) float64 2MB ...
│           continuum_radiance_757nm       (sounding_dim) float32 755kB ...
│           SIF_757nm                      (sounding_dim) float32 755kB ...
│           SIF_Unadjusted_757nm           (sounding_dim) float32 755kB ...
│           SIF_Relative_757nm             (sounding_dim) float32 755kB ...
│           ...                             ...
│           SIF_Unadjusted_771nm           (sounding_dim) float32 755kB ...
│           SIF_Relative_771nm             (sounding_dim) float32 755kB ...
│           SIF_Unadjusted_Relative_771nm  (sounding_dim) float32 755kB ...
│           SIF_Uncertainty_771nm          (sounding_dim) float32 755kB ...
│           daily_correction_factor        (sounding_dim) float32 755kB ...
│           sounding_land_fraction         (sounding_dim) float32 755kB ...
└── DataTree('Sequences')
        Dimensions:         (sequences_dim: 0, sounding_dim: 188677)
        Dimensions without coordinates: sequences_dim, sounding_dim
        Data variables:
            SequencesName   (sequences_dim) <U1 0B ...
            SequencesId     (sequences_dim) <U1 0B ...
            SequencesMode   (sequences_dim) <U1 0B ...
            SequencesIndex  (sounding_dim) float64 2MB ...
            SegmentsIndex   (sounding_dim) float64 2MB ...

What did you expect to happen?

I expected open_datatree(engine='netCDF4') to return a DataTree object. Instead it segfaults.

Minimal Complete Verifiable Example

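# open_datatree comes from the standalone datatree package (see datatree/io.py in the log output)
from datatree import open_datatree

max_retry = 0
while max_retry < 15:
    # Re-open the same file each iteration; the crash appears after a few opens
    oco2_tree = open_datatree('./OCO2_L2_Lite_SIF.11r/oco2_LtSIF_220101_B11012Ar_220627180315s.nc4')
    max_retry += 1
    print(max_retry)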

MVCE confirmation

Relevant log output

platform linux -- Python 3.12.4, pytest-8.2.2, pluggy-1.5.0 -- /usr/local/bin/python
cachedir: .pytest_cache
rootdir: /usr/src/app
configfile: pyproject.toml
plugins: subtests-0.12.1, inline-snapshot-0.10.2, cov-5.0.0, anyio-4.4.0
collected 28 items                                                                                                         

tests/test_compare.py::test_smoke_test PASSED                                                                        [  3%]
tests/test_compare.py::test_class_auto_runs_one_test PASSED                                                          [  7%]
tests/test_compare.py::test_compare_global_attrs_keys_values PASSED                                                  [ 10%]
tests/test_compare.py::test_get_intersection Fatal Python error: Segmentation fault

Current thread 0x0000ffffb3b63020 (most recent call first):
  File "/usr/local/lib/python3.12/site-packages/xarray/backends/file_manager.py", line 217 in _acquire_with_cache_info
  File "/usr/local/lib/python3.12/site-packages/xarray/backends/file_manager.py", line 199 in acquire_context
  File "/usr/local/lib/python3.12/contextlib.py", line 137 in __enter__
  File "/usr/local/lib/python3.12/site-packages/xarray/backends/netCDF4_.py", line 412 in _acquire
  File "/usr/local/lib/python3.12/site-packages/xarray/backends/netCDF4_.py", line 418 in ds
  File "/usr/local/lib/python3.12/site-packages/xarray/backends/netCDF4_.py", line 356 in __init__
  File "/usr/local/lib/python3.12/site-packages/xarray/backends/netCDF4_.py", line 409 in open
  File "/usr/local/lib/python3.12/site-packages/xarray/backends/netCDF4_.py", line 646 in open_dataset
  File "/usr/local/lib/python3.12/site-packages/xarray/backends/api.py", line 571 in open_dataset
  File "/usr/local/lib/python3.12/site-packages/datatree/io.py", line 66 in _open_datatree_netcdf
  File "/usr/local/lib/python3.12/site-packages/datatree/io.py", line 58 in open_datatree
  File "/usr/src/app/regression_tests/compare.py", line 40 in to_xarray_datatree
  File "/usr/src/app/regression_tests/compare.py", line 29 in __init__
  File "/usr/src/app/tests/test_compare.py", line 34 in __init__
  File "/usr/src/app/tests/test_compare.py", line 113 in variable_comparison_class_test_data_a_b_fixture
  File "/usr/local/lib/python3.12/site-packages/_pytest/fixtures.py", line 880 in call_fixture_func
  File "/usr/local/lib/python3.12/site-packages/_pytest/fixtures.py", line 1125 in pytest_fixture_setup
  File "/usr/local/lib/python3.12/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/usr/local/lib/python3.12/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/usr/local/lib/python3.12/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/usr/local/lib/python3.12/site-packages/_pytest/fixtures.py", line 1076 in execute
  File "/usr/local/lib/python3.12/site-packages/_pytest/fixtures.py", line 606 in _get_active_fixturedef
  File "/usr/local/lib/python3.12/site-packages/_pytest/fixtures.py", line 521 in getfixturevalue
  File "/usr/local/lib/python3.12/site-packages/_pytest/fixtures.py", line 686 in _fillfixtures
  File "/usr/local/lib/python3.12/site-packages/_pytest/python.py", line 1635 in setup
  File "/usr/local/lib/python3.12/site-packages/_pytest/runner.py", line 514 in setup
  File "/usr/local/lib/python3.12/site-packages/_pytest/runner.py", line 159 in pytest_runtest_setup
  File "/usr/local/lib/python3.12/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/usr/local/lib/python3.12/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/usr/local/lib/python3.12/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/usr/local/lib/python3.12/site-packages/_pytest/runner.py", line 241 in <lambda>
  File "/usr/local/lib/python3.12/site-packages/_pytest/runner.py", line 341 in from_call
  File "/usr/local/lib/python3.12/site-packages/_pytest/runner.py", line 240 in call_and_report
  File "/usr/local/lib/python3.12/site-packages/_pytest/runner.py", line 129 in runtestprotocol
  File "/usr/local/lib/python3.12/site-packages/_pytest/runner.py", line 116 in pytest_runtest_protocol
  File "/usr/local/lib/python3.12/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/usr/local/lib/python3.12/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/usr/local/lib/python3.12/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/usr/local/lib/python3.12/site-packages/_pytest/main.py", line 364 in pytest_runtestloop
  File "/usr/local/lib/python3.12/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/usr/local/lib/python3.12/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/usr/local/lib/python3.12/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/usr/local/lib/python3.12/site-packages/_pytest/main.py", line 339 in _main
  File "/usr/local/lib/python3.12/site-packages/_pytest/main.py", line 285 in wrap_session
  File "/usr/local/lib/python3.12/site-packages/_pytest/main.py", line 332 in pytest_cmdline_main
  File "/usr/local/lib/python3.12/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/usr/local/lib/python3.12/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/usr/local/lib/python3.12/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/usr/local/lib/python3.12/site-packages/_pytest/config/__init__.py", line 178 in main
  File "/usr/local/lib/python3.12/site-packages/_pytest/config/__init__.py", line 206 in console_main
  File "/usr/local/bin/pytest", line 8 in <module>

Extension modules: charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, cftime._cftime, netCDF4._netCDF4 (total: 58)
Segmentation fault
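The dump above has the shape of output from Python's standard-library `faulthandler` module, which pytest enables by default as far as I know. Outside pytest, a similar traceback can be requested explicitly. This is a small sketch; `enable_crash_tracebacks` is a hypothetical helper name.

```python
import faulthandler
import sys


def enable_crash_tracebacks(stream=None):
    """Ask CPython to dump Python tracebacks on SIGSEGV and other fatal signals."""
    if stream is None:
        stream = sys.stderr
    # The stream needs a real file descriptor and must stay open while enabled.
    faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()
```

The same behaviour is available without code changes by running `python -X faulthandler script.py` or by setting the `PYTHONFAULTHANDLER` environment variable.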

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.12.4 (main, Jun 7 2024, 19:15:23) [GCC 12.2.0]
python-bits: 64
OS: Linux
OS-release: 6.6.16-linuxkit
machine: aarch64
processor:
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: ('C', 'UTF-8')
libhdf5: 1.10.8
libnetcdf: 4.8.1
xarray: 2024.5.0
pandas: 2.2.2
numpy: 1.26.4
scipy: None
netCDF4: 1.6.5
pydap: None
h5netcdf: 1.3.0
h5py: 3.11.0
zarr: None
cftime: 1.6.4
nc_time_axis: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 70.0.0
pip: 24.0
conda: None
pytest: 8.2.2
mypy: None
IPython: 8.0.1
sphinx: None

TomNicholas commented 3 months ago

That is absolutely gnarly, and surprising that you can use xr.open_dataset just fine seeing as the current implementation of open_datatree just calls xr.open_dataset(group=...) repeatedly.

Does this bug still happen if you check out the re-implemented version of open_datatree in https://github.com/pydata/xarray/pull/9014?

EDIT: I also wonder if perhaps this could be reproduced by calling xr.open_dataset on the same group many times? I'm struggling to see why the datatree component of this would be necessary to reproduce it.
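A sketch of what that experiment could look like, taking datatree out of the picture entirely. The function name `open_group_repeatedly` and its `opener` and `close` parameters are hypothetical; by default it calls `xr.open_dataset` and, like the loop in the report, never closes the datasets it opens. The engine is spelled "netcdf4" because that is the name xarray registers for the netCDF4-python backend.

```python
def open_group_repeatedly(path, group, n=15, engine="netcdf4", opener=None, close=False):
    """Open one group of `path` n times and return how many opens completed."""
    if opener is None:
        # Imported lazily; any callable with open_dataset's signature can be swapped in.
        import xarray as xr
        opener = xr.open_dataset

    completed = 0
    for _ in range(n):
        ds = opener(path, group=group, engine=engine)
        if close:
            # Optional: closing each handle changes file caching, which may matter here.
            ds.close()
        completed += 1
        print(completed, flush=True)
    return completed
```

For example, `open_group_repeatedly(path, "Sequences")` and `open_group_repeatedly(path, "Science")` would compare a group with string variables against one without, and passing `engine="h5netcdf"` would repeat the test on the other backend.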

lsterzinger commented 3 months ago

Hey Tom,

I actually just replicated this in our docker image with netCDF4.Dataset() by specifying group='Sequences', but this only happened once and I'm unable to replicate again.

The docker image uses the python 3.12 base image. Eni and I both have M1 Macs so we use the linux/arm64 build of this image. I'm going to try replicating this on the linux/amd64 base image and see what happens.

On my own machine (M1 Mac) with netcdf-c 4.8.2, I do indeed replicate the segmentation fault by repeatedly looping both xr.open_dataset() and netCDF4.Dataset(). But this happens regardless of whether I choose the group with the string-type variables in it (Sequences) or another group that doesn't have any (Science).

Worth mentioning that we are building netcdf-c from source in our docker image - due to a persistent issue with NASA Earthdata Login requiring a specific version of netcdf-c not in the linux repo for our image. But no special options are given to the configuration of that.

I agree with Eni that this seems to be something to do with netcdf-c - likely some built-in caching that we don't have an interface for. I don't see any of these issues when reading the same dataset with the hdf5 library.

eni-awowale commented 3 months ago

@TomNicholas I was able to replicate this issue with nc4.open_dataset(). It failed after the third retry. I will edit the title since this is not isolated to open_datatree()

TomNicholas commented 3 months ago

Okay thanks both - so I will close this as an upstream issue then?

eni-awowale commented 3 months ago

Thanks Tom! Sounds good. I might open the issue directly in the netCDF4 python library.