HDF5 files with multiple subdatasets cause a segmentation fault #2855

Closed. scottstanie closed this issue 1 year ago.

scottstanie commented 1 year ago

Expected behavior and actual behavior.

Using rasterio to read from or inspect an HDF5 file with more than one subdataset causes a segmentation fault. This does not happen when the file has only one subdataset.

Steps to reproduce the problem.

import h5py
import numpy as np
import rasterio as rio

data = np.random.rand(3, 3)
x = y = np.arange(3)

with h5py.File("test_rio.h5", "w") as hf:
    hf['x'] = x
    hf['y'] = y
    hf['x'].make_scale()
    hf['y'].make_scale()

    hf['data'] = data
    hf['data'].dims[0].attach_scale(hf['y'])
    hf['data'].dims[1].attach_scale(hf['x'])

with rio.open("test_rio.h5") as src:
    print(src.profile)
    print("Okay for single dataset")

with h5py.File("test_rio.h5", "a") as hf:
    hf['data2'] = data
    hf['data2'].dims[0].attach_scale(hf['y'])
    hf['data2'].dims[1].attach_scale(hf['x'])

with rio.open("test_rio.h5") as src:
    print(src.profile)
    print("Okay for 2 datasets?")

When I run this, I get

$ python test_rio.py
/Users/staniewi/miniconda3/envs/mapping/lib/python3.10/site-packages/rasterio/__init__.py:304: NotGeoreferencedWarning: Dataset has no geotransform, gcps, or rpcs. The identity matrix will be returned.
  dataset = DatasetReader(path, driver=driver, sharing=sharing, **kwargs)
{'driver': 'HDF5Image', 'dtype': 'float64', 'nodata': None, 'width': 3, 'height': 3, 'count': 1, 'crs': None, 'transform': Affine(1.0, 0.0, 0.0,
       0.0, 1.0, 0.0), 'blockysize': 1, 'tiled': False}
Okay for single dataset
Segmentation fault: 11

But it's readable by GDAL:

$ gdalinfo test_rio.h5
Driver: HDF5/Hierarchical Data Format Release 5
Files: test_rio.h5
Size is 512, 512
Metadata:
  x_CLASS=DIMENSION_SCALE
  x_NAME=
  x_REFERENCE_LIST=
  y_CLASS=DIMENSION_SCALE
  y_NAME=
  y_REFERENCE_LIST=
Subdatasets:
  SUBDATASET_1_NAME=HDF5:"test_rio.h5"://data
  SUBDATASET_1_DESC=[3x3] //data (64-bit floating-point)
  SUBDATASET_2_NAME=HDF5:"test_rio.h5"://data2
  SUBDATASET_2_DESC=[3x3] //data2 (64-bit floating-point)
Corner Coordinates:
Upper Left  (    0.0,    0.0)
Lower Left  (    0.0,  512.0)
Upper Right (  512.0,    0.0)
Lower Right (  512.0,  512.0)
Center      (  256.0,  256.0)
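
For what it's worth, the same subdataset listing is also reachable from the GDAL Python bindings; a minimal sketch, assuming the osgeo package is available in the same environment:

from osgeo import gdal

gdal.UseExceptions()  # raise instead of returning None on open failure

# GetSubDatasets() returns (name, description) pairs, matching the
# "Subdatasets:" entries in the gdalinfo output above.
ds = gdal.Open("test_rio.h5")
for name, desc in ds.GetSubDatasets():
    print(name, desc)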

Environment Information

$ rio --show-versions
rasterio info:
  rasterio: 1.3.6
      GDAL: 3.6.2
      PROJ: 9.1.1
      GEOS: 3.11.1
 PROJ DATA: /u/aurora-r0/staniewi/miniconda3/envs/mapping/share/proj
 GDAL DATA: /u/aurora-r0/staniewi/miniconda3/envs/mapping/share/gdal

System:
    python: 3.10.10 | packaged by conda-forge | (main, Mar 24 2023, 20:08:06) [GCC 11.3.0]
executable: /u/aurora-r0/staniewi/miniconda3/envs/mapping/bin/python
   machine: Linux-3.10.0-1160.59.1.el7.x86_64-x86_64-with-glibc2.17

Python deps:
    affine: 2.4.0
     attrs: 22.2.0
   certifi: 2023.05.07
     click: 8.1.3
     cligj: 0.7.2
    cython: 0.29.33
     numpy: 1.23.5
    snuggs: 1.4.7
click-plugins: None
setuptools: 67.4.0

Installation Method

conda-forge

snowman2 commented 1 year ago

Are you able to provide the input files?

scottstanie commented 1 year ago

Sorry, I didn't comment the sample code well. The first block creates the dummy 3x3 input file:

import h5py
import numpy as np

data = np.random.rand(3, 3)
x = y = np.arange(3)

with h5py.File("test_rio.h5", "w") as hf:
    # Coordinate arrays, registered as HDF5 dimension scales
    hf['x'] = x
    hf['y'] = y
    hf['x'].make_scale()
    hf['y'].make_scale()

    # The single 3x3 raster subdataset, with the scales attached
    hf['data'] = data
    hf['data'].dims[0].attach_scale(hf['y'])
    hf['data'].dims[1].attach_scale(hf['x'])
snowman2 commented 1 year ago

The first block creates the dummy 3x3 input file

Thanks for pointing that out; I should have looked more closely when reading the issue.

sgillies commented 1 year ago

@scottstanie if you upgrade to GDAL 3.6.4 or 3.7.0 and rasterio 1.3.7 do you still see a segfault?

sgillies commented 1 year ago

Confirmed with GDAL 3.6.3 and rasterio 1.3.7

(gdb) run issue2855.py 
Starting program: /venv/bin/python issue2855.py
warning: Error disabling address space randomization: Operation not permitted
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7f65b21ff640 (LWP 39)]
[New Thread 0x7f65b19fe640 (LWP 40)]
[New Thread 0x7f65af1fd640 (LWP 41)]
[New Thread 0x7f65aa9fc640 (LWP 42)]
[New Thread 0x7f65a81fb640 (LWP 43)]
[New Thread 0x7f65a79fa640 (LWP 44)]
[New Thread 0x7f65a31f9640 (LWP 45)]
[New Thread 0x7f6590fff640 (LWP 46)]
/app/rasterio/__init__.py:304: NotGeoreferencedWarning: Dataset has no geotransform, gcps, or rpcs. The identity matrix will be returned.
  dataset = DatasetReader(path, driver=driver, sharing=sharing, **kwargs)
{'driver': 'HDF5Image', 'dtype': 'float64', 'nodata': None, 'width': 3, 'height': 3, 'count': 1, 'crs': None, 'transform': Affine(1.0, 0.0, 0.0,
       0.0, 1.0, 0.0), 'blockysize': 1, 'tiled': False}
Okay for single dataset

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007f659184353f in __Pyx_GetItemInt_Fast (o=0x7f658bd54240, i=0, wraparound=<optimized out>, boundscheck=0, is_list=0) at rasterio/_base.c:29957
29957               PyObject *r = PyList_GET_ITEM(o, n);
(gdb) bt
#0  0x00007f659184353f in __Pyx_GetItemInt_Fast (o=0x7f658bd54240, i=0, wraparound=<optimized out>, 
    boundscheck=0, is_list=0) at rasterio/_base.c:29957
#1  0x00007f659187193a in __pyx_pf_8rasterio_5_base_11DatasetBase_7profile___get__ (
    __pyx_v_self=0x7f658bd2c240) at rasterio/_base.c:16267
#2  __pyx_pw_8rasterio_5_base_11DatasetBase_7profile_1__get__ (__pyx_v_self=0x7f658bd2c240)
    at rasterio/_base.c:16108
#3  __pyx_getprop_8rasterio_5_base_11DatasetBase_profile (o=0x7f658bd2c240, x=<optimized out>)
    at rasterio/_base.c:25544
#4  0x0000558954cd64c8 in _PyObject_GenericGetAttrWithDict ()
#5  0x0000558954cd4a1b in PyObject_GetAttr ()
#6  0x0000558954cc6485 in _PyEval_EvalFrameDefault ()
#7  0x0000558954cbd176 in ?? ()
#8  0x0000558954db2c56 in PyEval_EvalCode ()
#9  0x0000558954ddfb18 in ?? ()
#10 0x0000558954dd896b in ?? ()
#11 0x0000558954ddf865 in ?? ()
#12 0x0000558954dded48 in _PyRun_SimpleFileObject ()
#13 0x0000558954ddea43 in _PyRun_AnyFileObject ()
#14 0x0000558954dcfc3e in Py_RunMain ()
#15 0x0000558954da5bcd in Py_BytesMain ()
#16 0x00007f65b5b47d90 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#17 0x00007f65b5b47e40 in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#18 0x0000558954da5ac5 in _start ()
sgillies commented 1 year ago

If I turn Cython's bounds checking back on, I see no crash, just an exception (a sketch of that kind of debug rebuild follows the session below).

(gdb) run issue2855.py
Starting program: /venv/bin/python issue2855.py
warning: Error disabling address space randomization: Operation not permitted
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fe3a59ff640 (LWP 50)]
[New Thread 0x7fe3a31fe640 (LWP 51)]
[New Thread 0x7fe3a29fd640 (LWP 52)]
[New Thread 0x7fe39e1fc640 (LWP 53)]
[New Thread 0x7fe39b9fb640 (LWP 54)]
[New Thread 0x7fe3991fa640 (LWP 55)]
[New Thread 0x7fe3989f9640 (LWP 56)]
[New Thread 0x7fe3847ff640 (LWP 57)]
/app/rasterio/__init__.py:304: NotGeoreferencedWarning: Dataset has no geotransform, gcps, or rpcs. The identity matrix will be returned.
  dataset = DatasetReader(path, driver=driver, sharing=sharing, **kwargs)
{'driver': 'HDF5Image', 'dtype': 'float64', 'nodata': None, 'width': 3, 'height': 3, 'count': 1, 'crs': None, 'transform': Affine(1.0, 0.0, 0.0,
       0.0, 1.0, 0.0), 'blockysize': 1, 'tiled': False}
Okay for single dataset
Traceback (most recent call last):
  File "/app/issue2855.py", line 29, in <module>
    print(src.profile)
  File "rasterio/_base.pyx", line 1035, in rasterio._base.DatasetBase.profile.__get__
    m.update(blockysize=self.block_shapes[0][0], tiled=False)
IndexError: list index out of range
[Thread 0x7fe3991fa640 (LWP 55) exited]
[Thread 0x7fe39b9fb640 (LWP 54) exited]
[Thread 0x7fe3a29fd640 (LWP 52) exited]
[Thread 0x7fe3989f9640 (LWP 56) exited]
[Thread 0x7fe39e1fc640 (LWP 53) exited]
[Thread 0x7fe3a31fe640 (LWP 51) exited]
[Thread 0x7fe3a59ff640 (LWP 50) exited]
[Thread 0x7fe3a92f91c0 (LWP 47) exited]
[Thread 0x7fe3847ff640 (LWP 57) exited]
[New process 47]
[Inferior 1 (process 47) exited with code 01]
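
For anyone who wants to reproduce the bounds-checked build: the general technique is to re-cythonize the extension module with the boundscheck directive turned on. A minimal sketch of that approach only, not rasterio's actual setup.py; the package name here is illustrative and the GDAL include/library settings a real build needs are omitted:

from setuptools import setup
from Cython.Build import cythonize

# Rebuild the module that contains the profile getter with bounds checking
# enabled, so an out-of-range index raises IndexError instead of reading
# past the end of the list and segfaulting.
setup(
    name="rasterio-debug-build",  # illustrative name only
    ext_modules=cythonize(
        ["rasterio/_base.pyx"],
        compiler_directives={
            "boundscheck": True,  # the directive the crash/exception difference hinges on
            "wraparound": True,
        },
    ),
)
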
sgillies commented 1 year ago

In the two-dataset case, src.block_shapes is []. GDAL is wrong at the top level, too: the reported "Size is 512, 512" isn't correct.

This usage is fine:

with rio.open("test_rio.h5") as src:
    for sub in src.subdatasets:
        with rio.open(sub) as dst:
            print(dst.profile)

I'm not sure what src.profile or src.read() should even be in this case. Maybe it's better if rasterio.open() refuses to open a multi-dataset HDF5 file and we provide another utility to get subdataset names?
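
Something like this, for illustration (a sketch against the public API; list_h5_subdatasets is a made-up name, not an existing rasterio function):

import rasterio as rio

def list_h5_subdatasets(path):
    """Sketch of a possible utility: return the subdataset names of a
    container file such as a multi-dataset HDF5, without touching the
    top-level profile (which is what crashes above)."""
    with rio.open(path) as src:
        return list(src.subdatasets)

# Callers would then open each subdataset explicitly:
for name in list_h5_subdatasets("test_rio.h5"):
    with rio.open(name) as sub:
        print(name, sub.profile)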

scottstanie commented 1 year ago

@scottstanie if you upgrade to GDAL 3.6.4 or 3.7.0 and rasterio 1.3.7 do you still see a segfault?

Yep, looks like the same segfault:

$ mamba create  -n testrio rasterio=1.3.7
$ conda activate testrio
(testrio) $ rio --show-versions
rasterio info:
  rasterio: 1.3.7
      GDAL: 3.7.0
      PROJ: 9.2.1
      GEOS: 3.11.2
      ....

(testrio) $ rio info test_rio.h5
/Users/staniewi/miniconda3/envs/testrio/lib/python3.11/site-packages/rasterio/__init__.py:304: NotGeoreferencedWarning: Dataset has no geotransform, gcps, or rpcs. The identity matrix will be returned.
  dataset = DatasetReader(path, driver=driver, sharing=sharing, **kwargs)
Segmentation fault: 11

And the wrong block shape is interesting; I had wondered in the past why it shows a default (512, 512) block size even when none of the subdatasets are that size.

sgillies commented 1 year ago

We'll have a fix in 1.3.8, @scottstanie.