openPMD / openPMD-validator


HDF5 validator incorrectly handles attributes with variable-length string arrays #61

Open agolovanov opened 3 years ago

agolovanov commented 3 years ago

HDF5 supports two ways of storing an array of strings: fixed-length and variable-length.

openPMD stores some attributes as arrays of strings, for example axisLabels. When a fixed-length array is used,

// h5dump output
ATTRIBUTE "axisLabels" {
    DATATYPE  H5T_STRING {
        STRSIZE 2;
        STRPAD H5T_STR_NULLTERM;
        CSET H5T_CSET_ASCII;
        CTYPE H5T_C_S1;
    }
    DATASPACE  SIMPLE { ( 2 ) / ( 2 ) }
    DATA {
    (0): "x", "z"
    }
}

openPMD-validator considers that a valid attribute. However, when a variable-length array is used,

ATTRIBUTE "axisLabels" {
    DATATYPE  H5T_STRING {
        STRSIZE H5T_VARIABLE;
        STRPAD H5T_STR_NULLTERM;
        CSET H5T_CSET_ASCII;
        CTYPE H5T_C_S1;
    }
    DATASPACE  SIMPLE { ( 2 ) / ( 2 ) }
    DATA {
    (0): "x", "z"
    }
}

openPMD-validator fails with the following error message:

Error: Attribute axisLabels in `/data/0/meshes/inv` is not of type ndarray of '<map object at 0x7fe5256acbb0>' (is ndarray of 'object_')!

Variable-length string arrays are a legitimate feature of the HDF5 data format, and the openPMD standard does not explicitly ban them: it only states that axisLabels should be a "1-dimensional array containing N (string) elements", which is satisfied in both cases. I therefore believe that using variable-length strings does not violate the openPMD standard, and that openPMD-validator should not fail in this case.

This probably happens because h5py internally represents variable-length string arrays as np.ndarray with dtype=object rather than a numpy string dtype (see https://docs.h5py.org/en/stable/special.html). Instead of checking arr.dtype.type (which is np.object_ for variable-length arrays), the validator should call the h5py.check_string_dtype(arr.dtype) function, which recognizes both fixed- and variable-length string arrays.
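
A minimal sketch of the suggested check, not the validator's actual code. It writes axisLabels once as a fixed-length and once as a variable-length string array into an in-memory HDF5 file, reads each back, and tests the result with h5py.check_string_dtype. The helper names is_string_array and roundtrip_axis_labels and the file name are hypothetical, chosen only for illustration. It assumes h5py 2.10 or newer, where h5py.string_dtype and h5py.check_string_dtype are available.

```python
import h5py
import numpy as np


def is_string_array(arr):
    """Hypothetical helper: True if arr is an ndarray holding HDF5 strings.

    h5py.check_string_dtype returns a string_info tuple for fixed- and
    variable-length string dtypes, and None for any other dtype.
    """
    return isinstance(arr, np.ndarray) and h5py.check_string_dtype(arr.dtype) is not None


def roundtrip_axis_labels(variable_length):
    """Hypothetical helper: write axisLabels to an in-memory file and read it back."""
    # driver="core" with backing_store=False keeps the file in memory only.
    with h5py.File("axis_labels_demo.h5", "w", driver="core", backing_store=False) as f:
        mesh = f.create_group("data/0/meshes/inv")
        if variable_length:
            # Variable-length strings (STRSIZE H5T_VARIABLE in h5dump).
            mesh.attrs.create(
                "axisLabels",
                data=np.array(["x", "z"], dtype=object),
                dtype=h5py.string_dtype(encoding="ascii"),
            )
        else:
            # Fixed-length strings of 2 bytes (STRSIZE 2 in h5dump).
            mesh.attrs["axisLabels"] = np.array([b"x", b"z"], dtype="S2")
        return mesh.attrs["axisLabels"]


fixed = roundtrip_axis_labels(variable_length=False)
vlen = roundtrip_axis_labels(variable_length=True)

# dtype.type differs between the two storage modes ...
print(fixed.dtype.type, vlen.dtype.type)
# ... while the check_string_dtype-based test accepts both.
print(is_string_array(fixed), is_string_array(vlen))
```

With a check of this kind, the dtype comparison no longer depends on how the strings are stored, while non-string arrays such as np.array([1.0, 2.0]) are still rejected.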

Attached are two example output files, one using fixed-length and one using variable-length strings for axisLabels: examples.zip