nexusformat / definitions

Definitions of the NeXus Standard File Structure and Contents
https://manual.nexusformat.org/

How to store NX_CHAR in HDF5? #1308

Open sanbrock opened 1 year ago

sanbrock commented 1 year ago

Documentation says:
The preferred string representation is UTF-8. Both fixed-length strings and variable-length strings are valid.

Please note that HDF5 offers 12 ways of storing strings. The default (and sometimes the only available) implementation for reading and writing strings differs between programming languages (C, MATLAB, Python, etc.) and between platforms. As a result, data stored by one application may not be readable by another. The only solution is for every application to implement a complex reading function that checks how the data is stored and configures the HDF5 library accordingly.
Most applications do not do that, so data exchange via NeXus files fails in practice and exhibits interoperability problems. It seems that the HDF5 backend requires a NAPI implementation to ensure interoperability between applications.
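
For orientation, the figure of 12 presumably arises from combining three independent properties of an HDF5 string datatype: length (fixed or variable), character set (ASCII or UTF-8), and padding (null-terminated, null-padded, or space-padded). That reading is an assumption, since the post does not list the cases. The sketch below only enumerates the combinations, with labels borrowed from the HDF5 C constants.

```python
from itertools import product

# Three independent properties of an HDF5 string datatype. The labels mirror
# HDF5 C constants; grouping them into "12 ways" is an assumed reading of the
# post, not something the post spells out.
LENGTHS = ("fixed", "variable")  # explicit byte size vs. H5T_VARIABLE
CHARSETS = ("H5T_CSET_ASCII", "H5T_CSET_UTF8")
PADDINGS = ("H5T_STR_NULLTERM", "H5T_STR_NULLPAD", "H5T_STR_SPACEPAD")

# Every combination of length kind, character set, and padding.
STRING_VARIANTS = list(product(LENGTHS, CHARSETS, PADDINGS))

for length, charset, padding in STRING_VARIANTS:
    print(f"{length:8s}  {charset:14s}  {padding}")
```

A reader that wants to accept any conforming file has to cope with each of these rows, which is the burden the post describes.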

Alternatively, NX_CHAR could be specified more precisely: which of the 12 string storage methods shall be supported by all applications reading NeXus files?

Requiring a specific type of termination could greatly improve the situation. Even if it is decided not to set any further restriction, the NeXus documentation should raise awareness of the problem and guide programmers on how to properly fetch NX_CHAR from an HDF5 file using different programming languages.
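
To illustrate why the termination convention matters, here is a minimal pure-Python sketch of a fixed-size string field under the three padding conventions. The helper names `pad_fixed` and `unpad_fixed` and the mode labels are hypothetical and only model the byte layout; they are not part of HDF5 or any NeXus API.

```python
MODES = ("nullterm", "nullpad", "spacepad")  # hypothetical labels


def pad_fixed(text, size, mode):
    """Encode text into a fixed-size byte field using one padding convention."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    raw = text.encode("utf-8")
    fill = b" " if mode == "spacepad" else b"\x00"
    # A null-terminated field must keep at least one byte for the terminator.
    capacity = size - 1 if mode == "nullterm" else size
    if len(raw) > capacity:
        raise ValueError("text does not fit into the field")
    return raw + fill * (size - len(raw))


def unpad_fixed(raw, mode):
    """Decode a fixed-size field, assuming the given padding convention."""
    if mode == "nullterm":
        raw = raw.split(b"\x00", 1)[0]  # stop at the first NUL
    elif mode == "nullpad":
        raw = raw.rstrip(b"\x00")       # drop trailing NULs only
    elif mode == "spacepad":
        raw = raw.rstrip(b" ")          # drop trailing spaces only
    else:
        raise ValueError(f"unknown mode: {mode}")
    return raw.decode("utf-8")


field = pad_fixed("NeXus", 8, "spacepad")
right = unpad_fixed(field, "spacepad")  # reader knows the convention
wrong = unpad_fixed(field, "nullpad")   # reader guesses the wrong one
print(repr(right), repr(wrong))
```

A reader that assumes the wrong convention silently returns padding as part of the value, which is why fixing one convention in the specification would remove a whole class of mismatches.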

Note that the other 2 properties are already specified:

rayosborn commented 1 year ago

Thanks for raising this, @sanbrock. I was uncomfortable with the decision to stop supporting the NAPI for just this reason, i.e., that the standard was being undermined by the flexibility with which objects are stored. Strings probably create the most problems, but there are other issues, e.g., with scalars vs size-1 arrays (see #881). This is a major reason why I continue to work on the NeXpy and nexusformat packages, which attempt to make sensible decisions about how to read the many ways that people can write standard-conforming NeXus files. I have argued in the past that the NIAC should designate the nexusformat package as official, but there has always been strong push-back from those who prefer to use h5py directly. For experienced users, that's fine, but it can cause the kinds of problems you are describing, with the potential to put off newcomers.

rayosborn commented 1 year ago

I'm not sure from your initial post if you are arguing for a revival of the NAPI, or just a tightening of the NX_CHAR specification. I suspect that the NIAC will be reluctant to reopen the subject of the NAPI, since it's not that long since we officially deprecated it, so the second option seems more likely.

However, if you are proposing a revival of the NAPI, I would like to repeat a point that I made to the NeXus Mailing List last year, when Alba raised the idea of supporting underlying formats other than HDF5.

The reason the API became deprecated is, in my view, that we have depended too much on one or two people (i.e., Mark in the case of the C-API) to take all the responsibility for maintaining it, so that supporting new HDF5 features, such as virtual data sets or even variable-length-strings, fell entirely on them. While an increasing number of facilities have become dependent on NeXus, very few of them, if any, have committed resources to maintaining it, so it has become an activity that many of us squeeze in when we have free time from our other responsibilities. If facilities like Alba, and presumably others, feel that they are being held back by technical limitations of HDF5, then it might be necessary to rethink the NeXus support model, so that API development is revived by being integrated into other facility software development projects, such as Mantid or Dials. It would probably save the facilities money in the long term, but it would require NeXus to be less of a part-time activity. Or we need to encourage more people to contribute, in the way the more successful open-source projects are run.

I still think that it would be cost-effective if NIAC members with the right coding expertise felt able to ask their facility managers if they could spend some of their effort on NAPI, much as many people contributing to other open-source projects have managed to do.

mkuehbach commented 1 year ago

I can contribute Python h5py example code that tests reading all cases, i.e., getting Python strings or lists of Python strings out, including all possible variants of writing with h5py. I also have C example code to identify the string formatting, based on which a case-dependent read and write function can be implemented for all variants (I am currently working on the C part). My suggestion is to make these code snippets available, to begin with, as a small HDF5 utility package for Python and C, and maybe MATLAB (for which I only have bits and pieces, and which is not a priority). I will discuss this with @sanbrock.
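
As a rough idea of what such a test could look like (this is not the code offered above, only a sketch assuming h5py 3.x and NumPy are installed), the following writes one value in three string layouts to an in-memory file and reads each back as a Python str. The helper names `string_layout` and `read_scalar_string` and the dataset names are made up for illustration; padding conventions are not covered.

```python
import h5py
import numpy as np


def string_layout(dataset):
    """Return (encoding, length) of a string dataset, or None otherwise.

    A length of None means variable-length storage.
    """
    info = h5py.check_string_dtype(dataset.dtype)
    if info is None:
        return None
    return info.encoding, info.length


def read_scalar_string(dataset):
    """Read a scalar string dataset as a Python str, whatever its layout."""
    if string_layout(dataset) is None:
        raise TypeError(f"{dataset.name} does not hold string data")
    raw = dataset[()]
    # h5py 3 returns bytes for fixed- and variable-length strings alike;
    # decoding as UTF-8 also covers ASCII, which is a subset of it.
    if isinstance(raw, bytes):
        return raw.decode("utf-8")
    return str(raw)


# The "core" driver without a backing store keeps the file in memory only.
with h5py.File("nx_char_demo.h5", "w", driver="core", backing_store=False) as f:
    f.create_dataset("vlen_utf8", data="Ångström",
                     dtype=h5py.string_dtype(encoding="utf-8"))
    f.create_dataset("fixed_utf8", data="Ångström".encode("utf-8"),
                     dtype=h5py.string_dtype(encoding="utf-8", length=12))
    f.create_dataset("fixed_ascii", data=np.array(b"sample", dtype="S10"))

    results = {name: (string_layout(f[name]), read_scalar_string(f[name]))
               for name in ("vlen_utf8", "fixed_utf8", "fixed_ascii")}

for name, (layout, value) in results.items():
    print(name, layout, repr(value))
```

A fuller test suite would also cover arrays of strings and files written by other languages, which is where the real interoperability problems tend to show up.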

Tightening of NX_CHAR is needed, not only for HDF5 but in general; consider the entire discussion about characters, strings, and glyphs (ISO 10646), as well as the point below. At the very least, guidelines should be given for the key programming languages and for the formats in which NeXus content is written. HDF5 is the main one, but there are alternatives on the horizon, such as Zarr, which is gaining traction in the bio/omics/microscopy community.

Another subtlety, specific to NX_CHAR used within HDF5, is the following. We should also check whether it is properly specified which subset of characters (or, to be more specific, glyphs, and according to which standard) may be used to spell out the names of node instances in our NeXus graphs. This is relevant because HDF5 de facto allows the character set encoding to vary between UTF-8 and ASCII for link names as well. Since every attribute, field, or group instance in HDF5 is ultimately mapped to a link by the library, it is currently possible for an instance name, e.g., sample(NXsample), to use the UTF-8 encoding. While we have a strong bias towards English, and thus use a few basic Latin characters most of the time, nothing currently guides people on whether or not to use other glyphs, say Chinese ones, for naming groups. My suggestion is to formulate a constraint on the allowed glyphs from which instance-name glyph arrays (aka "strings") must be composed, again using a rigorous set-theoretical approach rather than an intuitive notion of what an NX_CHAR is. If we can be explicit, I suggest it is best to be explicit, to make life easier for early NeXus adopters: pick a standard (e.g., ASCII, UTF-32, or UCS/Unicode) and pick the subset of allowed glyphs. An example could be [A-Za-z0-9] \in UCS, which is coincidentally a subset of ASCII but does not include all ASCII glyphs. Currently, NeXus instance names use this glyph subset. A clear strategy is especially relevant because XML and YAML may also support different character encodings, which raises the question of how to ensure that the same encoding rules are used when converting between YAML and XML syntax.
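
An explicit glyph set of this kind is straightforward to check mechanically. The sketch below uses the [A-Za-z0-9] example from this comment as a hypothetical rule; the naming rule actually published in the NeXus manual may admit further characters and should be consulted before relying on this.

```python
import re

# Hypothetical rule, taken from the example in the comment above: one or more
# glyphs from the explicit set A-Z, a-z, 0-9 and nothing else.
ALLOWED_NAME = re.compile(r"[A-Za-z0-9]+")


def is_allowed_name(name):
    """Return True if every glyph of name is in the explicit allowed set."""
    return ALLOWED_NAME.fullmatch(name) is not None


for candidate in ("sample", "sample01", "sample_1", "样品"):
    print(candidate, is_allowed_name(candidate))

# An "intuitive" check such as str.isalnum() is not equivalent: it accepts
# any Unicode letter or digit, including CJK ideographs.
print("样品".isalnum())
```

Spelling the set out, rather than relying on a language's notion of "alphanumeric", is what makes the rule the same in every implementation.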

rayosborn commented 1 year ago

Such code snippets already exist as part of the nexusformat package. Here is the nxvalue property of the NXfield class, which is designed to normalize the many variants that different NeXus files contain, i.e., converting byte strings to unicode and size-1 arrays to scalars. This property is used, for example, whenever the value of an NXfield containing NX_CHARs is compared to a regular Python string.

    @property
    def nxvalue(self):
        """NXfield value.

        This is the value stored in the NeXus file, with the following
        exceptions.
            1) Size-1 arrays are returned as scalars.
            2) String or byte arrays are returned as a list of strings.

        Notes
        -----
        If unmodified values are required, use the `nxdata` property.
        """
        _value = self.nxdata
        if _value is None:
            return None
        elif (self.dtype is not None and
              (self.dtype.type == np.string_ or self.dtype.type == np.str_ or
               self.dtype == string_dtype)):
            if self.shape == ():
                return text(_value)
            elif self.shape == (1,):
                return text(_value[0])
            else:
                return [text(value) for value in _value[()]]
        elif self.shape == (1,):
            return _value.item()
        else:
            return _value

Here is the text function, which attempts to cope with different encodings, although it's never guaranteed to work.

def text(value):
    """Return a unicode string.

    Parameters
    ----------
    value : str or bytes
        String or byte array to be converted.

    Returns
    -------
    str
        Converted unicode string

    Notes
    -----
    If the argument is a byte array, the function will decode the array using
    the encoding specified by NX_CONFIG['encoding'], which is initially set to
    the system's default encoding, usually 'utf-8'. If this generates a
    UnicodeDecodeError exception, an alternate encoding is tried. Null
    characters are removed from the return value.
    """
    if isinstance(value, np.ndarray) and value.shape == (1,):
        value = value[0]
    if isinstance(value, bytes):
        try:
            _text = value.decode(NX_CONFIG['encoding'])
        except UnicodeDecodeError:
            if NX_CONFIG['encoding'] == 'utf-8':
                _text = value.decode('latin-1')
            else:
                _text = value.decode('utf-8')
    else:
        _text = str(value)
    return _text.replace('\x00', '').rstrip()
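
The caveat that this is never guaranteed to work can be seen with plain Python bytes. The sketch below does not call nexusformat; it only reproduces the try-UTF-8-then-Latin-1 pattern on two hand-picked byte strings.

```python
# Bytes written as Latin-1 are usually invalid UTF-8, so the fallback rescues
# them: 0xC5 starts a two-byte UTF-8 sequence, but "n" is not a continuation.
latin1_bytes = "Ångström".encode("latin-1")
try:
    decoded = latin1_bytes.decode("utf-8")
except UnicodeDecodeError:
    decoded = latin1_bytes.decode("latin-1")
print(decoded)

# Some byte sequences are valid in both encodings, with different meanings.
# No exception is raised, so nothing signals a wrong guess.
ambiguous = b"\xc3\xa9"
as_utf8 = ambiguous.decode("utf-8")
as_latin1 = ambiguous.decode("latin-1")
print(repr(as_utf8), repr(as_latin1))
```

Decoding as Latin-1 never fails, because every byte maps to a character, so a fallback of this kind can only recover text when the first decoding attempt happens to raise an error.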

In principle, any h5py code that is intended to read an HDF5 file from any source needs to do something similar. I'm not sure we should be providing such code snippets to the general user. That's what APIs are for.

phyy-nx commented 1 year ago

Resolution from Telco: