nexusformat / definitions

Definitions of the NeXus Standard File Structure and Contents
https://manual.nexusformat.org/

Improve text around handling strings #1480

Open tacaswell opened 1 week ago

tacaswell commented 1 week ago

from https://manual.nexusformat.org/nxdl-types.html#data-types-allowed-in-nxdl-specifications

NX_CHAR: The preferred string representation is UTF-8. Both fixed-length strings and variable-length strings are valid. String arrays cannot be used where only a string is expected (title, start_time, end_time, NX_class attribute,…). Fields or attributes requiring the use of string arrays will be clearly marked as such (like the NXdata attribute auxiliary_signals). This is the default field type.
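To make the "fixed-length and variable-length" wording concrete, here is a minimal sketch of writing the same scalar string both ways with h5py. The dataset names `title_vlen` and `title_fixed` and the temporary file path are made up for illustration and are not taken from the NeXus manual. Note that for the fixed-length case the length is a byte count of the encoded text, not a character count.

```python
import os
import tempfile

import h5py

text = "你好世界"
# Hypothetical scratch file, only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "nx_char.h5")

with h5py.File(path, "w") as f:
    # Variable-length UTF-8 string: what h5py uses for Python str by default.
    f.create_dataset(
        "title_vlen", data=text, dtype=h5py.string_dtype(encoding="utf-8")
    )
    # Fixed-length UTF-8 string: the length is in bytes, so encode first.
    encoded = text.encode("utf-8")
    f.create_dataset(
        "title_fixed",
        data=encoded,
        dtype=h5py.string_dtype(encoding="utf-8", length=len(encoded)),
    )

with h5py.File(path, "r") as f:
    # check_string_dtype reports the encoding and length stored in the file.
    vlen_info = h5py.check_string_dtype(f["title_vlen"].dtype)
    fixed_info = h5py.check_string_dtype(f["title_fixed"].dtype)
    # asstr() decodes the stored bytes back to Python str on read.
    vlen_value = f["title_vlen"].asstr()[()]
    fixed_value = f["title_fixed"].asstr()[()]
```

In both cases a reader gets the same Python `str` back; only the on-disk layout differs.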

At the NeXus level we should decide whether "NX_CHAR" is "sequence-of-char-as-8-bit-good-luck-with-encoding" à la C, or "sequence of Unicode code points" à la strings in modern programming languages.

If it is the second, then we should drop the sentence. If it is the first, we should at least change the wording to "encoding" (rather than "representation"), possibly change it to "when using HDF5, use the UTF-8 encoding", or still consider dropping it.

For reference the h5py docs on strings: https://docs.h5py.org/en/stable/strings.html#strings and notes on encoding https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/


I think we should go with the second option and assert that the details of the encoding are the business of the underlying file format, not of NeXus proper (any more than we would pull endianness up into NeXus). In the case of XML, the whole file has an encoding (which should be declared at the top!), and HDF5 (and h5py) can also handle this:

import h5py
a = '你好世界'
# h5py.File (not the built-in open) is needed to create an HDF5 file
with h5py.File('/tmp/test.h5', 'w') as f:
    f['a'] = [a, a+a, 'bob']

which, if we poke at the file, gives:

In [10]: f = h5py.File('/tmp/test.h5')

In [11]: f['a'].dtype
Out[11]: dtype('O')

In [12]: f['a'].asstr()[:]
Out[12]: array(['你好世界', '你好世界你好世界', 'bob'], dtype=object)

In [13]: f['a'][:]
Out[13]: 
array([b'\xe4\xbd\xa0\xe5\xa5\xbd\xe4\xb8\x96\xe7\x95\x8c',
       b'\xe4\xbd\xa0\xe5\xa5\xbd\xe4\xb8\x96\xe7\x95\x8c\xe4\xbd\xa0\xe5\xa5\xbd\xe4\xb8\x96\xe7\x95\x8c',
       b'bob'], dtype=object)

This works because h5py does record the encoding in the HDF5 file:

In [24]: from h5py import h5t

In [25]: h5t.check_string_dtype(f['a'].dtype)
Out[25]: string_info(encoding='utf-8', length=None)

and you can see it in h5dump

@ h5dump /tmp/test.h5
HDF5 "/tmp/test.h5" {
GROUP "/" {
   DATASET "a" {
      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SIMPLE { ( 3 ) / ( 3 ) }
      DATA {
      (0): "\37777777744\37777777675\37777777640\37777777745\37777777645\37777777675\37777777744\37777777670\37777777626\37777777747\37777777625\37777777614",
      (1): "\37777777744\37777777675\37777777640\37777777745\37777777645\37777777675\37777777744\37777777670\37777777626\37777777747\37777777625\37777777614\37777777744\37777777675\37777777640\37777777745\37777777645\37777777675\37777777744\37777777670\37777777626\37777777747\37777777625\37777777614",
      (2): "bob"
      }
   }
}
}

This shows that h5py does this using what I believe are standard HDF5 tools, so I would expect it to be available from any language.

tacaswell commented 1 week ago

https://docs.hdfgroup.org/archive/support/HDF5/doc/RM/RM_H5T.html#Datatype-SetCset is the upstream HDF5 documentation; it says this has been available since 1.8 and that the only supported encodings are ASCII and UTF-8.
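
As a rough sketch of how a reader could inspect that character-set flag directly, the snippet below goes through h5py's low-level `h5t` interface instead of `check_string_dtype`. The helper name `describe_cset`, the dataset names, and the temporary path are hypothetical and only for illustration; the low-level calls are the ones I understand h5py to expose over the HDF5 datatype API.

```python
import os
import tempfile

import h5py
from h5py import h5t

# Hypothetical scratch file, only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "cset.h5")

with h5py.File(path, "w") as f:
    # Python str values are stored as variable-length UTF-8 by default.
    f["a"] = ["你好世界", "bob"]
    # Explicitly request the ASCII character set for comparison.
    f.create_dataset(
        "ascii", data=["bob"], dtype=h5py.string_dtype(encoding="ascii")
    )


def describe_cset(dataset):
    """Return 'ascii' or 'utf-8' for a string dataset, else None."""
    tid = dataset.id.get_type()  # low-level HDF5 datatype of the dataset
    if not isinstance(tid, h5t.TypeStringID):
        return None
    names = {h5t.CSET_ASCII: "ascii", h5t.CSET_UTF8: "utf-8"}
    return names.get(tid.get_cset())


with h5py.File(path, "r") as f:
    cset_a = describe_cset(f["a"])
    cset_ascii = describe_cset(f["ascii"])
    a_is_vlen = f["a"].id.get_type().is_variable_str()
```

This reads the same flag that h5dump prints as `CSET`, which is why the encoding can be treated as a property of the file rather than something NeXus has to specify.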