Unicode characters unsupported in attributes

davidcurie commented 6 months ago

I am trying to store a usid.Dimension("x", "µm", pos_x_vals) with a unicode label because I want my axes labels on generated graphs to use the proper symbol for microns.

I am greeted with the following error:

File /path/to/python3.12/site-packages/sidpy/base/string_utils.py:337, in clean_string_att(att_val)
    335     return att_val
    336 elif np.any([type(x) in [str, unicode, bytes, np.str_] for x in att_val]):
--> 337     return np.array(att_val, dtype='S')
    338 elif isinstance(att_val, (list, tuple)):
    339     # Not sure how to do this elegantly,
    340     for item in att_val:

UnicodeEncodeError: 'ascii' codec can't encode character '\xb5' in position 0: ordinal not in range(128)

My pipeline otherwise works if I use usid.Dimension("x", "um", pos_x_vals) as demonstrated in some of the example notebooks.

davidcurie commented 6 months ago

I can fix this error for myself by adjusting the offending line in sidpy.base.string_utils to replace the S type with U, like so:

def clean_string_att(att_val):
...
    335     return att_val
    336 elif np.any([type(x) in [str, unicode, bytes, np.str_] for x in att_val]):
--> 337     return np.array(att_val, dtype='U')
    338 elif isinstance(att_val, (list, tuple)):

Is this an easy port to implement upstream, or are there legacy concerns for why this needs to be S?

davidcurie commented 6 months ago

I was wrong.

While my np.array(att_val, dtype="U") adjustment fixes my unicode conversion on simple test strings and isolated test cases, it ultimately fails when I try to save it to USID-formatted .hdf5.

The problem then becomes that the labels attribute fails to update on either ancillary dataset.

File "/path/to/python3.12/site-packages/pyUSID/io/hdf_utils/model.py", line 936, in write_main_dataset
    link_as_main(h5_main, h5_pos_inds, h5_pos_vals, h5_spec_inds, h5_spec_vals)
  File "/path/to/python3.12/site-packages/pyUSID/io/hdf_utils/simple.py", line 557, in link_as_main
    return USIDataset(h5_main)
           ^^^^^^^^^^^^^^^^^^^
  File "/path/to/python3.12/site-packages/pyUSID/io/usi_data.py", line 91, in __init__
    if not check_if_main(h5_ref):
           ^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/python3.12/site-packages/pyUSID/io/hdf_utils/simple.py", line 454, in check_if_main
    validate_anc_dset_attrs(h5_pos_inds, h5_pos_vals, is_spec=False)
  File "/path/to/python3.12/site-packages/pyUSID/io/hdf_utils/simple.py", line 488, in validate_anc_dset_attrs
    v_names = get_attr(h5_vals, 'labels')
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/python3.12/site-packages/sidpy/hdf/hdf_utils.py", line 125, in get_attr
    raise KeyError("'{}' is not an attribute in '{}'".format(attr_name, h5_object.name))
KeyError: "'labels' is not an attribute in '/Measurement_000/Channel_000/Position_Values'"

For clarity, my construction pipeline is based on the following example, which uses both sidpy and pyUSID. I am uncertain if the true scope of this issue belongs in sidpy or pyUSID.

import pyUSID as usid
import sidpy
import h5py

⋮
pos_dims = [
   usid.Dimension("x", "µm", pos_x_vals),
   usid.Dimension("y", "µm", pos_y_vals),
]
spec_dims = [usid.Dimension("Wavelength", "nm", spec_vals)]

with h5py.File(path, "w") as h5:
    ⋮
    h5_meas_group = sidpy.prov_utils.create_indexed_group(h5, "Measurement")
    h5_channel_group = sidpy.prov_utils.create_indexed_group(h5_meas_group, "Channel")
    h5_raw = usid.hdf_utils.write_main_dataset(...
                pos_dims=pos_dims,  # Position dimensions
                spec_dims=spec_dims,  # Spectroscopic dimensions
                )

pyUSID: 0.0.12 sidpy: 0.12.3 h5py: 3.11.0

ramav87 commented 6 months ago

I can confirm that sidpy datasets do support unicode attributes and can be saved to hdf5 and recovered without issue. However, you are correct that it seems this is not functioning correctly with USID. I will delve into this a bit further and let you know.

ramav87 commented 5 months ago

I instituted a fix that works on my end, in this commit

Can you please let me know if it resolves your issue? I tested it with a USID dataset and it seemed to function:

num_rows = 32
num_cols = 32
rows_vals = np.linspace(-0.5, 0.5, 32)
cols_vals = np.linspace(2.5, 3.5, 32)

pos_dims = [usid.Dimension('Cols', 'µm', cols_vals),
            usid.Dimension('Rows', 'µm', rows_vals)]

spec_dims = [usid.Dimension('Grays', 'a.u.', np.array([0]))]

new_h5_filename = r'new_exp_file5.hf5'

hf = h5py.File(new_h5_filename, 'a') #Create a new file

hf_meas_grp = hf.create_group('Measurement_000/Channel_000') #Create a group

h5_raw = usid.hdf_utils.write_main_dataset(hf_meas_grp,  # parent HDF5 group
                                           (num_rows * num_cols, 1),  # shape of Main dataset
                                           'Raw_Data',  # Name of main dataset
                                           'Image',  # Physical quantity contained in Main dataset
                                           'nA',  # Units for the physical quantity
                                           pos_dims,  # Position dimensions
                                           spec_dims,  # Spectroscopic dimensions
                                           dtype=np.float32,  # data type / precision
                                           compression='gzip',
                                           main_dset_attrs={'IO_rate': 4E+6})

print(h5_raw.pos_dim_descriptors)

which yielded

['Rows (µm)', 'Cols (µm)']

This fix is now merged into main, so hopefully if you use sidpy from the main branch you will no longer run into this problem.

davidcurie commented 5 months ago

Yes, the above fix solves my issue with no further adjustments to my pipeline. Thanks!

pycroscopy / sidpy

Unicode characters unsupported in attributes #206