prjemian / punx

Python Utilities for NeXus HDF5 files
https://prjemian.github.io/punx
5 stars 7 forks source link

API: Render axes attributes as list of double quoted strings #155

Closed carterbox closed 2 years ago

carterbox commented 2 years ago

Previously, arrays of strings or byte arrays were truncated to only show the first element, but this is incompatible with the Nexus standard which expects axes to be provided as an array of strings.

This PR changes the behavior of utils.decode_byte_string() and h5tree.renderAttribute so that all types of strings are converted to strings and arrays of strings are converted to lists of strings. This allows rendering of strings and lists of strings with double quotes. However, it also causes some knock-on effects where unrelated functions which always expected a string, now need to deal with lists of strings. These cases which caused existing tests to fail were patched.

Closes #129

prjemian commented 2 years ago

Actually, this is not quite what I had in mind. More like (with this file created):

import h5py

with h5py.File("/tmp/demo.h5", "w") as h5:
    nxentry = h5.create_group("Scan")
    nxentry.attrs["NX_class"] = "NXentry"

    nxdata = nxentry.create_group("data")
    nxdata.attrs["NX_class"] = "NXdata"
    nxdata.attrs["axes"] = "two_theta"
    nxdata.attrs["signal"] = "counts"
    nxdata.attrs["two_theta_indices"] = 0
    nxdata.attrs["multi_str"] = "one two three".split()

    ds = nxentry.create_dataset(
        "counts",
        data=[
            1037, 1318, 1704, 2857, 4516
        ]
    )
    ds.attrs["units"] = "counts"

    ds = nxentry.create_dataset(
        "two_theta",
        data=[
            17.92608, 17.92591, 17.92575, 17.92558, 17.92541
        ]
    )
    ds.attrs["units"] = "degrees"

Should give a report like this:

(bluesky_2022_1) prjemian@zap:~/.../prjemian/punx$ punx tree /tmp/demo.h5 

!!! WARNING: this program is not ready for distribution.

/tmp/demo.h5 : NeXus data file
  Scan:NXentry
    @NX_class = NXentry
    counts:NX_INT64[5] = [1037, 1318, 1704, 2857, 4516]
      @units = counts
    two_theta:NX_FLOAT64[5] = [17.92608, 17.92591, 17.92575, 17.92558, 17.92541]
      @units = degrees
    data:nxdata
      @NX_class = NXdata
      @axes = two_theta
      @multi_str = ["one", "two", "three"]
      @signal = counts
      @two_theta_indices = 0

Today, the code renders this one attribute: @multi_str = one (it only gets the first element in the list)

Thinking that all string attributes should be rendered with double quotes and lists of strings should be rendered with square brackets.

prjemian commented 2 years ago

more like this:

(bluesky_2022_1) prjemian@zap:~/.../prjemian/punx$ punx tree /tmp/demo.h5 

!!! WARNING: this program is not ready for distribution.

/tmp/demo.h5 : NeXus data file
  Scan:NXentry
    @NX_class = "NXentry"
    counts:NX_INT64[5] = [1037, 1318, 1704, 2857, 4516]
      @units = "counts"
    two_theta:NX_FLOAT64[5] = [17.92608, 17.92591, 17.92575, 17.92558, 17.92541]
      @units = "degrees"
    data:nxdata
      @NX_class = "NXdata"
      @axes = "two_theta"
      @multi_str = ["one", "two", "three"]
      @signal = "counts"
      @two_theta_indices = 0
carterbox commented 2 years ago

I changed the behavior of decode_byte_string because I don't think anyone would use numpy.array(["this is a test"]). When they should be doing numpy.array(["this", "is", "a", "test"]). Is that OK?

carterbox commented 2 years ago

I don't know why the CI passes. There are a few failing tests for me locally!

prjemian commented 2 years ago

Please copy the complete the console failure report here.

carterbox commented 2 years ago
/home/dching/Documents/punx/punx/data/Data_Q.h5 : NeXus data file
  sasentry01:NXentry
    @NX_class = ["NXentry"]
    @canSAS_class = ["SASentry"]
    end_time:NX_CHAR = b'2014-04-25T09:34:06.087'
    start_time:NX_CHAR = b'2014-04-25T09:33:43.637'
    sasdata01:NXdata
      @I_axes = ["Q,Q"]
      @NX_class = ["NXdata"]
      @Q_indices = ["0,1"]
      @canSAS_class = ["SASdata"]
      @canSAS_version = ["0.1"]
      @signal = ["I"]
      I:NX_FLOAT32[100,100] = __array
        __array = [
            [0.0, 0.0, 0.0, '...', 0.0]
            [0.0, 0.0, 0.0, '...', 0.0]
            [0.0, 0.0, 0.0, '...', 0.0]
            ...
            [0.0, 0.0, 0.0, '...', 0.0]
          ]
        @IGORWaveType = 2
        @axes = ["Q"]
        @signal = [1]
        @uncertainty = [""]
      Q:NX_FLOAT32[100,100] = __array
        __array = [
            [0.46678692, 0.45850673, 0.45038438, '...', 0.45968014]
            [0.45850673, 0.4500742, 0.44179693, '...', 0.45126957]
            [0.45038444, 0.44179687, 0.43336153, '...', 0.4430146]
            ...
            [0.45968014, 0.45126957, 0.4430146, '...', 0.45246175]
          ]
        @IGORWaveType = 2

This example file for the validation tests is now "wrong" because it uses a numpy array of a single string. What should I do about that?

carterbox commented 2 years ago

The validate tests are currently failing where they expect a string but get a list instead. For example:

punx/validate.py:239: in validate
    self.build_address_catalog()
punx/validate.py:269: in build_address_catalog
    self._group_address_catalog_(None, self.h5)
punx/validate.py:297: in _group_address_catalog_
    self._group_address_catalog_(parent, group[item])
punx/validate.py:293: in _group_address_catalog_
    obj = get_subject(parent, group)
punx/validate.py:283: in get_subject
    v = ValidationItem(parent, o)
punx/validate.py:409: in __init__
    self.classpath = self.determine_NeXus_classpath()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <punx.validate.ValidationItem object at 0x7f062d75f700>

    def determine_NeXus_classpath(self):
        """
        determine the NeXus class path

        :see: http://download.nexusformat.org/sphinx/preface.html#class-path-specification

        EXAMPLE

        Given this NeXus data file structure::

            /
                entry: NXentry
                    data: NXdata
                        @signal = data
                        data: NX_NUMBER

        For the "signal" attribute of this HDF5 address: ``/entry/data``,
        its NeXus class path is: ``/NXentry/NXdata@signal``

        The ``@signal`` attribute has the value of ``data`` which means
        that the local field named ``data`` is the plottable data.

        The HDF5 address of the plottable data is: ``/entry/data/data``,
        its NeXus class path is: ``/NXentry/NXdata/data``
        """
        if self.name == SLASH:
            return ""
        else:
            h5_obj = self.h5_object

            classpath = str(self.parent.classpath)
            if classpath == CLASSPATH_OF_NON_NEXUS_CONTENT:
                logger.log(INFORMATIVE, "%s is not NeXus content", h5_obj.name)
                return CLASSPATH_OF_NON_NEXUS_CONTENT

            if not classpath.endswith(SLASH):
                if utils.isHdf5Group(h5_obj):
                    if "NX_class" in h5_obj.attrs:
                        nx_class = utils.decode_byte_string(h5_obj.attrs["NX_class"])
>                       if nx_class.startswith("NX"):
E                       AttributeError: 'list' object has no attribute 'startswith'

punx/validate.py:486: AttributeError

Fails because nx_class is ["NX"] instead of "NX"

prjemian commented 2 years ago

I agree that the examples are incorrect. They can be used to test for invalid structures. Will the proper structure be caught by other tests? If so, then the test can be modified to check that invalid structure is detected.

It is possible that other tests and data files may need such review.

carterbox commented 2 years ago

unitests aren't being run on this PR. It's because I'm using my own fork instead of using branches here. Should I add:

on: [push, pull_request]

to the GitHub action?

prjemian commented 2 years ago

Should I add:

Sure.

prjemian commented 2 years ago

The test says it is testing a list of axes but the only list attribute is a different one. I'll modify and upload.

On Sat, Nov 13, 2021, 4:31 PM Daniel Ching @.***> wrote:

@.**** commented on this pull request.

In punx/tests/test_issue129.py https://github.com/prjemian/punx/pull/155#discussion_r748772828:

+ +def test_render_multiple_axes_attribute(hfile):

  • """Ensure axes attributes are rendered as list of double quoted strings.
  • @axes should be saved as an array of byte strings or an array of object
  • strings because the NeXus standard says so. Single strings separated by
  • whitespace or other charachters will not be rendered correctly.
  • """
  • assert os.path.exists(hfile)
  • with h5py.File(hfile, "w") as h5:
  • nxentry = h5.create_group("Scan")
  • nxentry.attrs["NX_class"] = "NXentry"
  • nxdata = nxentry.create_group("data")
  • nxdata.attrs["NX_class"] = "NXdata"
  • nxdata.attrs["axes"] = "two_theta"

You want me to only compare that one line in the report to the reference instead of all lines in the report?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/prjemian/punx/pull/155#discussion_r748772828, or unsubscribe https://github.com/notifications/unsubscribe-auth/AARMUMF7ONPEKPN2DOKTDOLUL3RMBANCNFSM5HJ4WXZA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

prjemian commented 2 years ago

The point is that @axes in the existing test is not a list of str. The test must construct and test a test data file with two or more independent axes and a @signal with commensurate dimensions. 2-D (or higher) data. For this, need to consult the NeXus documentation and construct the data. Also need to confirm the data type of the @axes attribute, whether str or list(str).

prjemian commented 2 years ago

In the NeXus documentation (in the Index), the @axes attribute can be added either to the NXdata group (preferred), or to a dataset (discouraged, this is the old way, still valid but discouraged).

The 2-D example shows the structure is list(str) as shown:

@axes=["time","pressure"]

where the datasets time and pressure exist in the same group.

The unit test should create a data file with the structure of the 2-D example and assert the structure and content of the @axes attribute and also verify that the named axes exist as datasets in the same group.

prjemian commented 2 years ago

Random numbers for the actual data (of the datasets) are sufficient. These values will not be tested.

prjemian commented 2 years ago

Removed the print() calls since they do not contribute to the tests and will not appear in the pytest console output.

prjemian commented 2 years ago

Number data types are OS-dependent as shown in this example when testing on Windows 10:

        # compare line-by-line, except for file name
        # TODO: data size is OS-dependent?
        for ref, xture in zip(report[1:], structure[1:]):
>           assert ref.strip() == xture.strip()
E           AssertionError: assert 'data:NX_INT32[4,3] = [ ... ]' == 'data:NX_INT64[4,3] = [ ... ]'
E             - data:NX_INT64[4,3] = [ ... ]
E             ?            ^^
E             + data:NX_INT32[4,3] = [ ... ]
E             ?            ^^

punx\tests\test_issue129.py:77: AssertionError
prjemian commented 2 years ago

@carterbox This is ready for your comment. Should I recycle my review approval?