ioos / compliance-checker

Python tool to check your datasets against compliance standards
http://ioos.github.io/compliance-checker/
Apache License 2.0
104 stars 56 forks source link

Bug: Datasets with comma delimited global instrument string values #1082

Open jcermauwedu opened 1 month ago

jcermauwedu commented 1 month ago

The checker does ok for single files, but triggered an error with multiple files. Could that happen due to different CF standards being referenced in different datasets? If all the datasets had the same table number, it might work. I can retest and add a reply to this issue.

I think this highlights the pre-loading issue in the architecture? When the CF class is initiated, the standard name table is immediately initialized.

$ compliance-checker -v -t og:1.0 -t cf:1.8 ~/src/upstream/data/*.nc
Running Compliance Checker on the datasets from: ['/home/portal/src/upstream/data/dfo-rosie713-20190615.nc', '/home/portal/src/upstream/data/gcoos_ioos-station-wmo-42874_2024_05.nc', '/home/portal/src/upstream/data/gcoos_ioos-station-wmo-42881_2024_05.nc', '/home/portal/src/upstream/data/nos.ngofs2.fields.f045.20240522.t03z.nc', '/home/portal/src/upstream/data/nos.sfbofs.fields.n006.20240522.t03z.nc', '/home/portal/src/upstream/data/sea076_20230906T0852_R.nc', '/home/portal/src/upstream/data/sg558_20240206T000000_R.nc', '/home/portal/src/upstream/data/sp041_20191205T1757_R.nc', '/home/portal/src/upstream/data/unit_345_20231112T000000_R.nc']
Using cached standard name table v49 from /home/portal/.local/share/compliance-checker/cf-standard-name-table-test-49.xml
Using cached standard name table v73 from /home/portal/.local/share/compliance-checker/cf-standard-name-table-test-73.xml
Using cached standard name table v73 from /home/portal/.local/share/compliance-checker/cf-standard-name-table-test-73.xml
Using cached standard name table v49 from /home/portal/.local/share/compliance-checker/cf-standard-name-table-test-49.xml
Traceback (most recent call last):
  File "/home/portal/miniconda3/envs/py311ioos/bin/compliance-checker", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/portal/miniconda3/envs/py311ioos/bin/cchecker.py", line 299, in main
    return_value, errors = ComplianceChecker.run_checker(
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/portal/miniconda3/envs/py311ioos/lib/python3.11/site-packages/compliance_checker/runner.py", line 78, in run_checker
    score_groups = cs.run_all(ds, checker_names, include_checks, skip_checks)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/portal/miniconda3/envs/py311ioos/lib/python3.11/site-packages/compliance_checker/suite.py", line 414, in run_all
    checker.setup(ds)
  File "/home/portal/miniconda3/envs/py311ioos/lib/python3.11/site-packages/compliance_checker/cf/cf_base.py", line 104, in setup
    self._find_geophysical_vars(ds)
  File "/home/portal/miniconda3/envs/py311ioos/lib/python3.11/site-packages/compliance_checker/cf/cf_base.py", line 722, in _find_geophysical_vars
    self._geophysical_vars[ds_str] = cfutil.get_geophysical_variables(ds)
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/portal/miniconda3/envs/py311ioos/lib/python3.11/site-packages/compliance_checker/cfutil.py", line 390, in get_geophysical_variables
    if is_geophysical(nc, variable) and variable not in get_bounds_variables(nc):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/portal/miniconda3/envs/py311ioos/lib/python3.11/site-packages/compliance_checker/cfutil.py", line 230, in is_geophysical
    if variable in get_instrument_variables(nc):
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/portal/miniconda3/envs/py311ioos/lib/python3.11/site-packages/compliance_checker/cfutil.py", line 637, in get_instrument_variables
    if instrument and instrument in nc.variables:
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: unhashable type: 'list'
jcermauwedu commented 1 month ago

Doing some additional testing, we found an edge case that causes a problem. Will diagnose and get a pytest written to exercise that bug.

Reference dataset: https://github.com/OceanGlidersCommunity/OG-format-user-manual/blob/main/og_format_examples_files/sg558_20240206T000000_R.cdl

$ compliance-checker -v -t og:1.0 -t cf:1.8 ~/src/upstream/data/sg558_20240206T000000_R.nc
Running Compliance Checker on the datasets from: ['/home/portal/src/upstream/data/sg558_20240206T000000_R.nc']
Traceback (most recent call last):
  File "/home/portal/miniconda3/envs/py311ioos/bin/compliance-checker", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/portal/miniconda3/envs/py311ioos/bin/cchecker.py", line 299, in main
    return_value, errors = ComplianceChecker.run_checker(
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/portal/miniconda3/envs/py311ioos/lib/python3.11/site-packages/compliance_checker/runner.py", line 78, in run_checker
    score_groups = cs.run_all(ds, checker_names, include_checks, skip_checks)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/portal/miniconda3/envs/py311ioos/lib/python3.11/site-packages/compliance_checker/suite.py", line 414, in run_all
    checker.setup(ds)
  File "/home/portal/miniconda3/envs/py311ioos/lib/python3.11/site-packages/compliance_checker/cf/cf_base.py", line 104, in setup
    self._find_geophysical_vars(ds)
  File "/home/portal/miniconda3/envs/py311ioos/lib/python3.11/site-packages/compliance_checker/cf/cf_base.py", line 722, in _find_geophysical_vars
    self._geophysical_vars[ds_str] = cfutil.get_geophysical_variables(ds)
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/portal/miniconda3/envs/py311ioos/lib/python3.11/site-packages/compliance_checker/cfutil.py", line 390, in get_geophysical_variables
    if is_geophysical(nc, variable) and variable not in get_bounds_variables(nc):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/portal/miniconda3/envs/py311ioos/lib/python3.11/site-packages/compliance_checker/cfutil.py", line 230, in is_geophysical
    if variable in get_instrument_variables(nc):
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/portal/miniconda3/envs/py311ioos/lib/python3.11/site-packages/compliance_checker/cfutil.py", line 637, in get_instrument_variables
    if instrument and instrument in nc.variables:
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: unhashable type: 'list'
jcermauwedu commented 1 month ago

The culprit is caused by a couple of datasets. The issue occurs when there is a list of instruments in the global attribute instrument.

Example:

string :instrument = "WET Labs {Sea-Bird WETLabs} ECO Puck Triplet BBFL2-IRB scattering fluorescence sensor", "Oxygen optode 4831", "Unpumped CT sail CTD", "Seaglider M1 Glider data logger" ;
jcermauwedu commented 1 month ago

This can be listed as a bug. The OG format will typically not have comma delimited instrument values. However, if anyone has a global attribute with a comma separated list, it will bomb the checker.

benjwadams commented 1 month ago

The instrument attribute you have in the CDL is an array of string values, not a comma separated string of instrument variables. This shouldn't really be causing the code to fail, but I'm not clear from the ACDD 1.3 guidelines which this is based on: https://wiki.esipfed.org/Attribute_Convention_for_Data_Discovery_1-3

Comma-Separated Lists Several attributes explicitly allow the entry of multiple entities as comma-separated values. Any entities within such lists which contain a comma must be enclosed in straight double quotation marks ("), which will not be considered part of the entity. Spaces (ASCII character 32) between the entities are recommended for readability, but not required. Example: 'John Doe, Jane Lee, "L J Smith, Jr." ' The same protocol may be used within free-text attributes, but is only recommended in cases where the attribute is being populated with structured data (rather than unconstrained text).

Later on in the document, there's this guidance on instrument attribute:

instrument | Name of the contributing instrument(s) or sensor(s) used to create this data set or product. Indicate controlled vocabulary used in instrument_vocabulary.

From this comment, it's not clear how multiple instruments are defined. One would assume comma separated string is valid from the first quote, but it's not clear whether array of string types are also valid in this attribute. So we'll have to clarify this particular case with whoever is maintaining the ACDD standard.

jcermauwedu commented 1 month ago

I managed to login to the ESIP Slack. It looks like it is recommended to ask on the mailing list. The last real activity on slack with reference to ACDD was late 2021. Yuhan Rao asked about ACDD 1.3 and was referred to the mailing list. Yuhan is working on the AI Ready Data checklist. Now to track down the mailing list.

Mailing list requires registration to access archive.

jcermauwedu commented 1 month ago

The general question is what are the valid data types for the global attribute instrument?