Open jcermauwedu opened 1 month ago
Doing some additional testing, we found an edge case that causes a problem. Will diagnose and get a pytest written to exercise that bug.
Reference dataset: https://github.com/OceanGlidersCommunity/OG-format-user-manual/blob/main/og_format_examples_files/sg558_20240206T000000_R.cdl
$ compliance-checker -v -t og:1.0 -t cf:1.8 ~/src/upstream/data/sg558_20240206T000000_R.nc
Running Compliance Checker on the datasets from: ['/home/portal/src/upstream/data/sg558_20240206T000000_R.nc']
Traceback (most recent call last):
File "/home/portal/miniconda3/envs/py311ioos/bin/compliance-checker", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/portal/miniconda3/envs/py311ioos/bin/cchecker.py", line 299, in main
return_value, errors = ComplianceChecker.run_checker(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/portal/miniconda3/envs/py311ioos/lib/python3.11/site-packages/compliance_checker/runner.py", line 78, in run_checker
score_groups = cs.run_all(ds, checker_names, include_checks, skip_checks)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/portal/miniconda3/envs/py311ioos/lib/python3.11/site-packages/compliance_checker/suite.py", line 414, in run_all
checker.setup(ds)
File "/home/portal/miniconda3/envs/py311ioos/lib/python3.11/site-packages/compliance_checker/cf/cf_base.py", line 104, in setup
self._find_geophysical_vars(ds)
File "/home/portal/miniconda3/envs/py311ioos/lib/python3.11/site-packages/compliance_checker/cf/cf_base.py", line 722, in _find_geophysical_vars
self._geophysical_vars[ds_str] = cfutil.get_geophysical_variables(ds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/portal/miniconda3/envs/py311ioos/lib/python3.11/site-packages/compliance_checker/cfutil.py", line 390, in get_geophysical_variables
if is_geophysical(nc, variable) and variable not in get_bounds_variables(nc):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/portal/miniconda3/envs/py311ioos/lib/python3.11/site-packages/compliance_checker/cfutil.py", line 230, in is_geophysical
if variable in get_instrument_variables(nc):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/portal/miniconda3/envs/py311ioos/lib/python3.11/site-packages/compliance_checker/cfutil.py", line 637, in get_instrument_variables
if instrument and instrument in nc.variables:
^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: unhashable type: 'list'
The culprit is caused by a couple of datasets. The issue occurs when there is a list of instruments in the global attribute instrument
.
Example:
string :instrument = "WET Labs {Sea-Bird WETLabs} ECO Puck Triplet BBFL2-IRB scattering fluorescence sensor", "Oxygen optode 4831", "Unpumped CT sail CTD", "Seaglider M1 Glider data logger" ;
This can be listed as a bug. The OG format will typically not have comma delimited instrument values. However, if anyone has a global attribute with a comma separated list, it will bomb the checker.
The instrument attribute you have in the CDL is an array of string values, not a comma separated string of instrument variables. This shouldn't really be causing the code to fail, but I'm not clear from the ACDD 1.3 guidelines which this is based on: https://wiki.esipfed.org/Attribute_Convention_for_Data_Discovery_1-3
Comma-Separated Lists Several attributes explicitly allow the entry of multiple entities as comma-separated values. Any entities within such lists which contain a comma must be enclosed in straight double quotation marks ("), which will not be considered part of the entity. Spaces (ASCII character 32) between the entities are recommended for readability, but not required. Example: 'John Doe, Jane Lee, "L J Smith, Jr." ' The same protocol may be used within free-text attributes, but is only recommended in cases where the attribute is being populated with structured data (rather than unconstrained text).
Later on in the document, there's this guidance on instrument
attribute:
instrument | Name of the contributing instrument(s) or sensor(s) used to create this data set or product. Indicate controlled vocabulary used in instrument_vocabulary.
From this comment, it's not clear how multiple instruments are defined. One would assume comma separated string is valid from the first quote, but it's not clear whether array of string types are also valid in this attribute. So we'll have to clarify this particular case with whoever is maintaining the ACDD standard.
I managed to login to the ESIP Slack. It looks like it is recommended to ask on the mailing list. The last real activity on slack with reference to ACDD was late 2021. Yuhan Rao asked about ACDD 1.3 and was referred to the mailing list. Yuhan is working on the AI Ready Data
checklist. Now to track down the mailing list.
Mailing list requires registration to access archive.
The general question is what are the valid data types for the global attribute instrument
?
The checker does ok for single files, but triggered an error with multiple files. Could that happen due to different CF standards being referenced in different datasets? If all the datasets had the same table number, it might work. I can retest and add a reply to this issue.
I think this highlights the pre-loading issue in the architecture? When the CF class is initiated, the standard name table is immediately initialized.