ioos / catalog

IOOS Catalog general repo for documentation and issues
https://ioos.github.io/catalog/
MIT License

GCMD keyword syntax issue on NANOOS datasets #74

Closed mwengren closed 3 years ago

mwengren commented 4 years ago

@emiliom this is mostly an FYI, but we recently added some new UI capabilities for GCMD keywords display in the IOOS Catalog. I happened to be researching RA models in the Catalog and found the SELFE model record (https://data.ioos.us/dataset/cmop-virtual-columbia-river-selfe-f3374ea3).

However, the GCMD keywords in the source metadata don't use the proper syntax of '>' to show the hierarchy. When they do, the result looks more like this: https://data.ioos.us/dataset/uw157-20190916t0000.

Same for at least this other ROMS model: https://data.ioos.us/dataset/regional-ocean-modeling-system-roms-oregon-coast9e82d

If you can fix or cause them to be fixed, let me know and I'll close this out.

emiliom commented 4 years ago

The IOOS Catalog UI for GCMD keywords looks great! Thanks for pointing out the metadata shortcoming in our two NANOOS models. Those should be easy fixes.

I'm pinging @cseaton and @crisien, who are responsible for the SELFE and OSU ROMS THREDDS servers, respectively. I'll follow up with them.

emiliom commented 4 years ago

@mwengren Both THREDDS servers have now been corrected. The SELFE model was done yesterday, and per our ncISO > NANOOS WAF job, the WAF metadata XMLs were updated at 1am PT. The corresponding IOOS catalog record isn't updated yet, but I assume the change should fully propagate within 24 hours? The OSU ROMS model was done a couple of hours ago. (Thanks so much, @cseaton and @crisien!)

While we're on the topic of keywords and the IOOS Catalog, why are GCMD Keywords and CF standard names largely replicated in the "Freeform Tags" section? I see it both on the model records and the sample glider records you included. Is there a good reason for doing that?

I'll check back on Monday when the updates to both model records should be propagated to the catalog.

mwengren commented 4 years ago

@emiliom we have an issue for that!

see: ioos/catalog-ckan#209

I suspect there may be issues with other aspects of the site if they're removed (like filtering in the faceted filter in the side navbar), but @benjwadams may know more about that.

So, it's at least on the list of improvements to consider. Hopefully it will be fixed, or closed with no action if it's not feasible, by the milestone due date at the end of January.

mwengren commented 4 years ago

@emiliom I had a look this morning, and it looks like it's not quite there yet.

The ROMS record doesn't appear to have changed. I'm not sure whether that's an issue somewhere in the harvest pipeline or whether the source ISO XML didn't receive an updated element, which is what the harvester checks to trigger a reharvest. It says 2019-12-13, so the latter is not likely the case.

For the SELFE model, it looks like the structure of the GCMD keywords section still isn't 100%. They need to be separated into individual elements a la this Glider DAC example: https://registry.ioos.us/waf/Glider%20DAC/f542c2e56147214df00e422896aaee88e898363a.xml

The SELFE record is still presenting them as a single element: https://registry.ioos.us/waf/NANOOS/d9fce8e42f73b586d5a4e388aeed533b4f5d0107.xml, which is why you see the strange hierarchy on the CKAN page: https://data.ioos.us/dataset/cmop-virtual-columbia-river-selfe-f3374ea3. Could be an ncISO issue, but more likely a syntax problem in ncML or netCDF.
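
A quick way to tell the two cases apart is to list the text of every gmd:keyword element in a record and flag any that still contain a comma. The following is only a minimal sketch: `keyword_strings`, `merged_keywords`, and the inline `SAMPLE` fragment are hypothetical names and data made up for illustration, and the comma test is a heuristic, not a rule from the ISO schema.

```python
import xml.etree.ElementTree as ET

# Standard ISO 19139 namespaces used by gmd:keyword / gco:CharacterString.
NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}

def keyword_strings(xml_text):
    """Return the text of every gmd:keyword/gco:CharacterString in the record."""
    root = ET.fromstring(xml_text)
    path = ".//gmd:keyword/gco:CharacterString"
    return [(el.text or "").strip() for el in root.findall(path, NS)]

def merged_keywords(xml_text):
    """Heuristic: a keyword containing a comma was probably never split."""
    return [kw for kw in keyword_strings(xml_text) if "," in kw]

# Hypothetical fragment that mimics the "single element" problem.
SAMPLE = """<gmd:MD_Keywords xmlns:gmd="http://www.isotc211.org/2005/gmd"
                 xmlns:gco="http://www.isotc211.org/2005/gco">
  <gmd:keyword>
    <gco:CharacterString>Oceans &gt; Ocean Temperature, Oceans &gt; Salinity/Density</gco:CharacterString>
  </gmd:keyword>
</gmd:MD_Keywords>"""

for kw in merged_keywords(SAMPLE):
    print("possibly merged:", kw)
```

Running the same check against a saved copy of a WAF record should show whether the keywords arrive as one element or many.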

emiliom commented 4 years ago

Thanks @mwengren. @cseaton and I did more digging on the SELFE model. The ISO XML directly available from his THREDDS server does show the GCMD keywords as individual gmd:keyword elements; see the XML here

However, I feed the NANOOS WAF using stand-alone ncISO (vers 2.3.1), and the resulting ISO XML mushes all the keywords into one element.

I could switch to just grabbing from that thredds xml url, but I see there are other differences in the xml file, so I don't know what's best. I also see that my stand-alone ncISO is old, and I could easily try updating to the latest (2.3.5), though I don't see any references to this issue in the release notes.

What do you recommend?

emiliom commented 4 years ago

Update: I just upgraded to the latest stand-alone ncISO (2.3.5) and reran it, but no cigar. GCMD keywords still get combined into a single gmd:keyword element. For what it's worth, ncISO 2.3.5 is nearly a year old, while @cseaton's THREDDS is pretty old, Version 4.3.23 - 20140826.1617.

mwengren commented 4 years ago

@kevin-obrien @noaaroland Can we ask for your help with understanding why the ncISO releases Emilio is using here are producing the garbled GCMD keywords?

Here's an example that shows the merged keywords we refer to in this issue: https://data.ioos.us/dataset/cmop-virtual-columbia-river-selfe-f3374ea3 (or see more above). Mostly wondering if there's an issue in the software or something in the source data/ncML formatting.

emiliom commented 4 years ago

Thanks. On a side note (relative to the ask about ncISO), I just realized that the OSU ROMS model catalog record @mwengren originally pointed out is not the THREDDS server @crisien manages. I'd forgotten that we have that model available via two servers, and two corresponding catalog records, THREDDS (Craig's) and Hyrax. The Hyrax one is managed by someone else for specific applications. I won't volunteer to look into the GCMD keywords issue on that one yet. The catalog record for Craig's THREDDS server is https://data.ioos.us/dataset/regional-ocean-modeling-system-roms-oregon-coast8447b It, too, has GCMD keyword issues, but, if possible, let's keep this issue focused on just the CMOP SELFE THREDDS record (https://data.ioos.us/dataset/cmop-virtual-columbia-river-selfe-f3374ea3) for now, to minimize confusion & complexity.

noaaroland commented 4 years ago

As far as I can tell, this question boils down to which piece of software is responsible for splitting the comma-separated string of GCMD keywords that is stuffed into the netCDF attribute.

ERDDAP appears to do this, and the software building the graphical display probably ought to do it as well, in case it encounters an ISO file with a keyword that is a comma-separated string.

That said, since the netCDF "keywords" attribute is always a comma-separated list, the XSLT template can be modified to split on the commas. See the attached file (example.zip) for an example output from a modified XSL file.

If this output is good, I will make a release using this proposed new template.
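
For readers who want to see the idea outside XSLT, the split can be sketched in Python. This is not the ncISO template itself, just an illustration under assumptions: `split_keywords`, `build_keyword_block`, and the `sample` string are hypothetical, and the output is a bare gmd:MD_Keywords fragment rather than a full ISO record.

```python
import xml.etree.ElementTree as ET

GMD = "http://www.isotc211.org/2005/gmd"
GCO = "http://www.isotc211.org/2005/gco"

def split_keywords(attr_value):
    """Split a comma-separated keywords attribute into trimmed, non-empty entries."""
    return [kw.strip() for kw in attr_value.split(",") if kw.strip()]

def build_keyword_block(attr_value):
    """Emit one gmd:keyword element per entry instead of one merged element."""
    ET.register_namespace("gmd", GMD)
    ET.register_namespace("gco", GCO)
    md_keywords = ET.Element(f"{{{GMD}}}MD_Keywords")
    for kw in split_keywords(attr_value):
        keyword = ET.SubElement(md_keywords, f"{{{GMD}}}keyword")
        char_string = ET.SubElement(keyword, f"{{{GCO}}}CharacterString")
        char_string.text = kw
    return md_keywords

# Hypothetical attribute value using the '>' hierarchy syntax.
sample = ("Oceans > Ocean Temperature > Potential Temperature, "
          "Oceans > Salinity/Density > Salinity")
block = build_keyword_block(sample)
print(ET.tostring(block, encoding="unicode"))
```

The serializer escapes the `>` characters as `&gt;` in the printed XML, which is expected; an XML parser turns them back into `>` when the record is read.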

emiliom commented 4 years ago

Your sample output from the modified XSL looks like it's doing the job of splitting the GCMD keywords. I didn't check for anything else, though. Thanks!

noaaroland commented 4 years ago

https://github.com/NOAA-PMEL/uafnciso/releases/tag/2.3.6

emiliom commented 4 years ago

Wow, thank you!! That was fast. I've downloaded 2.3.6 and changed my nciso script to use that version the next time it runs, overnight. I'll report back.

mwengren commented 4 years ago

@emiliom Reviewing open Catalog issues - I still see some GCMD keyword hierarchy issues here: https://data.ioos.us/dataset/cmop-virtual-columbia-river-selfe-f3374ea3.

The source record from the NANOOS WAF looks like the keywords themselves may not be formed correctly:

<gco:CharacterString>
Oceans; Ocean Temperature; Potential Temperature, Oceans; Salinity/Density; Salinity, Oceans; Sea Surface Topography; Sea Surface Height, Oceans; sea_water_potential_temperature ;sea_water_temperature; sea_water_salinity; Ocean Circulation; Ocean Currents; x_sea_water_velocity; y_sea_water_velocity
</gco:CharacterString>

There should be > characters between the elements in the hierarchy rather than semicolons, correct? This looks like a separate problem from the ncISO issue. I'm not sure whether that one is resolved, because the output doesn't appear to be split on the commas either. I wonder if the two could be related.

Also, the ; vs > issue seems to affect the OSU ROMS record as well: http://data.nanoos.org/metadata/ioos/thredds/thredds_dodsC_NANOOS_OCOS.xml
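
To make the ; vs > point concrete, here is a rough sketch of how such entries could be flagged and what the intended form would presumably look like. The helper names and the shortened `sample` string are hypothetical, and rewriting every semicolon as a hierarchy separator is an assumption about what the data provider meant; the mixed-in CF standard names in the real record would need separate handling.

```python
import re

def flag_semicolons(attr_value):
    """Return comma-separated entries that use ';' where '>' is expected."""
    entries = [e.strip() for e in attr_value.split(",") if e.strip()]
    return [e for e in entries if ";" in e]

def normalize_hierarchy(keyword):
    """Rewrite 'A; B; C' as 'A > B > C' (assumed intended hierarchy)."""
    parts = [p.strip() for p in re.split(r"[;>]", keyword) if p.strip()]
    return " > ".join(parts)

# First two entries of the string quoted above, used as sample input.
sample = ("Oceans; Ocean Temperature; Potential Temperature, "
          "Oceans; Salinity/Density; Salinity")

for entry in flag_semicolons(sample):
    print(entry, "->", normalize_hierarchy(entry))
```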

noaaroland commented 4 years ago

This http://amb6400b.stccmop.org:8080/thredds/dodsC/model_data/forecast.html seems to have disappeared for the moment so I can't check, but I believe that the files use &gt; in the netCDF keywords attribute. At least in the example I have from when I made the change, the keywords in the netCDF source all have the entity &gt; in between instead of a > character, so they come out in the ISO XML as: `Ocean Circulation > Ocean Currents`. So maybe things look strange because of this.

emiliom commented 4 years ago

Thanks @mwengren and @noaaroland! The IOOS Catalog was down for several days a bit over a week ago, so that set me back as far as testing is concerned.

So, to keep the workflow sequence in order:

The issue of &gt; vs > encoding in the netcdf files seems like a promising avenue, but I'm still confused why the XML produced by the THREDDS server (presumably by the ncISO plugin?) looks good but the one produced by the latest stand-alone ncISO doesn't.

Again, let's not bring up the ROMS dataset at this time. That'll just confuse things.
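
On the &gt; vs > question, one possible mechanism (an assumption, not something confirmed in this thread) is double escaping: if the netCDF attribute stores the literal text `&gt;`, an XML writer will escape the ampersand again. The sketch below only uses the Python standard library to show that effect; `raw_attribute` is a made-up value and `decode_keyword` is a hypothetical helper.

```python
from html import unescape
from xml.sax.saxutils import escape

def decode_keyword(value):
    """Decode one level of entity escaping, e.g. '&gt;' becomes '>'."""
    return unescape(value)

# Hypothetical netCDF attribute value that already contains the entity text.
raw_attribute = "Ocean Circulation &gt; Ocean Currents"

# What an XML writer would emit for that literal text: the '&' gets escaped too.
serialized = escape(raw_attribute)
print(serialized)

# What a consumer sees after parsing the XML once: still the entity text.
print(decode_keyword(serialized))

# Decoding the attribute before writing the XML would give the plain '>'.
print(decode_keyword(raw_attribute))
```

If that is what is happening, decoding the attribute once before it is written to ISO XML (or storing a plain > in the netCDF file) would avoid the stray entity text.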

mwengren commented 4 years ago

@emiliom If you're able to revisit this issue again, I wanted to point out that we recently resolved the Registry -> Catalog harvesting issue (ioos/catalog-harvest-registry#134), so if you make changes to the TDS/ncISO configuration generating those XML records and issue a reharvest, you should be able to see more or less instantaneous updates on the Catalog side (let us know if you don't!).

cc @benjwadams

emiliom commented 4 years ago

Thanks, @mwengren. That'll be helpful in getting to the bottom of this issue. I'll try to get back to it next week.

emiliom commented 4 years ago

Following up on this issue.

For reference / follow-ups: IOOS Catalog dataset urls are not persistent in the long term (the catalog doesn't persist them). The urls referenced previously in this issue are no longer valid. The new urls are:

mwengren commented 3 years ago

@emiliom This may not be in your purview any longer, but I'm trying to clean up dormant issues for Catalog.

Tracking this down a bit, I believe the records for these datasets originated from this WAF, which is now empty (or filled with zero length files to be specific):

http://data.nanoos.org/metadata/ioos/thredds/

As a result, the Catalog Registry harvests are failing, and one of these two datasets (OSU ROMS) still has a metadata date from 2020-02:

https://data.ioos.us/dataset/regional-ocean-modeling-system-roms-oregon-coast

The issue still exists, but there's no way to tell if it's been resolved somewhere in the pipeline without new metadata to harvest. Can someone at NANOOS look into this?

Related point, I'm assuming I should replace Craig as the POC for all NANOOS harvests in the Registry, correct?

emiliom commented 3 years ago

I can work with @cseaton and @crisien to help resolve this once and for all. But only after our current nciso fatal issue is resolved.

mwengren commented 3 years ago

This issue was resolved by upgrading NANOOS' ncISO execution environment to Java JDK 8.