ioos / catalog

IOOS Catalog general repo for documentation and issues
https://ioos.github.io/catalog/
MIT License

Duplicate `<gmd:fileIdentifier>` fields in ISO records #68

Closed kwilcox closed 4 years ago

kwilcox commented 6 years ago

I did a little analysis on all of the ISO files in the registry and found a few issues. These are mostly related to RAs assigning dataset ids incorrectly. For ISO generated through THREDDS, the fileIdentifier in the ISO record is taken by combining the naming_authority and the id global attributes from the dataset.
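As a rough illustration of why this goes wrong, here is a minimal Python sketch. It assumes the two global attributes are joined with a colon, which is what identifiers such as `org.maracoos:avhrr.sst` suggest. The dataset names and attribute values are hypothetical, and `file_identifier` and `find_collisions` are illustrative helpers, not part of any THREDDS or catalog tooling.

```python
from collections import defaultdict


def file_identifier(naming_authority: str, dataset_id: str) -> str:
    """Join the two global attributes (assumed format: naming_authority:id)."""
    return f"{naming_authority}:{dataset_id}"


def find_collisions(datasets: dict) -> dict:
    """Return only the fileIdentifiers shared by more than one dataset."""
    by_identifier = defaultdict(list)
    for name, attrs in datasets.items():
        ident = file_identifier(attrs["naming_authority"], attrs["id"])
        by_identifier[ident].append(name)
    return {
        ident: sorted(names)
        for ident, names in by_identifier.items()
        if len(names) > 1
    }


# Hypothetical datasets: two yearly aggregations that reuse the same `id`.
datasets = {
    "example.2012.agg": {"naming_authority": "org.example", "id": "example.sst"},
    "example.2013.agg": {"naming_authority": "org.example", "id": "example.sst"},
    "example.other": {"naming_authority": "org.example", "id": "example.other"},
}

collisions = find_collisions(datasets)
for ident, names in collisions.items():
    print(f"{ident} is shared by: {', '.join(names)}")
```

Any dataset that reuses an `id` under the same `naming_authority` ends up with the same fileIdentifier, no matter how its file is named.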

GLOS @tslawecki

MARACOOS @brianmckenna

Many of the satellite datasets suffer from the same issue as described for GLOS. For example, both AVHRR.2012.7Agg.xml and AVHRR.2013.7Agg.xml end up with the fileIdentifier of org.maracoos:avhrr.sst.

AVHRR.2012.7Agg.xml
AVHRR.2013.1Agg.xml
AVHRR.2013.3Agg.xml
AVHRR.2013.7Agg.xml
AVHRR.2013.Masked.Agg.xml
AVHRR.2014.1Agg.xml
AVHRR.2014.3Agg.xml
AVHRR.2014.7Agg.xml
AVHRR.2014.Masked.Agg.xml
AVHRR.2014.Unmasked.Agg.xml
AVHRR.2015.1Agg.xml
AVHRR.2015.3Agg.xml
AVHRR.2015.7Agg.xml
AVHRR.2015.Masked.Agg.xml
AVHRR.2015.Unmasked.Agg.xml
AVHRR.2016.1Agg.xml
AVHRR.2016.3Agg.xml
AVHRR.2016.7Agg.xml
AVHRR.2016.Masked.Agg.xml
AVHRR.2016.Unmasked.Agg.xml
AVHRR.2017.1Agg.xml
AVHRR.2017.3Agg.xml
AVHRR.2017.7Agg.xml
AVHRR.2017.Masked.Agg.xml
AVHRR.2017.Unmasked.Agg.xml
MURSST.2014.Agg.xml
MURSST.2015.Agg.xml
MURSST.2016.Agg.xml
MURSST.2017.Agg.xml
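A check like this can be sketched in Python by reading the `<gmd:fileIdentifier>` from each ISO record and grouping filenames by its value. This is a minimal sketch, not the script used for the analysis above. It assumes the identifier sits in a `gco:CharacterString` directly under the record's root element, and the XML built by `make_record` is a stripped-down stand-in for a real ISO record.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Standard ISO 19139 namespaces.
NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}


def read_file_identifier(xml_text: str):
    """Return the fileIdentifier text of one ISO record, or None if absent."""
    root = ET.fromstring(xml_text)
    node = root.find("gmd:fileIdentifier/gco:CharacterString", NS)
    if node is None or not node.text:
        return None
    return node.text.strip()


def duplicate_identifiers(records: dict) -> dict:
    """Map each fileIdentifier used more than once to the files that use it."""
    seen = defaultdict(list)
    for filename, xml_text in records.items():
        ident = read_file_identifier(xml_text)
        if ident is not None:
            seen[ident].append(filename)
    return {i: sorted(files) for i, files in seen.items() if len(files) > 1}


def make_record(identifier: str) -> str:
    """Build a minimal stand-in ISO record (illustration only)."""
    return (
        '<gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd" '
        'xmlns:gco="http://www.isotc211.org/2005/gco">'
        f"<gmd:fileIdentifier><gco:CharacterString>{identifier}"
        "</gco:CharacterString></gmd:fileIdentifier></gmd:MD_Metadata>"
    )


records = {
    "AVHRR.2012.7Agg.xml": make_record("org.maracoos:avhrr.sst"),
    "AVHRR.2013.7Agg.xml": make_record("org.maracoos:avhrr.sst"),
    "hypothetical.xml": make_record("org.example:unique"),
}

duplicates = duplicate_identifiers(records)
for ident, files in duplicates.items():
    print(ident, "->", files)
```

For a real WAF, the `records` dictionary would be filled from the downloaded XML files instead of `make_record`.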

NERACOOS @ebridger

There is a conflict between the fileIdentifiers of the Realtime and the Historic Realtime datasets. I can see how this would be done on purpose, but if that was the case, is there a reason to have both the Realtime and the Historic Realtime in the WAF? For example, these two ISO files have the same fileIdentifier:

rsignell-usgs commented 6 years ago

@kwilcox, fantastic! We need this kind of feedback to providers so we can make the catalog even better!

ebridger commented 6 years ago

@kwilcox The Realtime WAF was created quite a few years ago specifically for the Sensor Map https://sensors.ioos.us/# (in detailed consultation with Axiom), since the scraper/crawler was overwhelming our THREDDS server by requesting observations from historical buoys that are no longer deployed. I'd be glad to remove that WAF if the sensor map harvester no longer requires it.

kwilcox commented 6 years ago

@ebridger That makes sense. You have a separate set of files that you set up for Realtime data access so that the HistoricRealtime datasets were not overloaded. I don't recall this conversation, but if the data is showing up and you are happy with it, I'm not going to open that discussion back up!

https://data.ioos.us/dataset?q=%22A01+ACCELEROMETER%22&sort=score+desc%2C+metadata_modified+desc&ext_bbox=&ext_prev_extent=-154.68749999999997%2C-80.17871349622823%2C154.68749999999997%2C80.17871349622823

I see (2) records with the same ERDDAP endpoint (probably a different issue) and (1) record that is the HistoricRealtime THREDDS endpoint. The Realtime dataset isn't in the catalog (or I can't find it), most likely due to the conflict in fileIdentifier. You could change the id of the real-time only files to be unique but it's probably not a huge deal if you really only want the HistoricRealtime dataset in there.

rsignell-usgs commented 6 years ago

We definitely want the Realtime WAF ingested, right?
It makes sense to change the id for the Realtime ISO records so they don't conflict with the HistoricalRealtime ISO records.

mwengren commented 6 years ago

@kwilcox thanks for the report!

BTW, for @tslawecki @brianmckenna and @ebridger, the place to find issues like the fileIdentifier conflicts Kyle mentions is in the Harvest Registry, click on the 'View CKAN Job Status' button:

[Image: ckan_job]

This is the only place where we can report fileIdentifier conflicts, as the Registry will accept them, but CKAN will not. Anything that shows up in this list as an error does not make it to Catalog.

ebridger commented 6 years ago

I decided to keep the Realtime WAF. One issue is that the `id` is a NetCDF global attribute and the historical realtime aggregations are really 2 files: the historical file and the latest realtime deployment file. The `id` is the same in both files. The realtime THREDDS catalog only references the realtime files. So the fix was to use NcML, only in the realtime catalog, to override the `id` global attribute by appending `-realtime` to the id value. I've regenerated the WAF. Not sure if I need to force a catalog re-harvest or if the catalog will pick it up automatically.
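For anyone wanting to do something similar, the override can be sketched as below. This is a minimal Python sketch that generates an NcML fragment of the general shape described, not the actual NERACOOS configuration. The file location and the original `id` value are hypothetical, `realtime_ncml` is an illustrative helper, and how the fragment is wired into a THREDDS catalog is left out.

```python
import xml.etree.ElementTree as ET

# NcML 2.2 namespace.
NCML_NS = "http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"


def realtime_ncml(location: str, original_id: str, suffix: str = "-realtime") -> str:
    """Build an NcML wrapper that overrides the `id` global attribute."""
    ET.register_namespace("", NCML_NS)  # serialize NcML as the default namespace
    netcdf = ET.Element(f"{{{NCML_NS}}}netcdf", {"location": location})
    # An <attribute> element at this level replaces the file's global attribute.
    ET.SubElement(
        netcdf,
        f"{{{NCML_NS}}}attribute",
        {"name": "id", "value": original_id + suffix},
    )
    return ET.tostring(netcdf, encoding="unicode")


# Hypothetical file location and id value, for illustration only.
ncml = realtime_ncml("realtime/A01.accelerometer.realtime.nc", "A01.accelerometer")
print(ncml)
```

Because only the realtime catalog applies the wrapper, the historical aggregation keeps its original `id` and the two ISO records no longer share a fileIdentifier.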

brianmckenna commented 6 years ago

MARACOOS WAF has been updated. Should see unique IDs soon.

benjwadams commented 5 years ago

This appears to be fixed for MARACOOS.

mwengren commented 5 years ago

@kwilcox Any chance you can easily confirm whether this has been resolved (at least for NERACOOS and MARACOOS)? Not sure of GLOS's status.

kwilcox commented 5 years ago

I had a nice little script that tested all of this but I can't find it... so no, I can't easily confirm, sorry!

benjwadams commented 5 years ago

I'm crafting an updated release, so I'm going to move this into the next milestone.

mwengren commented 4 years ago

I think we can close this one out at long last. If further issues arise, we'll deal with them as they come up.