ioos / catalog

IOOS Catalog general repo for documentation and issues
https://ioos.github.io/catalog/
MIT License

Create/update https://data.ioos.us/waf #72

Closed mwengren closed 4 years ago

mwengren commented 4 years ago

Presently, there's a static set of subdirectory WAFs from an old version (2016 and earlier) of the IOOS Catalog hosted at https://data.ioos.us/waf/. Because of some changes with NOAA's Data Catalog systems, we need to provide them a regularly-updated dump of all our dataset ISO XML files in a single WAF for harvest (until they are able to read our CS-W service).

Can we write a simple harvester script to either read our CS-W service or copy the source XML records from https://registry.ioos.us/waf and replace this entire directory of old stuff?

@benjwadams
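A minimal sketch of the "copy the source XML records" option, assuming the WAF is served as a plain HTML directory listing whose links point at the `.xml` files. The function name `xml_links` and the example URL are hypothetical and only illustrate the idea; fetching the listing and downloading each file are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a directory listing."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def xml_links(index_html, base_url):
    """Return absolute URLs for the .xml files linked from a WAF index page.

    Assumption: the WAF index is ordinary HTML and record links end in ".xml".
    Subdirectory links (ending in "/") are ignored here; a real harvester
    would need to recurse into them.
    """
    parser = _LinkCollector()
    parser.feed(index_html)
    return [
        urljoin(base_url, href)
        for href in parser.hrefs
        if href.lower().endswith(".xml")
    ]
```

Each returned URL could then be fetched with `urllib.request.urlopen` and written into the single target directory.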

benjwadams commented 4 years ago

What's the issue with harvesting from the current CSW we have running?

mwengren commented 4 years ago

Unfortunately, they're doing some re-tooling and the new tools don't yet support CS-W harvest. They're actually moving away from CKAN and replacing the web interface with a new in-house system called 'OneStop'. You can see it here: https://data.noaa.gov/onestop/. It has many different tools under the hood, such as ElasticSearch for indexing. It's a slightly different take from the UI perspective than CKAN.

We have to work with what they're offering if we want to keep our harvesting process alive. They plan to support CS-W harvest, but it's still in the backlog.

We should also replace this content anyway: https://data.ioos.us/waf/. I think there's value for us in trying to maintain a single WAF at that URL as well. We can talk about level of effort, because as a workaround they can use https://registry.ioos.us/waf for now. If it's not worth the time, this may turn into just deleting all those old metadata records.

benjwadams commented 4 years ago

OK, no problem. I should be able to export any records from PyCSW to a folder using its export functionality: https://pycsw.org/faq/#how-do-i-export-my-repository

mwengren commented 4 years ago

Great, let's just do that since it's simple and wipe the old records and replace with that.

Let's put this at the top of the Catalog issue list for whenever you get back to working on it.

mwengren commented 4 years ago

For reference, this is what the new IOOS Catalog -> NOAA 'OneStop' Catalog -> Data.gov harvesting workflow that we need to support looks like:

[Image: NOAA Data Working Group Update]

Here's an example of what a CARICOOS record looks like in OneStop once harvested. Hopefully they'll be parsing and displaying more fields soon.

[Images: XCUL_MET_Historic_Realtime_Agg-1, XCUL_MET_Historic_Realtime_Agg-2]

Same record in IOOS Catalog:

[Image: Screenshot]

benjwadams commented 4 years ago

I tried exporting through pycsw-admin. Unfortunately, it looks like the currently running version of the code attempts to load all the records into memory before exporting to XML. For a small CSW deployment this would be OK, but with the current size of our data inventory, exporting everything at once causes problems because of the large number of records. I'll continue looking for workarounds.
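One possible workaround is to read the records in fixed-size batches instead of all at once. The sketch below is not the pycsw export code; it assumes the repository is a SQL table named `records` with `identifier` and `xml` columns, and it uses `sqlite3` only so the example is self-contained. The function name `iter_records` and the batch size are hypothetical.

```python
import sqlite3
from typing import Iterator, Tuple


def iter_records(conn: sqlite3.Connection,
                 batch_size: int = 500) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, xml) pairs while holding one batch in memory at a time.

    Assumption: a table called "records" with "identifier" and "xml" columns.
    """
    cur = conn.cursor()
    cur.execute("SELECT identifier, xml FROM records ORDER BY identifier")
    while True:
        rows = cur.fetchmany(batch_size)  # at most batch_size rows per fetch
        if not rows:
            break
        yield from rows
```

Against a PostgreSQL-backed deployment, the same loop would likely need a server-side cursor (for example, a named cursor in psycopg2) so that the driver does not buffer the full result set on the client.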

benjwadams commented 4 years ago

Created a job to handle this. I will push up to one of the catalog repos momentarily and then close this out.

mwengren commented 4 years ago

@benjwadams Was looking at Catalog/Registry this morning and noticed the source metadata 'stations' WAF disappeared sometime since Monday: https://data.ioos.us/stations/waf/. Can you look at the nginx config again and restore it?

Also, I'm going to pass on the https://data.ioos.us/waf URL to the NOAA Catalog/OneStop team to harvest. Even if we're not ready to close this issue out, since it's there already it should work for their purposes for testing.

benjwadams commented 4 years ago

Stations WAF has been restored.

benjwadams commented 4 years ago

There is now a script running that loads the CSW contents from the database into a WAF. The PyCSW admin command tried to do this all at once, and the size of the requests overloaded the server. The script I added compares the md5sum of any metadata XML file that already exists in the WAF against the md5sum of the XML contents of the corresponding record in the database. This is running successfully so far, and I will close out this issue shortly, once I have added this script to revision control in one of the repos.
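The md5sum comparison could look roughly like the following. This is an illustrative sketch, not the committed script: it assumes one `<identifier>.xml` file per record, that a file is rewritten only when the hashes differ, and that identifiers need sanitizing before use as file names. The names `sync_record` and `safe_name` are hypothetical.

```python
import hashlib
import re
from pathlib import Path


def safe_name(identifier: str) -> str:
    """Map a record identifier to a file-system-safe name (assumed convention)."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", identifier)


def sync_record(waf_dir: Path, identifier: str, xml: str) -> bool:
    """Write the record to the WAF only if it is missing or its md5 differs.

    Returns True if the file was written, False if it was already up to date.
    """
    target = waf_dir / f"{safe_name(identifier)}.xml"
    payload = xml.encode("utf-8")
    new_md5 = hashlib.md5(payload).hexdigest()
    if target.exists():
        old_md5 = hashlib.md5(target.read_bytes()).hexdigest()
        if old_md5 == new_md5:
            return False  # unchanged, leave the existing file alone
    target.write_bytes(payload)
    return True
```

Skipping unchanged files keeps modification times stable between runs, which may help a downstream harvester that relies on them to detect updates.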

benjwadams commented 4 years ago

Implemented by https://github.com/ioos/catalog-docker-base/commit/d3660cedb53b78158c3e9ee0be1a488b23a632a6 and added to cron job, closing issue.