ioos / ckanext-ioos-theme

IOOS Catalog as a CKAN extension
GNU Affero General Public License v3.0

Determine Harvest Pathway to Data.gov #85

Closed mwengren closed 5 years ago

mwengren commented 7 years ago

Determine the pathway for IOOS Catalog records to be listed in Data.gov. Choice is between direct harvest to Data.gov and intermediate harvest by NOAA Data Catalog (data.noaa.gov) first.

The choice affects how IOOS Catalog records would be reflected in Data.gov.

DMAC Architecture team to discuss.

mwengren commented 7 years ago

Email out to NOAA Catalog group:

NOAA Catalog group,

We would like to set up a harvest of the IOOS Catalog (https://data.ioos.us/) by the NOAA Catalog. There are a couple different ways this could be accomplished.

Since the IOOS Catalog is also CKAN/pycsw based, there is an integrated CS-W service that could be harvested as a single harvest job. This would be the simplest approach to set this up. The CS-W URL is: https://data.ioos.us/csw?request=GetCapabilities&service=CSW&version=2.0.2.

We also maintain a set of WAFs with identical metadata (which is the source for our own Catalog). If the CS-W harvest doesn't suffice from the NOAA Catalog perspective, these could be harvested as a backup. The WAFs are found here: https://registry.ioos.us/waf/.

In either case, an Organization should be set up to own the harvest, similar to: https://data.ioos.us/organization/about/ioos.

Please advise how to proceed.

mwengren commented 7 years ago

Response from NOAA Catalog admin:

I don't have a problem with trying CS/W. If there aren't any other issues that are out of my control, I should be able to complete moving the Catalog to NCEI-CO. If we could try it there first that would be best.

Chris

mwengren commented 7 years ago

Copying from the email thread on this topic to keep this issue up to date. It seems the CKAN CS-W harvesters are unable to retrieve records (perhaps because of the latest pycsw release, 1.10.5, which both the IOOS and NOAA Catalogs run). Data.gov is also unable to harvest from the NOAA Catalog due to this issue.

Here's the end goal:

IOOS Catalog ---- CS-W ---> NOAA Catalog ----- CS-W ------> Data.gov.

Neither of the CS-W links currently works.

From email thread:

2017-08-02 21:26:32,052 INFO [ckanext.harvest.queue] Received harvest object id: bf6b5736-d70a-4316-ba2a-43540cfc24a9
2017-08-02 21:26:32,282 DEBUG [ckanext.spatial.harvesters.csw.CSW.fetch] CswHarvester fetch_stage for object: bf6b5736-d70a-4316-ba2a-43540cfc24a9
2017-08-02 21:26:32,527 INFO [ckanext.spatial.lib.csw_client] Making CSW request: getrecordbyid [u'sldmb_43181_agg'] {'esn': 'full', 'outputschema': 'http://www.isotc211.org/2005/gmd'}

https://data.ioos.us/csw?OUTPUTFORMAT=application%2Fxml&SERVICE=CSW&OUTPUTSCHEMA=http%3A%2F%2Fwww.isotc211.org%2F2005%2Fgmd&REQUEST=GetRecordById&VERSION=2.0.2&ID=sldmb_43181_agg

My hunch about this is that CKAN somehow isn't forming the GetRecordById request properly to comply with the latest pycsw 1.10.5 release, since the link above clearly works.
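For anyone reproducing the working GET request against other record IDs, here is a minimal sketch that rebuilds the same key-value-pair URL with the Python standard library. The function name is hypothetical, and the `ELEMENTSETNAME` parameter is an assumption added to mirror the `'esn': 'full'` value in the harvester log; it is not part of the link above.

```python
from urllib.parse import urlencode

# Endpoint and schema are the ones quoted in this thread.
CSW_ENDPOINT = "https://data.ioos.us/csw"
GMD_SCHEMA = "http://www.isotc211.org/2005/gmd"


def build_get_record_by_id_url(record_id, endpoint=CSW_ENDPOINT):
    """Return a CSW 2.0.2 GetRecordById URL (HTTP GET, KVP encoding).

    Hypothetical helper for troubleshooting; it only builds the URL
    and does not contact the server.
    """
    params = {
        "SERVICE": "CSW",
        "VERSION": "2.0.2",
        "REQUEST": "GetRecordById",
        "ID": record_id,
        # Assumption: mirrors 'esn': 'full' from the harvester log.
        "ELEMENTSETNAME": "full",
        "OUTPUTSCHEMA": GMD_SCHEMA,
        "OUTPUTFORMAT": "application/xml",
    }
    # urlencode percent-escapes the schema URL and the MIME type.
    return endpoint + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_get_record_by_id_url("sldmb_43181_agg"))
```

Pasting the printed URL into a browser or `curl` should be equivalent to following the link above, which makes it easy to spot-check other records that the harvester reports as failing.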

If I remember correctly, the CKAN harvester uses HTTP POST rather than GET, so it may be difficult to diagnose looking at the server logs on our side. How can we see the POST content for troubleshooting? Would this require a TCP network sniffer or something?
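One alternative to sniffing traffic is to reproduce the POST by hand and compare the server's response. The sketch below builds a GetRecordById body following the CSW 2.0.2 XML encoding, using the record ID, element set name, and output schema from the harvester log. It is an approximation only: the function name is hypothetical, and the exact bytes that the CKAN harvester sends may differ.

```python
import xml.etree.ElementTree as ET

# Namespace of the CSW 2.0.2 schema; the output schema is from the log above.
CSW_NS = "http://www.opengis.net/cat/csw/2.0.2"
GMD_SCHEMA = "http://www.isotc211.org/2005/gmd"


def build_get_record_by_id_body(record_id, esn="full", output_schema=GMD_SCHEMA):
    """Return a CSW 2.0.2 GetRecordById request body as an XML string.

    Hypothetical helper: an approximation of the POST payload, not a
    capture of what the harvester actually sends.
    """
    ET.register_namespace("csw", CSW_NS)
    root = ET.Element(
        "{%s}GetRecordById" % CSW_NS,
        {
            "service": "CSW",
            "version": "2.0.2",
            "outputSchema": output_schema,
            "outputFormat": "application/xml",
        },
    )
    # One csw:Id element per requested record.
    ET.SubElement(root, "{%s}Id" % CSW_NS).text = record_id
    ET.SubElement(root, "{%s}ElementSetName" % CSW_NS).text = esn
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_get_record_by_id_body("sldmb_43181_agg"))
```

The printed XML can be sent to the CS-W endpoint with any HTTP client using a `Content-Type: application/xml` header. If the server returns the record for this body, that would suggest the problem lies on the client side rather than in how pycsw handles POST requests.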

mwengren commented 6 years ago

Latest status update from NOAA Catalog group on the harvesting troubleshooting:

10/17/2017

Hi Micah,

Thanks. I've been out of the office on leave and I won't be back until after the 19th. Yes, I'll let you know the schedule for that. I still don't have harvesting working with Data.gov. I dug into the harvesting a bit when I was testing it myself. The problem was in owslib. The request to the remote CSW was correct, but the record was being rejected by the parsing in owslib. The harvesting extension uses a custom version of owslib that hasn't been updated in a while. When I'm back in the office I'll look again at harvesting from IOOS.

On Thu, Oct 12, 2017 at 6:26 AM, Micah Wengren wrote:

Chris,

Congrats on migrating it in house to NCEI! Hopefully they will allow for incremental enhancements now that it is running operationally there.

A related question: has there been any progress with Data.gov on the CS-W harvesting issue? Last I knew there were still problems, possibly related to the latest pycsw release not working with CKAN's harvesters somehow. We'd like to take care of the IOOS Catalog -> NOAA Catalog harvesting at some point, so let me know if there is anything I can do to help test the harvest from NOAA if you want to revisit that.

mwengren commented 5 years ago

Related ckanext-spatial issue - CS-W harvest bug: https://github.com/ckan/ckanext-spatial/issues/209

benjwadams commented 5 years ago

What's the current status of this?

mwengren commented 5 years ago

I checked my email history. Looks like we're waiting for NOAA Catalog to upgrade to fix the special character/escape issue (although that shouldn't be a blocker for closing this). Also, Data.gov isn't doing regular harvests of NOAA Catalog yet. I think they were going to do a single harvest to pull in the new IOOS metadata, but I'm not sure that's happened yet.

mwengren commented 5 years ago

Data.gov is now harvesting from the NOAA Catalog on an as-needed basis. IOOS metadata has already been harvested as of 2/27. Data.gov will harvest NOAA Catalog weekly at some point in the future.

Closing this issue finally!