ioos / ckanext-ioos-theme

IOOS Catalog as a CKAN extension
GNU Affero General Public License v3.0

Harvest troubleshooting: USGS CS-W #237

Open mwengren opened 1 year ago

mwengren commented 1 year ago

USGS: https://data.ioos.us/harvest/usgs-cs-w/job/last

Example error messages:

1   Empty record for GUID owslib_random_45948
1   Error getting the CSW record with GUID 53a1c19ee4b0403a44154562
1   Empty record for GUID owslib_random_55332
1   Empty record for GUID owslib_random_62208
1   Empty record for GUID owslib_random_6633
1   Empty record for GUID owslib_random_54992
1   Error getting the CSW record with GUID 52f95fdfe4b05b4bef798f16
1   Error getting the CSW record with GUID 535a5e37e4b0d08644962764

Related issue:

What impacts are there to datasets harvested if the harvest type is CS-W instead of IOOS WAF?

Do we know what the IOOS WAF type does on top of regular CKAN 'WAF'?

mwengren commented 1 year ago

Related ckanext-spatial issue: https://github.com/ckan/ckanext-spatial/issues/310 (from last Catalog meeting notes)

How can we make progress on this CS-W harvest problem if the upstream issue is stalled? Any workarounds?

@benjwadams

benjwadams commented 6 months ago

https://www.sciencebase.gov/catalog/item/579b64c6e4b0589fa1c98118/csw?service=CSW&version=2.0.2&request=GetRecordById&id=5df54bf0-3a7d-44bf-9abf-84d772da8df1&elementsetname=full&outputSchema=http://www.isotc211.org/2005/gmd

The output is heavily truncated, and no usable metadata can be gathered from it. Are there other parameters that would return more detailed metadata? For now, this looks like a provider issue.

mwengren commented 6 months ago

@benjwadams Where did the ID you used in your example query come from?

I tried to trace the CS-W service a bit manually, and got better results from testing a few different IDs than the one you posted.

First, I ran this query: https://www.sciencebase.gov/catalog/item/579b64c6e4b0589fa1c98118/csw?service=CSW&version=2.0.2&request=GetRecordById

This returns a full set of records in the CS-W, from which you can grab individual IDs from the field to test in a narrower GetRecordById query similar to the one you posted.

For example, here are a few ISO XMLs from the same query you posted but using the first two dataset IDs from the above query:

  1. ID: 59eb3c97e4b0026a55ffbf47 - https://www.sciencebase.gov/catalog/item/579b64c6e4b0589fa1c98118/csw?service=CSW&version=2.0.2&request=GetRecordById&id=59eb3c97e4b0026a55ffbf47&elementsetname=full&outputSchema=http://www.isotc211.org/2005/gmd
  2. ID: 59eb3c97e4b0026a55ffbf51 - https://www.sciencebase.gov/catalog/item/579b64c6e4b0589fa1c98118/csw?service=CSW&version=2.0.2&request=GetRecordById&id=59eb3c97e4b0026a55ffbf51&elementsetname=full&outputSchema=http://www.isotc211.org/2005/gmd
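The spot checks above can be reproduced with a short script. This is only a minimal sketch that rebuilds the GetRecordById URLs from this thread. The helper names `get_record_by_id_url` and `fetch_record` are illustrative and are not part of CKAN, ckanext-spatial, or OWSLib.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# ScienceBase CS-W endpoint used throughout this thread.
CSW_ENDPOINT = "https://www.sciencebase.gov/catalog/item/579b64c6e4b0589fa1c98118/csw"


def get_record_by_id_url(record_id, endpoint=CSW_ENDPOINT):
    """Build a GetRecordById URL with the same parameters used in the thread."""
    params = {
        "service": "CSW",
        "version": "2.0.2",
        "request": "GetRecordById",
        "id": record_id,
        "elementsetname": "full",
        "outputSchema": "http://www.isotc211.org/2005/gmd",
    }
    # safe=":/" leaves the outputSchema URL unescaped, matching the manual queries.
    return endpoint + "?" + urlencode(params, safe=":/")


def fetch_record(record_id, timeout=30):
    """Fetch the raw XML response for one record (hypothetical helper, needs network)."""
    with urlopen(get_record_by_id_url(record_id), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# The two dataset IDs tested above.
urls = [
    get_record_by_id_url(record_id)
    for record_id in ("59eb3c97e4b0026a55ffbf47", "59eb3c97e4b0026a55ffbf51")
]
```

Calling `fetch_record` on each ID and checking the size of the response is a quick way to separate truncated records from complete ones.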

How is the CS-W harvester working and why isn't it querying using the appropriate IDs? The records appear to be there, from a few quick spot checks above.

We need to troubleshoot the CS-W harvester and underlying libraries like OWSlib further, I think.

benjwadams commented 6 months ago

This appears to be a provider issue. CKAN is sending capitalized parameter names, which should work in the case of CSW 2.0.2 and GetRecordById. See footnote b in the image below.

image

See page 161 of the OGC CSW 2.0.2 specification: https://portal.ogc.org/files/?artifact_id=20555

Thus, your examples posted here do in fact return appropriate data, but changing the case of the "id" parameter to "ID" causes the request to fail, in violation of the CSW specification:

https://www.sciencebase.gov/catalog/item/579b64c6e4b0589fa1c98118/csw?service=CSW&version=2.0.2&request=GetRecordById&ID=59eb3c97e4b0026a55ffbf47&elementsetname=full&outputSchema=http://www.isotc211.org/2005/gmd

I reset job reports, but ckanext-spatial appeared to be uppercasing the parameter names and percent-escaping certain special characters. Various permutations of this work fine until the case of the id parameter is changed.
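One way to isolate which parameter name is case-sensitive is to generate one URL per parameter with only that name uppercased. The sketch below does that for the working query from this thread. The function names are hypothetical, and it only builds the URLs; sending them is left to the reader.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def uppercase_one_param(url, name):
    """Return the URL with only the query parameter `name` uppercased."""
    parts = urlsplit(url)
    pairs = [
        (key.upper() if key == name else key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    # safe=":/" keeps the outputSchema value unescaped, as in the manual queries.
    return urlunsplit(parts._replace(query=urlencode(pairs, safe=":/")))


def case_permutations(url):
    """Map each parameter name to a URL variant where only that name is uppercased."""
    names = [key for key, _ in parse_qsl(urlsplit(url).query, keep_blank_values=True)]
    return {name: uppercase_one_param(url, name) for name in names}


# Known-good query from earlier in the thread.
working_url = (
    "https://www.sciencebase.gov/catalog/item/579b64c6e4b0589fa1c98118/csw"
    "?service=CSW&version=2.0.2&request=GetRecordById"
    "&id=59eb3c97e4b0026a55ffbf47&elementsetname=full"
    "&outputSchema=http://www.isotc211.org/2005/gmd"
)
variants = case_permutations(working_url)
```

Here `variants["id"]` is the failing URL shown above, and the other entries are the permutations that still return data.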

mwengren commented 5 months ago

@benjwadams How difficult would it be to change the CKAN harvesting plugin to send the parameters with the capitalization that USGS ScienceBase expects?

mwengren commented 5 months ago

I'd like to try to close the loop on resolving this issue before too much more time passes. We'd like to be able to harvest the MBON and other IOOS-related bio data that's in OBIS into the IOOS Data Catalog.

The pathway we've used before has been the CS-W service provided by ScienceBase, specifically: https://www.sciencebase.gov/catalog/item/579b64c6e4b0589fa1c98118/csw?service=CSW&version=2.0.2&request=GetCapabilities

@benjwadams has done some troubleshooting above and noticed an issue that the id parameter in the ScienceBase CS-W is case-sensitive, when according to the spec, it technically shouldn't be. Example:

CKAN's default behavior is to uppercase the ID parameter, unfortunately.

A workaround is to change the ckanext-spatial plugin code to lowercase that parameter (most likely the easiest fix, I expect), but it's worth investigating whether this might be fixed on the ScienceBase side instead.
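As a rough illustration of that workaround, the sketch below rewrites a request URL so the `ID` parameter name is lowercased before the request is sent. `lowercase_id_param` is a hypothetical helper, not existing ckanext-spatial or OWSLib code, and where such a hook would belong in the harvester is an open question.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def lowercase_id_param(url):
    """Rename an `ID`/`Id` query parameter to `id`, leaving everything else alone."""
    parts = urlsplit(url)
    pairs = [
        ("id" if key.lower() == "id" else key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    # safe=":/" keeps the outputSchema value unescaped.
    return urlunsplit(parts._replace(query=urlencode(pairs, safe=":/")))


# Failing query from this thread (uppercase ID).
failing_url = (
    "https://www.sciencebase.gov/catalog/item/579b64c6e4b0589fa1c98118/csw"
    "?service=CSW&version=2.0.2&request=GetRecordById"
    "&ID=59eb3c97e4b0026a55ffbf47&elementsetname=full"
    "&outputSchema=http://www.isotc211.org/2005/gmd"
)
fixed_url = lowercase_id_param(failing_url)
```

Only the `id` name is touched because, per the testing above, the other parameter names were accepted in either case.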

CC'ing @MathewBiddle @laurabrenskelle and @sformel-usgs in case you can help with the OBIS/ScienceBase connection and also if you have suggestions about anything we should be doing differently to get the MBON data from OBIS into IOOS Catalog.

The CS-W root service URL (https://www.sciencebase.gov/catalog/item/579b64c6e4b0589fa1c98118/csw?) was provided to us years ago by Abby Benson, so it's possible that it also needs to be changed or updated for us to get the right data. I'm not 100% certain that the results are what we're expecting, so it would be good to confirm that we have the right service endpoint as well.

MathewBiddle commented 5 months ago

cc @albenson-usgs

MathewBiddle commented 5 months ago

@albenson-usgs and @sformel-usgs, would either of you be able to get us in contact with the maintainer of the CS-W service for ScienceBase? Or, a ScienceBase POC where we can start the conversation? That seems like a logical connection we should make to resolve this issue.

albenson-usgs commented 5 months ago

The best way to get in touch with ScienceBase is sciencebase@usgs.gov. But not all the MBON datasets are in ScienceBase, since some folks load them directly to the IPT themselves (CeNCOOS, for example), so perhaps it's time to switch to harvesting directly from OBIS (https://obis.org/institute/23070)? I'm also trying to think whether there might be any IOOS datasets in OBIS that are not labeled MBON. None are coming to mind right now, but it's not impossible, right? Does IOOS consider any bio data coming out of the IOOS RAs as MBON? Will that always be the case? We could also consider whether it's possible to harvest from the IPT. Perhaps a call to talk through these things could be helpful.

albenson-usgs commented 5 months ago

I guess at the moment you are harvesting any datasets, not just IOOS mediated ones, so it would be worth considering what the exact goal of the harvesting is because that will help define where the best place to do the harvesting from is.

laurabrenskelle commented 5 months ago

@albenson-usgs

> Does IOOS consider any bio data coming out of the IOOS RAs as MBON?

The answer to this is no. This is one reason we are discussing having PIDs for the RAs, so we can better pinpoint what is "IOOS" data in OBIS, but I think that is tangential to this issue.

mwengren commented 5 months ago

My understanding is that we'd like to include as much IOOS Marine Life data as possible in the IOOS Catalog.

Up to this point, that's been limited to what's been available via the ScienceBase CS-W service, which I don't think was expected to be comprehensive beyond the MBON datasets it happened to include; it was only what we knew of at the time. Those datasets have also always been labeled USGS in the IOOS Catalog organization scheme, for a reason I don't recall.

I think there's definitely willingness to improve on the representation of IOOS Marine Life data in the Catalog, however we might go about that. Open to all suggestions!

What are the mechanisms to harvest directly from OBIS/https://obis.org/institute/23070?

If it's more appropriate to create a new issue for a harvest process unrelated to ScienceBase, we might want to do that instead - we try to log general Catalog functionality issues in the ioos/catalog repo FWIW.

sformel-usgs commented 4 months ago

> What are the mechanisms to harvest directly from OBIS/https://obis.org/institute/23070?

I don't know of any direct harvesting methods outside of an API call. I've pinged Pieter Provoost to see if he has any other ideas.

sformel-usgs commented 3 months ago

@mwengren Ok, Pieter has suggested that the RSS feed is more reliable than the OBIS API. He said, "in OBIS sometimes we have to manually connect datasets to providers if the contact details don't match exactly with previous entries. Using the IPT feed directly will give you more control."

He pointed out that the RSS feed (https://ipt-obis.gbif.us/rss.do) includes a direct link to the EML, so you can crawl the metadata without having to download the entire DwC-A. I suppose from there you can build a whitelist, or crawl all the EML every time.
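A crawl along those lines could start from something like the sketch below, which pulls the EML link out of each feed item and optionally filters by dataset title. It rests on assumptions that should be checked against the live feed: that each `<item>` carries an `<ipt:eml>` element and that the `ipt` prefix maps to `http://ipt.gbif.org/`. The sample feed and its `example.org` URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace URI bound to the `ipt` prefix in the IPT RSS feed.
IPT_NS = {"ipt": "http://ipt.gbif.org/"}

# Made-up sample standing in for https://ipt-obis.gbif.us/rss.do
SAMPLE_FEED = """<rss version="2.0" xmlns:ipt="http://ipt.gbif.org/">
  <channel>
    <title>Example IPT</title>
    <item>
      <title>Dataset A</title>
      <ipt:eml>https://example.org/eml.do?r=dataset_a</ipt:eml>
    </item>
    <item>
      <title>Dataset B</title>
      <ipt:eml>https://example.org/eml.do?r=dataset_b</ipt:eml>
    </item>
    <item>
      <title>No EML here</title>
    </item>
  </channel>
</rss>"""


def eml_links(rss_text, allowed_titles=None):
    """Return (title, eml_url) pairs, optionally restricted to allowed_titles."""
    root = ET.fromstring(rss_text)
    results = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        eml = item.findtext("ipt:eml", namespaces=IPT_NS)
        if not eml:
            continue  # skip items without an EML link
        if allowed_titles is not None and title not in allowed_titles:
            continue  # skip datasets outside the whitelist
        results.append((title, eml.strip()))
    return results


all_links = eml_links(SAMPLE_FEED)
filtered_links = eml_links(SAMPLE_FEED, allowed_titles={"Dataset B"})
```

Passing `allowed_titles` covers the whitelist option, and leaving it as `None` covers crawling all the EML every time.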

mwengren commented 3 months ago

@sformel-usgs Thanks for the suggestions! I think this will probably end up being a longer-term effort as far as OBIS/MBON data is concerned. We can do some digging as far as CKAN's support for other XML metadata/data formats than ISO XML (essentially the only incoming metadata format we're currently supporting in the Catalog).

On a high-level IOOS DMAC perspective, we should establish a plan for whether/how to include IOOS Marine Life data in the IOOS Catalog. This issue: https://github.com/ioos/marine_life_data_network/issues/52 can be a starting point for that conversation. cc: @MathewBiddle @laurabrenskelle.

I'll copy this comment over to that issue as well, but AFAIK we (IOOS) have a requirement to furnish ISO XML metadata (or perhaps DCAT JSON, not 100% sure on that alternative) to NOAA for inclusion in NOAA's enterprise data inventories.

For all of IOOS' non-bio data, it's been fairly straightforward to do this, as most of the software we use has been developed to be able to output an ISO XML metadata representation of the datasets it serves. Since that isn't the case for OBIS, MBON, or ATN (I believe), that's something we'll need to address both for including those data in the IOOS Catalog with its current capabilities and for sending them up the chain to NOAA to meet those requirements.

mwengren commented 3 months ago

@albenson-usgs: thanks for sharing the ScienceBase contact info. @benjwadams reached out to them regarding their CS-W service and referred them to the CS-W documentation that explains the compatibility issue their CS-W service has with the spec.

Hopefully, that will be productive and we can once again ingest that CS-W source here. To be honest, though, I don't recall what data actually resides there or whether it still makes sense to include it in the IOOS Catalog. We can sort that out later if the service can be fixed.