ioos / ckanext-ioos-theme

IOOS Catalog as a CKAN extension
GNU Affero General Public License v3.0

Resolve AmeriGEOSS Harvesting of IOOS Catalog #163

Closed mwengren closed 4 years ago

mwengren commented 6 years ago

AmeriGEOSS datahub (https://data.amerigeoss.org/organization/ioos) is planning to harvest the full IOOS Catalog inventory.

This issue is to track diagnosing harvesting issues they've had with some IOOS metadata and ensure they are able to fully harvest our inventory.

Last error message received from AmeriGEOSS:

6663 Objects were found for harvest from IOOS

4816 Validated Datasets were found

1847 Validation errors:

1846 of the errors were of the "alphanumeric" type similar to the following:

ValidationError: {u'tags': [u'Tag "OCEANS > OCEAN CIRCULATION > OCEAN CURRENTS" must be alphanumeric characters or symbols: -_.', {}, {}, {}, {}, {},
{}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}]}

1 validation error was only this:

ValidationError: {u'tags': [u'Tag " " length is less than minimum 2', {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}]}

benjwadams commented 6 years ago

I'm trying to run a local CKAN with a similar version number and kick off a harvest to see what the effect will be.

benjwadams commented 6 years ago

It's odd because those tags should not be getting through with the angle brackets: https://github.com/ckan/ckan/blob/5b8192c70defe29541088ae45870456846d8622c/ckan/logic/validators.py#L424-L430
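To make the two failure modes above concrete, here is a minimal sketch of the kind of checks that would produce those messages. It is not CKAN's code: the name `check_tag`, the pattern, and the minimum length are assumptions modelled on the error text in this issue (the pattern also allows spaces, which is an assumption based on the " " tag failing only the length check).

```python
import re

# Assumed values, inferred from the error messages in this issue;
# not copied from CKAN's validators.
MIN_TAG_LENGTH = 2
TAG_PATTERN = re.compile(r"[\w \-.]*$", re.UNICODE)


def check_tag(tag):
    """Return a list of problems with a tag name (empty if it looks valid)."""
    problems = []
    if len(tag) < MIN_TAG_LENGTH:
        problems.append("length is less than minimum %d" % MIN_TAG_LENGTH)
    # match() anchors at the start and the pattern ends with $, so any
    # character outside the allowed set (such as '>') makes this fail.
    if not TAG_PATTERN.match(tag):
        problems.append("must be alphanumeric characters or symbols: -_.")
    return problems
```

Under these assumptions, `check_tag("OCEANS > OCEAN CIRCULATION > OCEAN CURRENTS")` reports the "alphanumeric" problem and `check_tag(" ")` reports the length problem, matching the two error types AmeriGEOSS saw.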

mwengren commented 6 years ago

Discuss this topic on a future CKAN working group call: how should GCMD keywords be handled in CKAN? Is there a convention used by data.gov, data.noaa.gov, or others, given that the '>' in GCMD keywords isn't supposed to be accepted by CKAN?

We need to figure out a way to make a valid GCMD keyword for CKAN, because many IOOS datasets include these and they are among the top 10 tag filters on data.ioos.us.

For example, filtering by the GCMD Salinity keyword 'Oceans > Salinity/Density > Salinity':

https://data.ioos.us/dataset?tags=Oceans+%3E+Salinity%2FDensity+%3E+Salinity&_tags_limit=0 yields 686 datasets at the moment. We want to preserve the ability to filter by GCMD keywords in a straightforward way.
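For reference, a small sketch of how a filter URL like the one above can be built from a raw GCMD keyword. The helper name `tag_filter_url` is hypothetical; the base URL and the `tags` and `_tags_limit` parameters are taken from the link above.

```python
from urllib.parse import urlencode

# Base URL taken from the catalog link above.
BASE_URL = "https://data.ioos.us/dataset"


def tag_filter_url(tag, base_url=BASE_URL):
    """Build a dataset-search URL filtered by a single tag (hypothetical helper)."""
    # urlencode percent-encodes '>' and '/' and turns spaces into '+'.
    query = urlencode({"tags": tag, "_tags_limit": 0})
    return base_url + "?" + query


print(tag_filter_url("Oceans > Salinity/Density > Salinity"))
```

The point is that the '>' and '/' characters survive in the tag itself and are only escaped in the URL, so the filter depends on CKAN storing the full GCMD keyword as one tag.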

benjwadams commented 6 years ago

NOAA's data catalog appears to also be using the GCMD conventions:

https://data.noaa.gov/dataset/?sort=score+desc%2C+metadata_modified+desc&tags=CONTINENT+%3E+NORTH+AMERICA+%3E+UNITED+STATES+OF+AMERI&q=ocean+science&_res_format_limit=0

mwengren commented 5 years ago

Pinged AmeriGEOSS about their plans.

mwengren commented 5 years ago

Met with AmeriGEOSS on 3/20. They plan to upgrade to CKAN 2.8 and will reach out to set up a follow-up call for assistance if necessary. 2.8 should help with their harvest of IOOS' data.

We agreed to help them out with deployment/configuration advice for CKAN as a fellow OOS/OSS organization contributing to the cause. Moving this to the next release to track.

benjwadams commented 5 years ago

For what it's worth, #200 now splits GCMD keywords into separate tags, so I imagine many of the errors in the original issue will go away.
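As an illustration only (this is not the code from #200, and `split_gcmd_keyword` is a hypothetical name), splitting a GCMD keyword on '>' could look roughly like this:

```python
def split_gcmd_keyword(keyword):
    """Split a GCMD keyword such as 'A > B > C' into separate tag strings."""
    # Break on the '>' hierarchy separator and trim surrounding whitespace.
    parts = [part.strip() for part in keyword.split(">")]
    # Drop empty pieces so blank keywords do not become empty tags.
    return [part for part in parts if part]


print(split_gcmd_keyword("OCEANS > OCEAN CIRCULATION > OCEAN CURRENTS"))
```

Each resulting piece no longer contains '>', and a blank keyword like " " yields no tags at all. A piece such as `Salinity/Density` still contains '/', which is not in the `-_.` set named in the error message, so splitting alone may not clear every validation error.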

mwengren commented 4 years ago

Good news: I got an email from Rich Frazier saying they'd upgraded to CKAN 2.8 and that it helped in various ways, including being able to harvest our metadata.

See: https://data.amerigeoss.org/organization/ioos

Dates in here are as recent as 7/15/19, so I think we can finally close this one out.