ioos / ckanext-ioos-theme

IOOS Catalog as a CKAN extension
GNU Affero General Public License v3.0

CKAN Solr Index Issues #148

Closed mwengren closed 7 years ago

mwengren commented 7 years ago

We have a problem with the Solr index and/or CKAN database integrity.

I'm reviewing NANOOS' datasets ahead of a meeting next week, and I can see several duplicated datasets that should be disallowed by CKAN's unique fileIdentifier restriction.

Here are some examples:

The OPeNDAP datasets are repeated three times (an easy way to see this is the extra number [2,3] appended to the end of the URL for each of the duplicates).

The ERDDAP datasets are repeated twice (look for the '2' at the end of URLs).
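One rough way to spot these suffixed duplicates in bulk is sketched below. It assumes you already have a plain list of dataset URL slugs (how you obtain it is left open), and it only flags a slug when the same slug without its trailing digit is also in the list. The function name and the sample slugs are hypothetical, and a trailing digit is only a hint, not proof of duplication.

```python
from collections import defaultdict


def group_suffixed_duplicates(slugs):
    """Map each base slug to the slugs that equal it plus one trailing digit (2-9).

    A slug is only reported when its base (the slug minus the final digit)
    is itself present in the input, to avoid flagging names that
    legitimately end in a number.
    """
    present = set(slugs)
    groups = defaultdict(list)
    for slug in slugs:
        if len(slug) > 1 and slug[-1] in "23456789" and slug[:-1] in present:
            groups[slug[:-1]].append(slug)
    return {base: sorted(dupes) for base, dupes in groups.items()}


# Hypothetical slugs, for illustration only.
sample = [
    "nanoos-opendap-a",
    "nanoos-opendap-a2",
    "nanoos-opendap-a3",
    "buoy-46029",
    "erddap-b",
    "erddap-b2",
]
suspects = group_suffixed_duplicates(sample)
```

Anything this reports would still need a manual check against the Registry record before being treated as a duplicate.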

Wondering if there have been any recent changes to the CKAN harvesting rules that might be the cause?

These duplicates should be disallowed because the remote Registry metadata record that each of the duplicates points to is the same XML file (so they must have the same fileIdentifier).

Possibly related to ioos/catalog-ckan#147?
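A more direct check of the uniqueness rule described above would be to group catalog records by their fileIdentifier and report any identifier held by more than one dataset. This is only a sketch: the record layout (dicts with `name` and `file_identifier` keys) and the sample values are assumptions for illustration, not the actual CKAN or harvester schema.

```python
from collections import defaultdict


def find_shared_identifiers(records):
    """Return {file_identifier: [dataset names]} for identifiers used more than once.

    `records` is assumed to be an iterable of dicts with hypothetical
    'name' and 'file_identifier' keys.
    """
    by_identifier = defaultdict(list)
    for record in records:
        by_identifier[record["file_identifier"]].append(record["name"])
    return {
        identifier: names
        for identifier, names in by_identifier.items()
        if len(names) > 1
    }


# Hypothetical records, for illustration only.
records = [
    {"name": "station-x", "file_identifier": "id-001"},
    {"name": "station-x2", "file_identifier": "id-001"},
    {"name": "station-y", "file_identifier": "id-002"},
]
shared = find_shared_identifiers(records)
```

If the uniqueness restriction were being enforced, the result would be empty.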

mwengren commented 7 years ago

I think this is possibly related to #147: we have some duplication issues in the Solr index. We may want to clear and reharvest everything (or apply another fix if there's a better option).

Here's an example: PacIOOS now has 1591 datasets, which is almost double what is listed in the Registry.

A search for 'Barbers Point' shows 3 sets of duplicates.

I suspect the same issue is happening for some other data providers whose metadata updates automatically, but maybe not for those whose records' metadata is not being updated (a metadata update is the trigger for CKAN to update a dataset). CDIP, at any rate, is another case.

mwengren commented 7 years ago

Looking better, thanks....

NANOOS ERDDAP links look cleaned up as well.

lukecampbell commented 7 years ago

the reindex just finished.

lukecampbell commented 7 years ago

I'm still looking into the sensorml2iso stuff

mwengren commented 7 years ago

Ok, I think Solr is back to normal now. I'll close this, but we should keep an eye on the ongoing health of the Solr index since it seems to have had sync issues lately.

I'll open another issue, but it would be good to be able to account for the difference between the Catalog dataset count and the Harvest Registry count somewhere. We've discussed this before but didn't have a clear plan. I think the About page is the best location.