Status: Closed (mwengren closed this 7 years ago)
Possibly related to #147: we have some duplication issues in the Solr index. We may want to clear the index and reharvest everything (or apply another fix if there's a better option).
Here's an example: PacIOOS now has 1591 datasets, which is almost double what is listed in the Registry.
A search for 'Barbers Point' shows 3 sets of duplicates.
I suspect the same issue is happening for some other data providers whose metadata is automatically updating, but maybe not for those where the records' metadata
Looking better, thanks....
NANOOS ERDDAP links look cleaned up as well.
The reindex just finished.
I'm still looking into the sensorml2iso stuff
Ok, I think Solr is back to normal now. I'll close this, but we should keep an eye on the ongoing health of the Solr index since it seems to have had sync issues lately.
I'll open another issue, but it would be good to be able to account for the difference between the Catalog dataset count and the Harvest Registry count somewhere. We've discussed this before but didn't have a clear plan. I think the About page is the best location.
We have a problem with the Solr index and/or CKAN database integrity.
I'm reviewing NANOOS' datasets ahead of a meeting next week, and I can see several duplicated datasets that should be disallowed by CKAN's unique fileIdentifier restriction.
Here are some examples:
NANOOS OPeNDAP Datasets
NANOOS ERDDAP Datasets.
The OPeNDAP datasets are repeated three times (an easy way to see this is that an extra number, 2 or 3, is appended to the end of the URL for each of the duplicates).
The ERDDAP datasets are repeated twice (look for the '2' at the end of URLs).
Wondering if there have been any recent changes to the CKAN harvesting rules that might be the cause?
These duplicates should be disallowed because the remote Registry metadata record that each of the duplicates points to is the same XML file (so they must have the same fileIdentifier).
Possibly related to ioos/catalog-ckan#147?
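For anyone wanting to check for this kind of duplication, here is a minimal sketch of the fileIdentifier reasoning above. It assumes the dataset records have already been exported into plain dicts; the `name` and `file_identifier` keys and the sample records are hypothetical, and this is not CKAN's or Solr's own API.

```python
from collections import defaultdict


def find_duplicate_identifiers(records):
    """Return fileIdentifiers that are shared by more than one dataset.

    `records` is an iterable of dicts with hypothetical keys
    "name" (dataset URL slug) and "file_identifier".
    """
    by_identifier = defaultdict(list)
    for record in records:
        by_identifier[record["file_identifier"]].append(record["name"])
    # Keep only identifiers that map to two or more datasets.
    return {fid: names for fid, names in by_identifier.items() if len(names) > 1}


# Hypothetical sample records, not real catalog data.
records = [
    {"name": "example-dataset", "file_identifier": "org.example:a"},
    {"name": "example-dataset2", "file_identifier": "org.example:a"},
    {"name": "example-dataset3", "file_identifier": "org.example:a"},
    {"name": "other-dataset", "file_identifier": "org.example:b"},
]

duplicates = find_duplicate_identifiers(records)
for fid, names in duplicates.items():
    print(fid, names)
```

Grouping on the identifier rather than on the numeric URL suffix avoids flagging datasets whose names legitimately end in a digit.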