CKAN Solr index consistency

mwengren commented 7 years ago

We still have some lingering CKAN Solr index issues to sort out. Improvements have been made in the past, but CDIP is a current example of the problem.

CDIP lists 140 datasets on the Registry About page, but there are 468 listed in CKAN/Solr.

If you search for a particular keyword, many of the results are dead/repeated links to outdated entries in the Solr index.

mwengren commented 7 years ago

@benjwadams Can you look into what is going on with the CDIP harvest/database? The CDIP dataset count keeps increasing, to multiples of what the Registry lists. It seems something is not being cleared properly. The count is up from 468 last week to 559 now.

benjwadams commented 7 years ago

Hey Micah, I'm looking into a fix for this.

benjwadams commented 7 years ago

It looks like there might be stale records due to truncation of the dataset names CKAN uses? At the moment this is just a hunch and will have to be followed up on.

In the interim, I took the Solr index, grabbed all the names from it, and found any names not stored in the database. A good number of these will 404, as there is no corresponding entry in the database. I checked to see which returned 404, and removed anything from Solr that caused an error. This isn't a permanent fix, but should fix many of the issues with the counts being so high.

There are now 176 datasets matching CDIP in CKAN, compared to 139 listed by Harvest Registry as of this writing. This probably indicates there are still duplicates on the CKAN side, but it should fix many of the stale links which were previously present.
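
For reference, the interim cleanup described above could be sketched roughly as follows. This is not the script that was actually run, only a minimal illustration under stated assumptions: the function names are hypothetical, `status_for` stands in for whatever HTTP check was used, and the Solr field holding the dataset slug is assumed to be called `name`.

```python
import json


def find_stale_names(solr_names, db_names):
    """Return names present in the Solr index but absent from the database."""
    return sorted(set(solr_names) - set(db_names))


def confirm_missing(stale_names, status_for):
    """Keep only the names whose dataset page returns 404.

    `status_for` is a caller-supplied (hypothetical) function that maps a
    dataset name to an HTTP status code, e.g. a thin wrapper around a GET.
    """
    return [name for name in stale_names if status_for(name) == 404]


def build_delete_payload(names):
    """Build a Solr JSON delete-by-query body for the given dataset names.

    Assumes the index stores the dataset slug in a field called `name`.
    Returns None when there is nothing to delete.
    """
    if not names:
        return None
    quoted = " OR ".join(
        '"{}"'.format(n.replace("\\", "\\\\").replace('"', '\\"'))
        for n in names
    )
    return json.dumps({"delete": {"query": "name:({})".format(quoted)}})


if __name__ == "__main__":
    # Made-up names, purely for illustration.
    solr = ["buoy-a", "buoy-b", "buoy-a-old"]
    db = ["buoy-a", "buoy-b"]
    stale = find_stale_names(solr, db)
    confirmed = confirm_missing(stale, lambda name: 404)
    print(build_delete_payload(confirmed))
```

The resulting body would then be POSTed to the Solr core's update handler and followed by a commit. Checking for a 404 before deleting keeps the cleanup conservative: a name missing from the database listing but still resolving in CKAN is left alone.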

mwengren commented 7 years ago

Just an update on the CDIP dataset count discrepancy.

In the Harvest Registry, there are 285.

In CKAN, there are 450, so we still need to work out why Solr isn't properly purging records (or whether something else is going on).

benjwadams commented 7 years ago

I cleaned out a number of the old/redundant CDIP links today, especially the "Wave Parameter..." models. I've reharvested a number of times today, and so far they haven't popped up again. I'll keep an eye out for more duplicates.

mwengren commented 7 years ago

@benjwadams the CDIP Harvest Registry (287) and IOOS Catalog (273) look good to me at the moment. The Registry lists 15 dataset harvest errors due to duplicate IDs in the CDIP metadata, so the numbers nearly add up: 287 - 15 = 272, one short of the Catalog's 273. That's close enough to satisfy me, so let's call the immediate case closed.

One extra record is not a big deal. But I'm still concerned the issue will crop up again and that difference will grow, so we'll have to keep an eye on it, unless there's a code change we know will prevent recurrence. I'll leave it to you guys to determine that.

mwengren commented 6 years ago

I would love to close this one out finally, but there is still some discrepancy with the CDIP organization. I don't know that I've seen this before, but it seems there are more datasets associated with the CDIP organization than there are in the harvest. Not sure how that can be as we have 1:1 for organization to harvest.

Organization (319): https://data.ioos.us/organization/cdip
Harvest (288): https://data.ioos.us/harvest/cdip-waf

The harvest count seems correct though compared to the Registry. Registry lists 303 datasets, with 15 errors in the CKAN harvest due to duplicate IDs found by CKAN. 303 - 15 = 288, so the Registry -> CKAN harvest matches up. Somehow there are just extra datasets affiliated with the CDIP org.
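
The bookkeeping in this comment can be written down as a small check. This is only a sketch of the arithmetic; the function and key names are hypothetical and not part of CKAN or the Registry.

```python
def reconcile(registry_total, harvest_errors, harvest_count, org_count):
    """Compare Registry, harvest, and organization dataset counts.

    expected_harvest: datasets the harvest should hold (Registry minus errors)
    harvest_gap:      harvested datasets beyond that expectation
    org_gap:          datasets tied to the organization but not to the harvest
    """
    expected = registry_total - harvest_errors
    return {
        "expected_harvest": expected,
        "harvest_gap": harvest_count - expected,
        "org_gap": org_count - harvest_count,
    }


if __name__ == "__main__":
    # Figures quoted in this comment: Registry 303, 15 errors,
    # harvest 288, organization 319.
    print(reconcile(303, 15, 288, 319))
```

With the figures above, the harvest gap is 0 and the organization gap is 31, which is the set of extra datasets that still needs explaining.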

Otherwise, from a quick scan of other organizations, there don't seem to be any more issues with Solr that I can see. @benjwadams can you look into what we might be able to do to clean the extra datasets out of the CDIP org and clear/reharvest the harvest WAF again? Then maybe we can finally call this one done.

benjwadams commented 6 years ago

@mwengren, I looked at this more today. It looks like there's something going on with the "Wave Parameter" nowcast data. The harvester appears to be treating old nowcast data as separate datasets from more up-to-date ones. See here: cdip_test.txt

Removing the old nowcast data reduced the dataset count in the catalog from 304 to 289, compared with the Registry's 288. Also note that 304 - 289 = 15, which matches the 15 mentioned in your comment above dated October 5, 2017. I'm not sure where the one remaining dataset that makes up the discrepancy is coming from.

Notice that for the same area, there are datasets for 12/18/2017 as well as 12/20/2017. For some reason, the harvest process is picking these up as entirely new files. It looks like the summary and time-coverage-related global attributes are changing. I'm not yet sure if the GUID is changing between nowcast runs, but I've archived the current (12/20/2017) nowcast run metadata to compare against future ones.

My hunch is that something is flagging this as a new dataset during the harvest/package info stage:

https://github.com/ckan/ckanext-spatial/blob/master/ckanext/spatial/harvesters/base.py#L167

benjwadams commented 6 years ago

Update: this is caused by a changing "id" attribute in the netCDF metadata, which makes CKAN treat the data in the WAF as a new dataset even though it points to the same THREDDS URL.
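
One way to spot this pattern in harvested records is to group them by the URL they point at and flag any URL that appears under more than one identifier. The sketch below assumes each record has already been reduced to a dict with hypothetical `id` and `url` keys; it does not use any CKAN API, and the URLs are placeholders.

```python
from collections import defaultdict


def urls_with_multiple_ids(records):
    """Return {url: sorted ids} for every URL seen under more than one id."""
    grouped = defaultdict(set)
    for record in records:
        grouped[record["url"]].add(record["id"])
    return {
        url: sorted(ids) for url, ids in grouped.items() if len(ids) > 1
    }


if __name__ == "__main__":
    # Placeholder records: two nowcast runs with different ids, same URL.
    records = [
        {"id": "nowcast_20171218", "url": "https://example.org/thredds/area1.nc"},
        {"id": "nowcast_20171220", "url": "https://example.org/thredds/area1.nc"},
        {"id": "buoy_001", "url": "https://example.org/thredds/buoy1.nc"},
    ]
    print(urls_with_multiple_ids(records))
```

Any URL reported here is a candidate for the duplication described above, where the identifier changes between runs while the underlying data location stays fixed.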

benjwadams commented 6 years ago

I emailed Darren from CDIP (listed as the contact on Catalog Registry) regarding the cause of the issue and he changed the identifiers for the files. Hopefully this should address the duplicate datasets which were being generated. I will check after the weekend, when several harvest cycles will have run.

benjwadams commented 6 years ago

CDIP datasets are now stable at 303 datasets, which matches what's in the registry, so I'm considering this issue fixed.

mwengren commented 6 years ago

I'm reopening this issue because we have pretty widespread duplication of records in CKAN at the moment. A quick look at the About page in the Registry shows almost 2x the dataset count in CKAN vs. the Registry. I'm not sure it's a metadata issue this time, though; it seems like Solr or CKAN isn't properly updating matching records.

Some related issues with dataset duplication from the past:

ioos/catalog-ckan#143
ioos/catalog-ckan#148

Catalog dataset count ~25K: https://data.ioos.us/dataset

mwengren commented 6 years ago

@benjwadams I think once you have CKAN 2.8 up and running, let's see if we can configure it to pull from the Registry WAF on a daily basis to test whether the record duplication issues continue with 2.8 or not.

If that's easier than deploying a separate dev Registry instance for now, that should accomplish some testing of this issue. When users trigger a harvest from the Registry, they'll still be causing the production CKAN to update rather than the 2.8 version, which is what we want anyway until we're ready to switch the harvest targets over.

We'll have to replicate all the CKAN Organization info etc in 2.8 as well.

benjwadams commented 6 years ago

The dev instance is already pulling from the Registry. For the most part I don't see many repeated records. Some records also appear to be duplicates, but are SOS records with different URN identifiers, even though they de facto refer to the same station.

I cleared out some duplicate records on Friday. The remaining discrepancies between the record counts on the dev and production servers seem to be due in part to date and title changes in forecast/nowcast data.

It may be better to go with a fresh version of the Solr index when moving over to 2.8.

benjwadams commented 6 years ago

The record count is more or less stable at this point. Going to close. If there are future issues with duplicates after the upgrade, we can follow up by email or on GitHub, depending on the nature of the problem.

mwengren commented 6 years ago

@benjwadams I think when we deploy the 2.8 version of the Catalog, we should just do a fresh reharvest of the Registry WAFs, starting from a clean database in CKAN (or at least one without any of the dataset and harvest-related tables populated). And also a newly-initialized Solr index, like you mentioned.

Not sure how straightforward this is to do, but it might help in tracking down possible causes of the duplicated datasets issue.

It seems there's still some duplication going on in the dev instance, but we can address it again once we deploy 2.8.

ioos / ckanext-ioos-theme

CKAN Solr index consistency #158