Duplicate datasets for NDBC, CO-OPS

ioos / ckanext-ioos-theme

IOOS Catalog as a CKAN extension

GNU Affero General Public License v3.0

Duplicate datasets for NDBC, CO-OPS #143

Closed mwengren closed 7 years ago

mwengren commented 7 years ago

Looks like we have duplicates being harvested for at least CO-OPS and NDBC. Example for NDBC and CO-OPS.

The counts in the Registry look correct: https://registry.ioos.us/about. Looks like it might be a Solr problem since the harvest job has the correct count: https://data.ioos.us/harvest/admin/noaa-co-ops-waf.

lukecampbell commented 7 years ago

I am resyncing Solr. Based on a mean time of 10s to reindex one record, I estimate it will take 22h to complete. This is a very rough estimate.
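For reference, the arithmetic behind an estimate like this can be sketched as below. The function names are hypothetical, and the record count is not stated in the thread; it is only what the 10s and 22h figures imply if the estimate is simply records times seconds per record.

```python
def estimate_reindex_hours(record_count: int, seconds_per_record: float) -> float:
    """Rough wall-clock estimate for a serial reindex, in hours."""
    return record_count * seconds_per_record / 3600.0


def implied_record_count(hours: float, seconds_per_record: float) -> int:
    """Invert the estimate: how many records a given duration implies."""
    return round(hours * 3600.0 / seconds_per_record)


# Assumption: the 22h figure came from records * 10s, with no parallelism.
records = implied_record_count(22, 10)
hours = estimate_reindex_hours(records, 10)
```

An estimate of this kind assumes every record costs the same and that the reindex runs serially, so it can be far off in either direction.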

lukecampbell commented 7 years ago

OK, the good news is that the estimate was WAY off. I just checked and the reindex is done. So that's great news.

lukecampbell commented 7 years ago

Looks like there are still dangling records after the reindex. I'll need to look into this further.

lukecampbell commented 7 years ago

There were a bunch of failed harvest jobs in Feb. I wonder if there are duplicates because the harvest job crashed instead of finishing. I'm going to try clearing the harvest source and reharvesting to see if that fixes it.

benjwadams commented 7 years ago

@mwengren, last week I manually cleaned up a lot of duplicates via some shell magic. Could you please take a look and see whether most of them are gone now? I was going by dataset title. Some datasets share a name but correspond to different time coverages, and I was not trying to eliminate those, e.g. datasets such as these: https://data.ioos.us/dataset?q=newport+automated&sort=title_string+asc&ext_bbox=&ext_prev_extent=-154.68749999999997%2C-80.17871349622823%2C154.68749999999997%2C80.17871349622823
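The title-based check described above could be sketched roughly as follows. This is not the shell cleanup that was actually run. It is a minimal Python illustration that treats records as duplicates only when both the title and the time coverage match. The field names `temporal_start` and `temporal_end` and the sample records are hypothetical; real CKAN dataset dicts may store coverage differently.

```python
from collections import defaultdict


def find_duplicate_datasets(datasets):
    """Group dataset records by (title, time coverage) and return the
    groups that contain more than one record.

    Assumption: each record is a dict with "id" and "title" keys, plus
    optional "temporal_start"/"temporal_end" keys (hypothetical names).
    """
    groups = defaultdict(list)
    for ds in datasets:
        key = (
            ds["title"].strip().lower(),
            ds.get("temporal_start"),
            ds.get("temporal_end"),
        )
        groups[key].append(ds["id"])
    # Keep only keys that more than one record maps to.
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Hypothetical sample: "a" and "b" are true duplicates; "c" and "d" share
# a title but cover different periods, so they are not flagged.
sample = [
    {"id": "a", "title": "Station 1", "temporal_start": "2016", "temporal_end": "2017"},
    {"id": "b", "title": "Station 1", "temporal_start": "2016", "temporal_end": "2017"},
    {"id": "c", "title": "Pier Shore Station", "temporal_start": "2005", "temporal_end": "2010"},
    {"id": "d", "title": "Pier Shore Station", "temporal_start": "2010", "temporal_end": "2015"},
]

duplicates = find_duplicate_datasets(sample)
```

Keying on coverage as well as title avoids flagging same-named datasets that legitimately cover different time ranges.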

mwengren commented 7 years ago

@benjwadams @lukecampbell It looks like the original duplicate issue was fixed by the clear/reharvest. Those SCCOOS duplicates are OK because they have different time coverage, although ideally their THREDDS catalog would spell that out in the dataset titles.

So the conclusion is that failed harvests caused the duplicated records in the CKAN database, and Solr was actually OK? I only noticed the CO-OPS and NDBC duplicates; it looked like there were double the number of datasets for each.