ioos / ckanext-ioos-theme

IOOS Catalog as a CKAN extension
GNU Affero General Public License v3.0

SOLR search results: duplicated CKAN datasets #252

Closed mwengren closed 1 month ago

mwengren commented 3 months ago

I noticed several issues with the Solr search index on a spot check today.

  1. Overall dataset count is too high because of duplicated datasets. Take GCOOS for example, which has 59,000+ datasets listed on the default datasets view. If you add up the four individual GCOOS harvest WAFs, the total is only 231 + 156 + 24971 + 906 = 26,264. So there is some dataset duplication. I am not sure whether this only affects GCOOS or is more widespread.

  2. Poor search results. If I try to search by an individual GCOOS dataset id (see this search for 'Data for ioos-station-wmo-42400'), I get essentially a full list of datasets returned (~76,068 total datasets). The results do at least appear to be sorted (most relevant results at top), but essentially no filtering is applied to the result set. Also, the results list from this search contains three copies of the dataset titled 'Data for ioos-station-wmo-42400', which is an example of the duplication described in the first issue above.

We need to look into how to restore the proper search functionality in the current version of SOLR and also troubleshoot why some datasets are clearly duplicated.

cc @benjwadams

benjwadams commented 3 months ago

This is duplicated at the database level, so clearing up Solr indices will result in more duplicates the next time AWS is run.

I have a couple system scripts specifically for clearing out database duplicates, although it usually doesn't need to run for 10k+ datasets so it may run somewhat slowly. I'm running it now and it will take a while.

benjwadams commented 3 months ago

Duplicates have been cleared from database and Solr. Closing issue.

mwengren commented 2 months ago

This seems to still be present as of 4/8/24. GCOOS =~ 54000 datasets, Glider DAC =~ 6800.

Still need to diagnose what's causing the counts to be off.

What's the issue with the CKAN database that we're seeing so many duplicates, if it's not a problem in Solr?

Separately, there is also a Solr search results issue. If this should be separated into another issue, let's do that:

Solr is giving poor search filtering results from its index. If I search for 'osu592-20230524T1813-delayed' for example, I get 71,000 results. No filtering applied.

There can't be 71000 datasets with that string occurring in the index, so something is not working that used to work (in earlier Solr versions?).
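To put a number on the "no filtering" symptom, the count returned for a specific query can be compared with the unfiltered total using CKAN's `package_search` action, which reports a `count` in its result. The following is only a rough sketch: the base URL is a placeholder and the helper names are hypothetical, not part of this extension.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Placeholder base URL (assumption); substitute the real catalog address.
CATALOG_URL = "https://catalog.example.org"


def build_search_url(base_url, query, rows=0):
    """Build a CKAN package_search URL; rows=0 asks for the count only."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url}/api/3/action/package_search?{params}"


def result_count(base_url, query):
    """Return the number of datasets CKAN reports for a query (needs network)."""
    with urlopen(build_search_url(base_url, query)) as response:
        return json.load(response)["result"]["count"]


def filtering_ratio(base_url, query):
    """Fraction of the whole catalog matched by the query.

    A value close to 1.0 means the query is barely narrowing the results.
    """
    total = result_count(base_url, "*:*")
    return result_count(base_url, query) / total if total else 0.0
```

Calling `filtering_ratio(CATALOG_URL, "osu592-20230524T1813-delayed")` against the live catalog would show how much of the index a single dataset id is matching.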

mwengren commented 2 months ago

I created a new issue https://github.com/ioos/ckanext-ioos-theme/issues/253 to track the Solr issue separately.

mwengren commented 2 months ago

@benjwadams says that he'll need to clear out the CKAN database and reharvest. The issue may recur.

It is unclear whether the CKAN database uses the ISO XML title value or the XML 'fileIdentifier' field value. We'll need to keep an eye on this to understand how to minimize the consequences going forward.

mwengren commented 1 month ago

We're still seeing the issue with inaccurate dataset counts. Here's a tally for GCOOS:

We're getting roughly 2x the count in the SOLR index as in the database. Coincidence?

@benjwadams If the harvest counts as listed above are coming from the CKAN database, how do you know that this is caused by duplicated datasets in the CKAN database and not an issue in SOLR?

I'm not sure if this affects other providers than GCOOS.

Spot check of AOOS looks better: SOLR dataset count for AOOS: 2,765, harvest source counts: AOOS ERDDAP WAF 2607 + AOOS WAF 127 = 2,734. Would be better if those matched exactly, but this is good enough all things considered.
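The spot check above is simple enough to script so it can be repeated per provider. This is a minimal sketch using the AOOS figures quoted in this comment; the function name and the returned fields are made up for illustration, and the counts would have to be gathered by hand or from the catalog.

```python
def compare_counts(solr_count, harvest_counts):
    """Compare a provider's Solr dataset count with its harvest source counts.

    harvest_counts maps harvest source name -> number of datasets harvested.
    """
    expected = sum(harvest_counts.values())
    return {
        "expected": expected,
        "surplus": solr_count - expected,  # extra datasets in the index
        "ratio": solr_count / expected,    # ~2.0 would suggest wholesale duplication
    }


# AOOS figures from the spot check above.
aoos = compare_counts(2765, {"AOOS ERDDAP WAF": 2607, "AOOS WAF": 127})
print(aoos["expected"], aoos["surplus"])  # 2734 31
```

A ratio near 1.0, as for AOOS, points to a handful of stray records, while a ratio near 2.0 would point to a whole source being harvested twice.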

benjwadams commented 1 month ago

Counts have improved considerably since deduplication scripts have run.

mwengren commented 1 month ago

@benjwadams has manual cleanup scripts that remove duplicates from the database first, and then clear the corresponding datasets from the SOLR index. The scripts check the dataset ID field for a numeric suffix and remove the dataset if one is present (a numeric suffix indicates a potential duplicate dataset). Not an entirely safe check, but the best that we have.
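For illustration only, here is a sketch of what a numeric-suffix check could look like. It is not the actual cleanup script. It assumes duplicates show up as an existing identifier with extra digits appended, and it only flags an identifier when the shorter form also exists, which is a slightly more conservative variant than checking for a trailing number alone. The example identifiers are hypothetical.

```python
import re


def find_suffix_duplicates(names):
    """Map each suspected duplicate identifier to the identifier it extends.

    An identifier is flagged only if removing one or more of its trailing
    digits yields another identifier that is also present in the list.
    """
    existing = set(names)
    candidates = {}
    for name in names:
        trailing = re.search(r"\d+$", name)
        if not trailing:
            continue  # no numeric suffix at all
        # Strip 1, 2, ... trailing digits and look for an existing base.
        for cut in range(1, len(trailing.group()) + 1):
            base = name[:-cut]
            if base in existing:
                candidates[name] = base
                break
    return candidates


# Hypothetical identifiers; real ones would come from the CKAN database.
names = [
    "data-for-ioos-station-wmo-42400",
    "data-for-ioos-station-wmo-424001",
    "data-for-ioos-station-wmo-424002",
    "glider-2023",
]
print(find_suffix_duplicates(names))
```

Identifiers that legitimately end in digits, such as station numbers, are skipped unless a shorter identifier matches them, but false positives remain possible, so the output is best treated as a review list rather than a delete list.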

Duplicates can arise when a harvest source is removed from CKAN but its datasets are not fully cleared from the database as part of the removal process. If the same source is re-harvested afterwards, duplicate datasets can be created.

We can keep this as a manual option to be run if necessary. Minor changes to the script would be needed to automate it and run it routinely.