SOLR search results: duplicated CKAN datasets

mwengren commented 8 months ago

I noticed several issues with the Solr search index on a spot check today.

Overall dataset count is too high (duplicated datasets): Take GCOOS, for example, which has 59,000+ datasets listed on the default datasets view. Adding up the four individual GCOOS harvest WAFs gives only 231 + 156 + 24971 + 906 = 26,264, so there is some dataset duplication. I am not sure whether this only affects GCOOS or is more widespread.
Poor search results: If I search by an individual GCOOS dataset id (see this search for 'Data for ioos-station-wmo-42400'), I get essentially the full list of datasets returned (~76,068 total datasets). The results do appear to be sorted by relevance (most relevant at top), but the query does essentially nothing to filter the result set. Also, the results list from this search contains three copies of the dataset titled 'Data for ioos-station-wmo-42400', which is an example of the duplication described in the first issue above.
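
For repeatable spot checks, the counts could be pulled from the CKAN Action API rather than read off the web pages. A minimal sketch, assuming the catalog exposes the standard `package_search` action at `https://data.ioos.us` and that the organization slug is `gcoos` (the slug and the helper names are assumptions, not something confirmed in this thread):

```python
import json
from urllib.parse import urlencode

CKAN_BASE = "https://data.ioos.us"  # catalog URL referenced in this thread


def package_search_url(org=None, query=None, base=CKAN_BASE):
    """Build a package_search URL that asks only for the match count."""
    params = {"rows": 0}  # rows=0: skip the result bodies, keep result.count
    if query:
        params["q"] = query
    if org:
        # Filter query restricting results to one organization (slug assumed)
        params["fq"] = f"organization:{org}"
    return f"{base}/api/3/action/package_search?{urlencode(params)}"


def count_from_response(body):
    """Extract the total match count from a package_search JSON response."""
    payload = json.loads(body)
    if not payload.get("success"):
        raise RuntimeError("CKAN API call did not succeed")
    return payload["result"]["count"]
```

Fetching `package_search_url(org="gcoos")` with any HTTP client and passing the response body to `count_from_response` would give the Solr-backed count for the organization. Passing `query="osu592-20230524T1813-delayed"` instead would show how many datasets a single-id search matches.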

We need to look into how to restore the proper search functionality in the current version of SOLR and also troubleshoot why some datasets are clearly duplicated.

cc @benjwadams

benjwadams commented 8 months ago

This is duplicated at the database level, so clearing up Solr indices will result in more duplicates the next time AWS is run.

I have a couple of system scripts specifically for clearing out database duplicates, although they usually don't need to handle 10k+ datasets, so they may run somewhat slowly. I'm running them now and it will take a while.

benjwadams commented 8 months ago

Duplicates have been cleared from database and Solr. Closing issue.

mwengren commented 7 months ago

This seems to still be present as of 4/8/24. GCOOS =~ 54000 datasets, Glider DAC =~ 6800.

Still need to diagnose what's causing the counts to be off.

What's the issue with the CKAN database that we're seeing so many duplicates, if it's not a problem in Solr?

Separately, there is also a Solr search results issue. If this should be separated into another issue, let's do that:

Solr is giving poor search filtering results from its index. If I search for 'osu592-20230524T1813-delayed' for example, I get 71,000 results. No filtering applied.

There can't be 71000 datasets with that string occurring in the index, so something is not working that used to work (in earlier Solr versions?).

mwengren commented 7 months ago

I created a new issue https://github.com/ioos/ckanext-ioos-theme/issues/253 to track the Solr issue separately.

mwengren commented 7 months ago

@benjwadams says that he'll need to clear out the CKAN database and reharvest. The issue may recur.

It is unclear whether the CKAN database uses the ISO XML title value or the XML 'fileIdentifier' field value. We'll need to keep an eye on this to understand how to minimize the consequences going forward.

mwengren commented 6 months ago

We're still seeing the issue with inaccurate dataset counts. Here's a tally for GCOOS:

SOLR search results for GCOOS org: 52,847
CKAN database harvest source dataset counts: 26,352 => GCOOS WAF 231 + GCOOS Biological WAF 904 + GCOOS Historical WAF 25042 + GCOOS ERDDAP WAF 175 = 26,352.

The SOLR index count is roughly 2x the count in the database. Coincidence?

@benjwadams If the harvest counts as listed above are coming from the CKAN database, how do you know that this is caused by duplicated datasets in the CKAN database and not an issue in SOLR?

I'm not sure if this affects other providers than GCOOS.

Spot check of AOOS looks better: SOLR dataset count for AOOS: 2,765, harvest source counts: AOOS ERDDAP WAF 2607 + AOOS WAF 127 = 2,734. Would be better if those matched exactly, but this is good enough all things considered.
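
To make these tallies easier to repeat, here is a small sketch that sums the harvest source counts and compares them with the Solr count. The numbers are the ones quoted above; the function name and data layout are just illustrative:

```python
def inflation_ratio(index_count, harvest_counts):
    """Return (expected_total, ratio of index count to expected total)."""
    expected = sum(harvest_counts.values())
    return expected, index_count / expected


# Harvest source counts as quoted in this comment
gcoos = {
    "GCOOS WAF": 231,
    "GCOOS Biological WAF": 904,
    "GCOOS Historical WAF": 25042,
    "GCOOS ERDDAP WAF": 175,
}
aoos = {"AOOS ERDDAP WAF": 2607, "AOOS WAF": 127}

for org, solr_count, sources in [("GCOOS", 52847, gcoos), ("AOOS", 2765, aoos)]:
    expected, ratio = inflation_ratio(solr_count, sources)
    print(f"{org}: Solr {solr_count}, harvest sources {expected}, ratio {ratio:.2f}")
```

A ratio near 1.0 (as for AOOS) suggests the index and harvest sources roughly agree, while a ratio near 2.0 (as for GCOOS) is what one would expect if most datasets were indexed twice.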

benjwadams commented 6 months ago

Counts have improved considerably since deduplication scripts have run.

mwengren commented 5 months ago

@benjwadams has manual cleanup scripts that remove duplicates from the database first, and then clear the corresponding datasets from the SOLR index. They check each dataset's ID field for a numeric suffix and remove the dataset if one is present (a numeric suffix indicates a potential duplicate). This is not an entirely safe check, but it is the best that we have.
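
For illustration only, the numeric-suffix heuristic could look something like the sketch below. This is not the actual cleanup script; the function name and the example slugs are hypothetical. It tightens the check slightly by flagging a dataset only when stripping trailing digits yields a name that also exists, so legitimate ids ending in numbers are not flagged on their own:

```python
import re
from collections import defaultdict


def suspected_duplicates(names):
    """Group dataset names that look like '<existing name><digits>'."""
    existing = set(names)
    groups = defaultdict(list)
    for name in sorted(existing):
        trailing = re.search(r"\d+$", name)
        if not trailing:
            continue
        # Try every split of the trailing digit run, shortest suffix first,
        # so 'wmo-424002' can be matched against an existing 'wmo-42400'.
        for k in range(1, len(trailing.group()) + 1):
            base = name[:-k]
            if base in existing:
                groups[base].append(name)
                break
    return dict(groups)


# Hypothetical slugs, for illustration only
names = [
    "indian-river", "indian-river2", "indian-river3",
    "ioos-station-wmo-42400", "ioos-station-wmo-424002",
    "glider-2023",
]
dupes = suspected_duplicates(names)
```

Even with the base-name check this remains a heuristic: two genuinely different datasets could still have names that differ only by trailing digits, so flagged groups would need a review (for example, comparing their harvest GUIDs) before deletion.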

Duplicates can result from harvest sources that have been removed from CKAN but datasets have not fully cleared out from database as part of the removal process. If the same source is re-harvested afterwards, this can result in duplicate datasets being created.

We can keep this as a manual option to be run if necessary. Minor changes to the script would be needed to run it routinely on an automated schedule.

mwengren commented 3 months ago

@benjwadams A spot check of IOOS Catalog today shows this is happening again, possibly in a snowball-ish sort of way.

When I looked yesterday at https://data.ioos.us/dataset/, there were approx. 80,000 datasets listed in the Solr index.

Today, it's over 90,000! IOOS has definitely not accumulated an additional 10K datasets in the past day.

Can we look into automating the aforementioned cleanup script (see my previous comment to that effect https://github.com/ioos/ckanext-ioos-theme/issues/252#issuecomment-2145478504)? Alternatively, is there something within the CKAN harvesting code that could be troubleshot to prevent the duplicated datasets from being created during the ingest process in the first place?

We can discuss the best path forward at next week's meeting but, for now, can you attempt to clear duplicates from Catalog?

GCOOS: GCOOS again appears to be the worst 'offender' (no blame intended), with over 60K datasets listed by Solr on the datasets page: https://data.ioos.us/dataset/.

The GCOOS Historical WAF source looks to be the main cause - it should have roughly 25K datasets in it but is listing 65,000 + currently: https://data.ioos.us/harvest/gcoos-waf-historical.

SECOORA: SECOORA's datasets are also a problem. In SECOORA's case, however, SOLR yields a count of 15,000+ datasets (https://data.ioos.us/dataset/?organization=secoora), whereas the primary ERDDAP harvest source (https://data.ioos.us/harvest/secoora-erddap) only shows 7900.

So in this case there are two problems: the harvest source itself contains duplicates (SECOORA's ERDDAP only has 1,600 datasets - https://erddap.secoora.org/erddap/index.html), and the SOLR count is off from the already inflated 7900 dataset count resulting from the harvest.

Overall, it seems our harvesting system isn't holding up to what it's being asked to do, or there's a major bug in CKAN's harvesting code causing all of these duplicates to be generated repeatedly. Either way, we need the Catalog to be more stable.

mwengren commented 1 month ago

As of 10/7, it looks like SECOORA is the org with the most dataset duplicates:

SECOORA ERDDAP WAF harvest source (~5300 datasets): https://data.ioos.us/harvest/secoora-erddap

SECOORA ERDDAP (~1600 datasets): https://erddap.secoora.org/erddap/index.html

Searching for 'Indian River', for example, shows multiple datasets whose URLs differ only by a numeric suffix ('https://..../dataset', 'dataset2', 'dataset3', etc.), which is the duplicate pattern described earlier.
