ioos / catalog

IOOS Catalog general repo for documentation and issues
https://ioos.github.io/catalog/
MIT License

Investigate duplicate datasets not removed by "clear duplicates" scripts #51

Closed by rsignell-usgs 6 years ago

rsignell-usgs commented 7 years ago

@kknee or @mwengren can you please have someone run the "clear duplicates" script referenced by @lukecampbell here:

https://github.com/ioos/catalog/issues/42#issuecomment-295267423

I want to give an IOOS Catalog demo in a webinar tomorrow at 4pm, and currently I'm getting lots of duplicates.

For example, I'm getting 3 copies of this ROMS dataset with this query.

[screenshot 2017-08-21_17-15-20: search results showing three copies of the ROMS dataset]

I checked the OPeNDAP endpoints on these 3 datasets and they are identical: http://tds.marine.rutgers.edu/thredds/dodsC/roms/espresso/2013_da/his/ESPRESSO_Real-Time_v2_History_Best.html
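This kind of duplication can also be confirmed programmatically by grouping catalog search results by their OPeNDAP resource URL. A minimal sketch against the standard CKAN `package_search` action; the filtering logic and the helper name `find_duplicate_endpoints` are illustrative, not part of the catalog's tooling:

```python
from collections import defaultdict

def find_duplicate_endpoints(datasets):
    """Group CKAN dataset dicts by OPeNDAP resource URL; return URLs
    that appear on more than one dataset (i.e. likely duplicates)."""
    by_url = defaultdict(list)
    for ds in datasets:
        for res in ds.get("resources", []):
            if res.get("format", "").upper() == "OPENDAP" and res.get("url"):
                by_url[res["url"]].append(ds["name"])
    return {url: names for url, names in by_url.items() if len(names) > 1}

# Usage (fetching live results from the catalog's CKAN API):
# import json, urllib.request
# url = "https://data.ioos.us/api/3/action/package_search?q=ROMS&rows=100"
# with urllib.request.urlopen(url) as resp:
#     results = json.load(resp)["result"]["results"]
# print(find_duplicate_endpoints(results))
```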

mwengren commented 7 years ago

They will probably see it anyway, but in case not, I'm including @ericmbernier and @benjwadams, who will most likely be taking care of this.

ericmbernier commented 7 years ago

I took a look at this. Ben is unfortunately out today, and I am not aware of the script Luke references in Rich's linked issue. I asked others who work on IOOS projects here, and they are not aware of the script either. We all agreed it would be best not to poke around a production server running rogue, undocumented scripts with a demo so close. Sorry for any inconvenience this may cause.

rsignell-usgs commented 7 years ago

Is it easy to get a list of all the WAF URLs that are harvested by the catalog?

benjwadams commented 7 years ago

@rsignell-usgs , the duplicate elimination scripts are run periodically as is, so either they aren't catching this particular case or there's some other cause.

https://data.ioos.us/harvest has the list of WAFs we harvest from. I don't know if there is access to this information through the CKAN API as well.
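If ckanext-harvest is installed (which is how CKAN catalogs usually manage WAF harvesting), it adds a `harvest_source_list` action to the API; whether that action is enabled and publicly readable on data.ioos.us is an assumption here. A sketch that pulls the WAF URLs out of that action's JSON response:

```python
import json

def waf_urls(response_json):
    """Extract harvest-source URLs from a harvest_source_list API response."""
    payload = json.loads(response_json)
    return [src["url"] for src in payload.get("result", []) if src.get("url")]

# Usage (assumes the action is enabled and publicly readable):
# import urllib.request
# with urllib.request.urlopen(
#         "https://data.ioos.us/api/3/action/harvest_source_list") as resp:
#     print("\n".join(waf_urls(resp.read())))
```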

rsignell-usgs commented 7 years ago

@benjwadams, do you want me to raise another issue like "investigate why duplicate elimination script is not eliminating these duplicates"?

lukecampbell commented 7 years ago

The "duplicates" that the scripts clean up are from failed harvests. When a harvest is run, if it fails for whatever reason the datasets that it attempted to harvest remain in CKAN until a third party cleans them up, namely us.

The specific SQL command used is here: https://github.com/ioos/catalog-docker-ckan/blob/master/contrib/scripts/cleanup_duplicates.sql

You can safely run it just about any time, or at least you could. I don't know how the system has changed since.
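The SQL linked above is the authoritative cleanup logic; as a rough illustration only, the idea reduces to keeping the most recently modified record per harvest identifier and deleting the rest. The function and field names below are illustrative, not taken from the script:

```python
def keep_newest(records):
    """Given records with 'id', 'guid', and 'metadata_modified' keys,
    return the ids to delete, keeping only the newest record per guid."""
    newest = {}
    for rec in records:
        guid = rec["guid"]
        if (guid not in newest
                or rec["metadata_modified"] > newest[guid]["metadata_modified"]):
            newest[guid] = rec
    keep_ids = {rec["id"] for rec in newest.values()}
    return [rec["id"] for rec in records if rec["id"] not in keep_ids]
```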

benjwadams commented 7 years ago

@rsignell-usgs, I'll just rename the issue title and keep this one open.

benjwadams commented 7 years ago

@lukecampbell, the script should be running regularly, as it's in the /etc/crontab file:

```
25 0 * * * root psql -h postgis -U ckanadmin ckan -f /scripts/cleanup_duplicates.sql 2>&1 | /usr/bin/logger -t cleanup_duplicates
```

I tried running the query manually as well and then refreshing the Solr index. I then looked at a request approximating Rich's original in the screenshot, namely: https://data.ioos.us/dataset?q=ROMS+-OLD+-Averages&organization=maracoos&res_format=OPeNDAP&tags=sea_water_potential_temperature

That doesn't seem to have fixed it, so there must be something else at play here.

lukecampbell commented 7 years ago

Looks like the WAF has two entries with the same name and description. Could there be duplicates in the WAF?

mwengren commented 7 years ago

Did some troubleshooting in the Registry and CKAN.

There are four separate 'ROMS' records published by MARACOOS in the Registry.

http://tds.marine.rutgers.edu/thredds/iso/roms/espresso/2009_da/avg?catalog=http%3A%2F%2Ftds.marine.rutgers.edu%2Fthredds%2Froms%2Fespresso%2F2009_da%2Fcatalog.html&dataset=espresso_2009_da_averages

http://tds.marine.rutgers.edu/thredds/iso/roms/espresso/2009_da/his?catalog=http%3A%2F%2Ftds.marine.rutgers.edu%2Fthredds%2Froms%2Fespresso%2F2009_da%2Fcatalog.html&dataset=espresso_2009_da_history

http://tds.marine.rutgers.edu/thredds/iso/roms/espresso/2013_da/avg/ESPRESSO_Real-Time_v2_Averages_Best?catalog=http%3A%2F%2Ftds.marine.rutgers.edu%2Fthredds%2Fcatalog%2Froms%2Fespresso%2F2013_da%2Favg%2Fcatalog.xml&dataset=roms%2Fespresso%2F2013_da%2Favg%2FESPRESSO_Real-Time_v2_Averages_Best

http://tds.marine.rutgers.edu/thredds/iso/roms/espresso/2013_da/his/ESPRESSO_Real-Time_v2_History_Best?catalog=http%3A%2F%2Ftds.marine.rutgers.edu%2Fthredds%2Fcatalog%2Froms%2Fespresso%2F2013_da%2Fhis%2Fcatalog.html&dataset=roms%2Fespresso%2F2013_da%2Fhis%2FESPRESSO_Real-Time_v2_History_Best

But each of these is multiplied 3x in CKAN's Solr search results: 12 records returned.

So there is an issue with the Solr index either not being updated properly or duplicating search results 3x.

I don't see any evidence of this outside MARACOOS (though I admit I didn't look that closely). We may need to look into issues with Solr. I wonder if the pycsw search results show similar duplication? I doubt it, but I haven't checked.

lukecampbell commented 7 years ago

A while back we made the decision to remove the CKAN restriction and allow records with the same title through; I wonder if that decision is showing consequences. There may be some internal mechanics of Solr that don't behave well when duplicate titles are allowed. Maybe there's code somewhere that keys off the title instead of the entity relationships.

mwengren commented 7 years ago

Same titles or same fileIdentifiers? Same titles is probably fine, but I don't remember removing the fileIdentifier restriction. Since we're able to report the CKAN harvest job results, including those errors, back to users in the Registry, we should still disallow those duplicates.

I found another example of duplicate records similar to the ROMS results (at least in Solr; I can't check the db): https://data.ioos.us/organization/hf-radar-dac?q=gulf+coast+scripps&sort=score+desc%2C+metadata_modified+desc

HFR records from SCCOOS are duplicated, but NDBC records are not (https://data.ioos.us/organization/hf-radar-dac?q=+gulf+coast+NDBC&sort=score+desc%2C+metadata_modified+desc).

So some records clearly have duplicates in either the CKAN db or the Solr index. We need to troubleshoot where they are and how to resolve them.

benjwadams commented 7 years ago

There are Solr index entries with duplicate names for some reason. Output of the following command is in the gist below; the sed expressions were necessary to pretty-print the embedded JSON-like CKAN attributes:

```shell
curl -s "localhost:3001/solr/ckan/query?q=name:near-real-time-surface-ocean-velocity-u-s-eastand-gulf-coast-1-km-resolution2&rows=1000000" 2> /dev/null \
  | sed -e 's/\\\+"/"/g' -e 's/"{"/{"/g' -e 's/"}"/"}/' -e 's/"\[{"/[{"/g' -e 's/"}\]"/"}]/g' -e 's/\]"/]/g' -e 's/"\[/[/g' -e 's/}"/}/g' -e 's/\\\\n/ /g' \
  | jq . > /tmp/solr_out.txt
```

https://gist.github.com/benjwadams/739b9014abf72eec9090ea371465ff24
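The same check can be made without the sed gymnastics by letting Python parse Solr's JSON response and tallying documents per name. The query URL mirrors the curl command above (adding `wt=json`), and `count_by_name` is a hypothetical helper:

```python
from collections import Counter

SOLR_QUERY = ("http://localhost:3001/solr/ckan/query"
              "?q=name:near-real-time-surface-ocean-velocity-u-s-eastand-gulf-coast-1-km-resolution2"
              "&rows=1000000&wt=json")

def count_by_name(docs):
    """Tally Solr documents per 'name'; any count > 1 is a duplicate index entry."""
    counts = Counter(doc.get("name") for doc in docs)
    return {name: n for name, n in counts.items() if n > 1}

# Usage:
# import json, urllib.request
# with urllib.request.urlopen(SOLR_QUERY) as resp:
#     docs = json.load(resp)["response"]["docs"]
#     print(count_by_name(docs))
```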

mwengren commented 7 years ago

Possibly related ioos/catalog-ckan#158

benjwadams commented 7 years ago

I cleared out a bunch of duplicate records from Solr this morning. The duplicates in the example given are gone. A couple of harvest cycles have completed and the duplicates aren't recurring for the time being, but I'll keep an eye out for any that pop up.

mwengren commented 6 years ago

This seems to be resolved now. Closing.