internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

Solr updater uses cached item in update, meaning some merge redirects and deletes are not updated correctly. #927

Closed hornc closed 3 years ago

hornc commented 6 years ago

If a deleted author is still cached in its original, un-deleted state, the Solr updater will think it exists and not remove it :( I think the problem is in https://github.com/internetarchive/openlibrary/blob/fc873f2550b3a510399a8a01de1cc428ab074b17/openlibrary/solr/data_provider.py#L170

This affects author merges: if a duplicate author page is not manually reloaded after the merge to trigger the redirect, the Solr update code pulls it from the cache as still active. It then does not send the <delete> to Solr, and actually sends an <add> instead.
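To make the failure mode concrete, here is a toy model of it. This is only a sketch: `ToyStore`, `CachingProvider`, `solr_request_for` and `demo` are hypothetical names, not the actual code in `data_provider.py`, and the record shape (`{"type": {"key": ...}}`) is an assumption.

```python
# Toy model of the stale-cache problem. Every name here is hypothetical;
# this is NOT the real Open Library data provider.
DELETE_TYPE = "/type/delete"
REDIRECT_TYPE = "/type/redirect"


class ToyStore:
    """Stands in for the database: always holds the latest version."""

    def __init__(self):
        self.docs = {}

    def get(self, key):
        return self.docs[key]


class CachingProvider:
    """Caches the first version it sees and never invalidates on its own."""

    def __init__(self, store):
        self.store = store
        self.cache = {}

    def get_document(self, key):
        if key not in self.cache:
            self.cache[key] = self.store.get(key)
        return self.cache[key]

    def refresh(self, key):
        # Drop the cached copy so the next read goes back to the store.
        self.cache.pop(key, None)
        return self.get_document(key)


def solr_request_for(doc):
    """Decide which request the updater would send for this record."""
    if doc["type"]["key"] in (DELETE_TYPE, REDIRECT_TYPE):
        return "delete"
    return "add"


def demo():
    key = "/authors/OL1A"  # made-up key
    store = ToyStore()
    store.docs[key] = {"key": key, "type": {"key": "/type/author"}}
    provider = CachingProvider(store)
    provider.get_document(key)  # the live author gets cached

    # The merge turns the duplicate into a redirect in the store only.
    store.docs[key] = {"key": key, "type": {"key": REDIRECT_TYPE}}

    stale = solr_request_for(provider.get_document(key))
    fresh = solr_request_for(provider.refresh(key))
    return stale, fresh
```

In this model `demo()` returns `("add", "delete")`: the cached copy yields an add, and only a re-read after invalidation yields the delete.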

To work around this for scripted deletes, I now delete, then immediately request the author record again and confirm it is a /type/delete. That appears to update the cache used by the Solr updater. I have not noticed this problem for works.
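Roughly, the workaround looks like the sketch below. `delete_record` and `fetch_record` are hypothetical callables standing in for whatever client the script uses (they are not openlibrary-client functions), and the `{"type": {"key": ...}}` record shape is an assumption.

```python
import time

DELETE_TYPE = "/type/delete"


def delete_and_confirm(key, delete_record, fetch_record, retries=3, delay=0.0):
    """Delete a record, then re-request it until it reads back as deleted.

    delete_record(key) and fetch_record(key) are supplied by the caller;
    fetch_record is assumed to return the record as a dict.
    Returns True once the record comes back as /type/delete.
    """
    delete_record(key)
    for _ in range(retries):
        record = fetch_record(key)
        if record.get("type", {}).get("key") == DELETE_TYPE:
            return True
        time.sleep(delay)  # brief pause before asking again
    return False
```

With an in-memory stand-in for the client:

```python
records = {"/authors/OL1A": {"type": {"key": "/type/author"}}}

def fake_delete(key):
    records[key] = {"type": {"key": DELETE_TYPE}}

confirmed = delete_and_confirm("/authors/OL1A", fake_delete, records.get)
```

A `False` return means the record never read back as deleted, so the script should not assume the Solr updater will see the delete.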

hornc commented 6 years ago

Noticed while working on https://github.com/internetarchive/openlibrary-client/issues/74; see the comment there.

tfmorris commented 6 years ago

The caching logic seems weird/wrong to me:

LeadSongDog commented 5 years ago

The same or similar problem is there for redirected works. https://openlibrary.org/search?q=Ed+OL24162W&mode=everything finds and shows just the redirect. https://openlibrary.org/search?q=OL24162W&mode=everything shows just the target, OL15923277W. Both the redirect and the target show in the “what work is this an edition of” dropdown, which is very confusing for users.

LeadSongDog commented 5 years ago

@hornc Here's another ugly one post author-merge: https://openlibrary.org/search/authors?q=Neil+Gaiman&mode=everything All the book counts shown are wrong. Only the first one shown ( https://openlibrary.org/authors/OL53305A ) is reachable. It should show "302 books". The other authors, if shown at all, should display "0 books".

xayhewalo commented 4 years ago

So is this a Solr problem, a Memcache problem, or a combination of the two? Will #2246 affect/fix this?

tfmorris commented 4 years ago

Mainly a Solr problem. Having a fresh index and matching code that was used to create it will help debug this problem if it still exists (I suspect it might), but it won't be solved by #2246 per se.

LeadSongDog commented 3 years ago

@xayhewalo The impact of this bug is much wider than stated above. Merged-from authors still appear in author search results and in the author autocomplete drop-down. Very ugly. Priority 3 doesn’t really do it justice.

cdrini commented 3 years ago

I believe this is also causing multiple edits to a work in a short time frame to be ignored :/ Bumping in priority.