internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0
5.1k stars 1.33k forks source link

Author merge stripping Work-level metadata #89

Closed george08 closed 4 years ago

george08 commented 12 years ago

uh oh - we have a problem...

looks like author merge is stripping work metadata :(

http://openlibrary.org/works/OL1087067W/Catalogue_of_the_collection_of_English_porcelain_in_the_Department_of_British_and_mediaeval_antiquities_and_ethnography_of_the_British_Museum?b=5&a=4&_compare=Compare&m=diff

all the author merges by marycee here: http://ol-bots.us.archive.org/cgi-bin/vandalismcenter.py

http://openlibrary.org/people/marycee

george08 commented 12 years ago

I've left marycee's edits in the vandalism bot, where I discovered this. http://ol-bots.us.archive.org/cgi-bin/vandalismcenter.py?author=/people/marycee

Ideas/comments appreciated. (Particularly concerned about when the bug might have been introduced and how widespread it is? Can we find out what's changed in the merge program/scripts, if anything?)

anandology commented 12 years ago

Looked the query it sent to infobase and found that it sent revision 1 of the work page instead of revision 4.

Investigating...

anandology commented 12 years ago

Same problem with another work in the same author-merge.

http://openlibrary.org/works/OL1087071W/A_guide_to_the_antiquities_of_the_bronze_age_in_the_Department_of_British_and_medi%C3%A6val_antiquities?b=6&a=5&_compare=Compare&m=diff

anandology commented 12 years ago

Wrote a script to identify merges with potential loss of data and found 46 of them in 2011.

I haven't been able to identify the cause of this error, but I can fix all these errors.

anandology commented 12 years ago

The earliest error is on 2011-01-08. I think the bug was there from the beginning, unnoticed till now. we don't have infobase write logs before 2011, otherwise we could have identified more. I think is it still possible to write another script that looks at the diffs to identify such errors happened before 2011.

george08 commented 12 years ago

Sounds good, Anand. Thanks.

anandology commented 12 years ago

Wrote to a script to check if memcache has any stale entries. Looks like memcache is alright.

I'm going to put probes into the merge-authors code to dump useful debug info so that I can analyze it if that happens again.

anandology commented 12 years ago

I'm addong some debug information to the store on author merge. We'll be able to find more info about the next bad merge from http://openlibrary.org/admin/inspect/store?type=merge-authors-debug&name=bad_merge&value=true.

LeadSongDog commented 8 years ago

Odd, the history shows 6 revisions https://openlibrary.org/works/OL1087067W/Catalogue_of_the_collection_of_English_porcelain_in_the_Department_of_British_and_mediaeval_antiquit?a=4&_compare=Compare&b=5&m=history

but the history shown at the foot of https://openlibrary.org/works/OL1087067W/Catalogue_of_the_collection_of_English_porcelain_in_the_Department_of_British_and_mediaeval_antiquit has only five; it omits rev 2 from 15 January 2010.

To me, this suggests an off-by-one indexing error of some kind.

hornc commented 7 years ago

This is still happening :( eg: https://openlibrary.org/works/OL1065650W/Der_menschliche_Gedanke?m=diff&b=3 from merge https://openlibrary.org/recentchanges/2017/05/18/merge-authors/51995688

The admin link above still works: https://openlibrary.org/admin/inspect/store?type=merge-authors-debug&name=bad_merge&value=true

hornc commented 7 years ago

Here is the code that checks for bad merges as they are saved: https://github.com/internetarchive/openlibrary/blob/master/openlibrary/plugins/upstream/merge_authors.py#L138

xayhewalo commented 4 years ago

@hornc Is this still an issue? It looks like there are scripts to spot bad merges. Are there any other changes that can be implemented?

mekarpeles commented 4 years ago

@seabelis , @cdrini is this still a problem? If not, feel free to close.

seabelis commented 4 years ago

No. I've not observed this.