Closed george08 closed 4 years ago
I've left marycee's edits in the vandalism bot, where I discovered this. http://ol-bots.us.archive.org/cgi-bin/vandalismcenter.py?author=/people/marycee
Ideas/comments appreciated. (Particularly concerned about when the bug might have been introduced and how widespread it is? Can we find out what's changed in the merge program/scripts, if anything?)
Looked the query it sent to infobase and found that it sent revision 1 of the work page instead of revision 4.
Investigating...
Same problem with another work in the same author-merge.
Wrote a script to identify merges with potential loss of data and found 46 of them in 2011.
I haven't been able to identify the cause of this error, but I can fix all these errors.
The earliest error is on 2011-01-08. I think the bug was there from the beginning, unnoticed till now. we don't have infobase write logs before 2011, otherwise we could have identified more. I think is it still possible to write another script that looks at the diffs to identify such errors happened before 2011.
Sounds good, Anand. Thanks.
Wrote to a script to check if memcache has any stale entries. Looks like memcache is alright.
I'm going to put probes into the merge-authors code to dump useful debug info so that I can analyze it if that happens again.
I'm addong some debug information to the store on author merge. We'll be able to find more info about the next bad merge from http://openlibrary.org/admin/inspect/store?type=merge-authors-debug&name=bad_merge&value=true.
Odd, the history shows 6 revisions https://openlibrary.org/works/OL1087067W/Catalogue_of_the_collection_of_English_porcelain_in_the_Department_of_British_and_mediaeval_antiquit?a=4&_compare=Compare&b=5&m=history
but the history shown at the foot of https://openlibrary.org/works/OL1087067W/Catalogue_of_the_collection_of_English_porcelain_in_the_Department_of_British_and_mediaeval_antiquit has only five; it omits rev 2 from 15 January 2010.
To me, this suggests an off-by-one indexing error of some kind.
This is still happening :( eg: https://openlibrary.org/works/OL1065650W/Der_menschliche_Gedanke?m=diff&b=3 from merge https://openlibrary.org/recentchanges/2017/05/18/merge-authors/51995688
The admin link above still works: https://openlibrary.org/admin/inspect/store?type=merge-authors-debug&name=bad_merge&value=true
Here is the code that checks for bad merges as they are saved: https://github.com/internetarchive/openlibrary/blob/master/openlibrary/plugins/upstream/merge_authors.py#L138
@hornc Is this still an issue? It looks like there are scripts to spot bad merges. Are there any other changes that can be implemented?
@seabelis , @cdrini is this still a problem? If not, feel free to close.
No. I've not observed this.
uh oh - we have a problem...
looks like author merge is stripping work metadata :(
http://openlibrary.org/works/OL1087067W/Catalogue_of_the_collection_of_English_porcelain_in_the_Department_of_British_and_mediaeval_antiquities_and_ethnography_of_the_British_Museum?b=5&a=4&_compare=Compare&m=diff
all the author merges by marycee here: http://ol-bots.us.archive.org/cgi-bin/vandalismcenter.py
http://openlibrary.org/people/marycee