Open cdrini opened 5 years ago
@seabelis Feel free to add any other examples you find.
Probably a new edition should have been created
Probably?
The MARC record is here (why isn't it properly linked like the Scriblio MARC record?)
I can't see anything in the data that hints at how "Clean up Bot" (doing imports, not cleanups) got quite so far afield.
The source MARC record has the same LCCN as the original record, so they were matched by OL's internal data matching. Just to be clear: OL did the matching, not the bot.
Shows 010$a of 67002735, which the original edition also has: LC Control Number: 67002735
So that's why the match occurred. The MARC record looks to be wrong, and our edition had it correct https://catalog.loc.gov/vwebv/search?searchCode=LCCN&searchArg=67002735&searchType=1&permalink=y
CleanUp bot "repaired" an orphaned edition in this instance, but bad data triggered a mismatch because we do rank library LC control numbers high when it comes to determining matches.
Looking closely at https://openlibrary.org/show-records/marc_openlibraries_sanfranciscopubliclibrary/sfpl_chq_2018_12_24_run02.mrc:57025904:1594 it appears that OL was the knock-on victim of an error (since corrected) in WorldCat.
Strong identifiers should carry a lot of weight, but they shouldn't outweigh everything else in scoring. Matching identifiers plus titles with no words in common is not a match.
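To make the point concrete, here is a minimal sketch of that rule. It is not the actual Open Library matching code: the record dicts with `lccn` and `title` keys and the `is_plausible_match` helper are assumptions made up for illustration.

```python
import re


def title_words(title):
    """Lowercase a title and return its alphanumeric words as a set."""
    return set(re.findall(r"\w+", title.lower()))


def is_plausible_match(rec, existing):
    """Hypothetical check: a shared LCCN only counts as a match
    if the two titles have at least one word in common."""
    lccn = rec.get("lccn")
    if not lccn or lccn != existing.get("lccn"):
        return False
    shared = title_words(rec.get("title", "")) & title_words(existing.get("title", ""))
    return bool(shared)


# The two titles from this issue share an LCCN but no words.
incoming = {"lccn": "67002735", "title": "Tales of horror and suspense"}
existing = {
    "lccn": "67002735",
    "title": "Kolonialismus und Neokolonialismus in Nordafrika und Nahost",
}
rejected = not is_plausible_match(incoming, existing)
```

A real check would probably also want to ignore stop words, so that two unrelated titles sharing only "the" are not let through.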
Where in the code does this matching occur? Where in CleanUp bot is it called from?
CleanUp bot only uses the import API, and has been mostly trying to re-import to fix bad data. I'm endeavouring to use ImportBot as the account id for ongoing imports, but everything I trigger goes through the same API: https://github.com/internetarchive/openlibrary/wiki/Endpoints#import-by-archiveorg-reference
The code that matches on LCCN is here: https://github.com/internetarchive/openlibrary/blob/a53f9018ed388449ba0c998a1880a37f5dafcbe8/openlibrary/catalog/add_book/__init__.py#L367
This code does look a bit light, and I'm beginning to doubt the value of this early_exit(rec) method. It seems to either over-simplify things, or fail to find matches, when there are better and more sophisticated record matching checks later. I don't see any evidence that it provides value in speeding up record matching.
Related or duplicate of #2223?
I'm beginning to doubt the value of this early_exit(rec) method. It seems to either over-simplify things, or fail to find matches, when there are better and more sophisticated record matching checks later. I don't see any evidence that it provides value in speeding up record matching.
I suspect the performance difference is negligible, but even if it isn't, correctness is far, far more important than performance.
LCCN works as a work match, but not so much as an edition match unless the publisher and year are taken into account. The original LCCN is frequently included in later editions. If a publisher releases a book under a different imprint ten years after the original, this should not count as a match.
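A rough sketch of that distinction follows. The `classify_lccn_match` function, its return labels, and the `publisher` / `publish_year` keys are hypothetical names chosen for illustration, not Open Library's schema or matching logic.

```python
def classify_lccn_match(a, b):
    """Hypothetical rule: a shared LCCN suggests the same work;
    treat it as the same edition only if publisher and year also agree."""
    lccn = a.get("lccn")
    if not lccn or lccn != b.get("lccn"):
        return "no_match"

    pub_a = (a.get("publisher") or "").strip().lower()
    pub_b = (b.get("publisher") or "").strip().lower()
    same_publisher = bool(pub_a) and pub_a == pub_b

    year_a, year_b = a.get("publish_year"), b.get("publish_year")
    same_year = year_a is not None and year_a == year_b

    return "edition" if same_publisher and same_year else "work"


# Made-up records: same LCCN, different imprint ten years later.
original = {"lccn": "12345678", "publisher": "Publisher A", "publish_year": 1990}
reissue = {"lccn": "12345678", "publisher": "Imprint B", "publish_year": 2000}
result = classify_lccn_match(original, reissue)
```

With these made-up records the reissue is classified as the same work but not the same edition, which is the behaviour described above. Missing publisher or year data falls back to a work-level match rather than an edition match.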
@seabelis I'm trying to find clear confirmation of what level LCCN applies to, and it's proving harder than I expected. http://www.loc.gov/publish/pcn/about/scope.html is the best I've found so far. I think you are correct though, which has big implications for other LCCN matching work I'm doing atm. I'll need to check how this works in practice: the copy that gets cataloged is supposed to be the "best edition" available (at first publication?), which points to a single edition, but it does sound like the id applies equally to all other physical formats, and presumably to subsequent editions...
I'm going to look for some examples of the same LCCN applied to different editions by different publishers and in different years - please let me know if you know of some.
Looks like I'll need to re-evaluate LCCN matching :( but thanks for bringing this to my attention @seabelis !
I can definitely find some of these for you.
https://openlibrary.org/books/OL20945423M/The_Night_Gardener https://openlibrary.org/books/OL9406282M/The_Night_Gardener
https://openlibrary.org/books/OL18361532M/The_Return_of_The_King https://openlibrary.org/books/OL15350852M/The_Return_of_the_King
https://openlibrary.org/books/OL24208261M/The_Glass_Menagerie https://openlibrary.org/books/OL21849978M/The_Glass_Menagerie
https://openlibrary.org/books/OL24952872M/Quicksilver https://openlibrary.org/books/OL3564771M/Quicksilver
https://openlibrary.org/books/OL10352639M/The_Road https://openlibrary.org/books/OL15610563M/The_Road https://openlibrary.org/books/OL24087493M/The_Road https://openlibrary.org/books/OL26328791M/The_Road (these last two have the added bonus of having different interior and cover ISBNs)
More where these came from if you need them.
A few more for you:
https://openlibrary.org/books/OL24295500M/The_Bluest_Eye https://openlibrary.org/books/OL23261835M/The_Bluest_Eye https://openlibrary.org/books/OL13629280M/The_Bluest_Eye
https://openlibrary.org/books/OL24375625M/The_Bluest_Eye https://openlibrary.org/books/OL4454628M/The_Bluest_Eye
https://openlibrary.org/books/OL26328210M/The_Celestine_Prophecy https://openlibrary.org/books/OL14438914M/The_Celestine_Prophecy
https://openlibrary.org/books/OL24280644M/The_Adventures_of_Huckleberry_Finn https://openlibrary.org/books/OL24374278M/The_Adventures_of_Huckleberry_Finn
https://openlibrary.org/books/OL7351190M/Death_of_a_Salesman https://openlibrary.org/books/OL4905012M/Death_of_a_Salesman
https://openlibrary.org/books/OL19013338M/Death_of_a_Salesman https://openlibrary.org/books/OL24204875M/Death_of_a_salesman https://openlibrary.org/books/OL7355755M/Death_of_a_Salesman https://openlibrary.org/books/OL8136299M/Death_of_a_Salesman
https://openlibrary.org/books/OL18350756M/The_Crucible https://openlibrary.org/books/OL6133866M/The_Crucible https://openlibrary.org/books/OL7640926M/The_Crucible
https://openlibrary.org/books/OL22639887M/The_Lion_the_Witch_and_the_Wardrobe https://openlibrary.org/books/OL1401062M/The_Lion_the_Witch_and_the_Wardrobe https://openlibrary.org/books/OL24212580M/The_Lion_the_Witch_and_the_Wardrobe
https://openlibrary.org/books/OL23270937M/The_Fortress_of_Solitude https://openlibrary.org/books/OL24376321M/The_Fortress_of_Solitude
https://openlibrary.org/books/OL9894159M/Digital_Fortress https://openlibrary.org/books/OL21999841M/Digital_Fortress https://openlibrary.org/books/OL17948330M/Digital_Fortress
https://openlibrary.org/books/OL8065046M/The_Da_Vinci_Code https://openlibrary.org/books/OL3308405M/The_Da_Vinci_Code
I suppose it's worth noting that a new LCCN is sometimes issued, but only after the relevant edition has gone into print. This means some editions have an older LCCN printed on the copyright page rather than the one that actually applies to them specifically.
These would be a match but for ISBN and format: https://openlibrary.org/books/OL4896569M/Adventures_of_Huckleberry_Finn https://openlibrary.org/books/OL7451558M/Adventures_of_Huckleberry_Finn
Relates to #2865 although I'm not sure the PR changes anything with respect to this issue, unfortunately.
Description
Clean Up Bot matched the existing edition of Kolonialismus und Neokolonialismus in Nordafrika und Nahost with Tales of horror and suspense; this seems incorrect.
Relevant url?
Expectation
Probably a new edition should have been created
Stakeholders
@hornc @seabelis