internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

Imports false matching on incorrect LCCNs in source MARC records #2304

Open cdrini opened 5 years ago

cdrini commented 5 years ago

Description

Clean Up Bot matched the existing edition of Kolonialismus und Neokolonialismus in Nordafrika und Nahost with Tales of horror and suspense; this seems incorrect.

Relevant url?

| Work | Edition | Diff | Date of Edit |
| --- | --- | --- | --- |
| OL19686422W | OL5528417M | https://openlibrary.org/books/OL5528417M/-?b=2&a=1&_compare=Compare&m=diff | May 19, 2019 |

Expectation

Probably a new edition should have been created

Stakeholders

@hornc @seabelis

cdrini commented 5 years ago

@seabelis Feel free to add any other examples you find.

tfmorris commented 5 years ago

> Probably a new edition should have been created

Probably?

The MARC record is here (why isn't it properly linked like the Scriblio MARC record?)

I can't see anything in the data that hints at how "Clean up Bot" (doing imports, not cleanups) got quite so far afield.

hornc commented 5 years ago

The source MARC record has the same LCCN as the original record, so they were matched by OL's internal data matching. Just to be clear, OL did the matching, not the bot.

https://openlibrary.org/show-records/marc_openlibraries_sanfranciscopubliclibrary/sfpl_chq_2018_12_24_run02.mrc:57025904:1594

It shows an 010$a of 67002735, which the original edition also has: LC Control Number: 67002735

So that's why the match occurred. The MARC record looks to be wrong, and our edition had it correct: https://catalog.loc.gov/vwebv/search?searchCode=LCCN&searchArg=67002735&searchType=1&permalink=y

CleanUp bot "repaired" an orphaned edition in this instance, but bad data triggered a mismatch because we do rank LC control numbers high when it comes to determining matches.

LeadSongDog commented 5 years ago

Looking closely at https://openlibrary.org/show-records/marc_openlibraries_sanfranciscopubliclibrary/sfpl_chq_2018_12_24_run02.mrc:57025904:1594 it appears that OL was the knock-on victim of an error (since corrected) in WorldCat.

tfmorris commented 5 years ago

Strong identifiers should carry a lot of weight, but they shouldn't outweigh everything else in scoring. Matching identifiers plus titles with no words in common is not a match.
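To illustrate the point, here is a minimal sketch of such a sanity check. It is not the Open Library matching code: the function names, the stopword list, and the dict shape (`title`, `lccn`) are all assumptions made for the example.

```python
import re

# Hypothetical stopword list for illustration only.
STOPWORDS = {"a", "an", "the", "of", "and", "in", "und", "der", "die", "das"}


def title_tokens(title: str) -> set[str]:
    """Lowercase a title and return its significant words."""
    words = re.findall(r"\w+", title.lower())
    return {w for w in words if w not in STOPWORDS}


def plausible_match(rec: dict, existing: dict) -> bool:
    """Accept a shared LCCN only if the titles share at least one word.

    `rec` and `existing` are assumed to be dicts with a `title` string
    and an `lccn` list; these field names are assumptions.
    """
    same_lccn = bool(set(rec.get("lccn", [])) & set(existing.get("lccn", [])))
    shared_words = title_tokens(rec.get("title", "")) & title_tokens(
        existing.get("title", "")
    )
    return same_lccn and bool(shared_words)


# The two titles from this issue share an LCCN but no words.
incoming = {"title": "Tales of horror and suspense", "lccn": ["67002735"]}
existing = {
    "title": "Kolonialismus und Neokolonialismus in Nordafrika und Nahost",
    "lccn": ["67002735"],
}
print(plausible_match(incoming, existing))
```

With a check like this, the pair from this issue would be rejected despite the identical LCCN. A real scorer would presumably weigh more fields than the title.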

Where in the code does this matching occur? Where in CleanUp bot is it called from?

hornc commented 4 years ago

CleanUp bot only uses the import API, and has been mostly trying to re-import to fix bad data. I'm endeavouring to use ImportBot as the account id for ongoing imports, but everything I trigger goes through the same API: https://github.com/internetarchive/openlibrary/wiki/Endpoints#import-by-archiveorg-reference

The code that matches on LCCN is here: https://github.com/internetarchive/openlibrary/blob/a53f9018ed388449ba0c998a1880a37f5dafcbe8/openlibrary/catalog/add_book/__init__.py#L367

This code does look a bit light, and I'm beginning to doubt the value of this early_exit(rec) method. It seems to either over-simplify things, or fail to find matches, when there are better and more sophisticated record matching checks later. I don't see any evidence that it provides value in speeding up record matching.

seabelis commented 4 years ago

Related or duplicate of #2223?

tfmorris commented 4 years ago

> I'm beginning to doubt the value of this early_exit(rec) method. It seems to either over-simplify things, or fail to find matches, when there are better and more sophisticated record matching checks later. I don't see any evidence that it provides value in speeding up record matching.

I suspect the performance difference is negligible, but even if it isn't, correctness is far, far more important than performance.

seabelis commented 4 years ago

LCCN works as a work match, but not so much as an edition match unless the publisher and year are taken into account. The original LCCN is frequently included in later editions. If a publisher releases a book under a different imprint ten years after the original, this should not count as a match.
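A rough sketch of what that could look like, assuming records are plain dicts. The field names `lccn`, `publishers` and `publish_year` and the function names are assumptions for illustration, not the actual import code:

```python
def _norm(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't matter."""
    return " ".join(text.lower().split())


def same_work_by_lccn(rec: dict, existing: dict) -> bool:
    """A shared LCCN alone is treated as evidence of the same work."""
    return bool(set(rec.get("lccn", [])) & set(existing.get("lccn", [])))


def same_edition_by_lccn(rec: dict, existing: dict) -> bool:
    """A shared LCCN counts as an edition match only if a publisher
    and the publication year also agree."""
    if not same_work_by_lccn(rec, existing):
        return False
    pubs_a = {_norm(p) for p in rec.get("publishers", [])}
    pubs_b = {_norm(p) for p in existing.get("publishers", [])}
    year_a = rec.get("publish_year")
    year_b = existing.get("publish_year")
    return bool(pubs_a & pubs_b) and year_a is not None and year_a == year_b


# Made-up records: same LCCN, different imprint, ten years apart.
original = {"lccn": ["00000001"], "publishers": ["Example Press"], "publish_year": 1990}
reissue = {"lccn": ["00000001"], "publishers": ["Example Classics"], "publish_year": 2000}
print(same_work_by_lccn(original, reissue), same_edition_by_lccn(original, reissue))
```

Under these assumptions the reissue attaches to the same work but does not match the original edition.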

hornc commented 4 years ago

@seabelis I'm trying to find clear confirmation of what level LCCN applies to; it's proving harder than I expected. http://www.loc.gov/publish/pcn/about/scope.html is the best I've found so far. I think you are correct though, which has big implications for other LCCN matching work I'm doing at the moment. I'll need to check how this works in practice: the copy that gets cataloged is supposed to be the "best edition" available (at first publication?), which points to a single edition, but it does sound like the id applies equally to all other physical formats, and presumably subsequent editions...

I'm going to look for some examples of the same LCCN applied to different editions by different publishers and in different years - please let me know if you know of some.

Looks like I'll need to re-evaluate LCCN matching :( but thanks for bringing this to my attention @seabelis !

seabelis commented 4 years ago

I can definitely find some of these for you.

seabelis commented 4 years ago

https://openlibrary.org/books/OL20945423M/The_Night_Gardener https://openlibrary.org/books/OL9406282M/The_Night_Gardener

https://openlibrary.org/books/OL18361532M/The_Return_of_The_King https://openlibrary.org/books/OL15350852M/The_Return_of_the_King

https://openlibrary.org/books/OL24208261M/The_Glass_Menagerie https://openlibrary.org/books/OL21849978M/The_Glass_Menagerie

https://openlibrary.org/books/OL24952872M/Quicksilver https://openlibrary.org/books/OL3564771M/Quicksilver

https://openlibrary.org/books/OL10352639M/The_Road https://openlibrary.org/books/OL15610563M/The_Road https://openlibrary.org/books/OL24087493M/The_Road https://openlibrary.org/books/OL26328791M/The_Road (these last two have the added bonus of having different interior and cover ISBNs)

seabelis commented 4 years ago

More where these came from if you need them.

seabelis commented 4 years ago

A few more for you:

https://openlibrary.org/books/OL24295500M/The_Bluest_Eye https://openlibrary.org/books/OL23261835M/The_Bluest_Eye https://openlibrary.org/books/OL13629280M/The_Bluest_Eye

https://openlibrary.org/books/OL24375625M/The_Bluest_Eye https://openlibrary.org/books/OL4454628M/The_Bluest_Eye

https://openlibrary.org/books/OL26328210M/The_Celestine_Prophecy https://openlibrary.org/books/OL14438914M/The_Celestine_Prophecy

https://openlibrary.org/books/OL24280644M/The_Adventures_of_Huckleberry_Finn https://openlibrary.org/books/OL24374278M/The_Adventures_of_Huckleberry_Finn

https://openlibrary.org/books/OL7351190M/Death_of_a_Salesman https://openlibrary.org/books/OL4905012M/Death_of_a_Salesman

https://openlibrary.org/books/OL19013338M/Death_of_a_Salesman https://openlibrary.org/books/OL24204875M/Death_of_a_salesman https://openlibrary.org/books/OL7355755M/Death_of_a_Salesman https://openlibrary.org/books/OL8136299M/Death_of_a_Salesman

https://openlibrary.org/books/OL18350756M/The_Crucible https://openlibrary.org/books/OL6133866M/The_Crucible https://openlibrary.org/books/OL7640926M/The_Crucible

https://openlibrary.org/books/OL22639887M/The_Lion_the_Witch_and_the_Wardrobe https://openlibrary.org/books/OL1401062M/The_Lion_the_Witch_and_the_Wardrobe https://openlibrary.org/books/OL24212580M/The_Lion_the_Witch_and_the_Wardrobe

https://openlibrary.org/books/OL23270937M/The_Fortress_of_Solitude https://openlibrary.org/books/OL24376321M/The_Fortress_of_Solitude

https://openlibrary.org/books/OL9894159M/Digital_Fortress https://openlibrary.org/books/OL21999841M/Digital_Fortress https://openlibrary.org/books/OL17948330M/Digital_Fortress

https://openlibrary.org/books/OL8065046M/The_Da_Vinci_Code https://openlibrary.org/books/OL3308405M/The_Da_Vinci_Code

I suppose it's worth noting that a new LCCN is sometimes issued, but after the relevant edition has gone into print. So this means there are sometimes editions with an older LCCN printed on the copyright page than the one that actually applies to it specifically.

seabelis commented 4 years ago

These would be a match but for ISBN and format: https://openlibrary.org/books/OL4896569M/Adventures_of_Huckleberry_Finn https://openlibrary.org/books/OL7451558M/Adventures_of_Huckleberry_Finn

hornc commented 4 years ago

Relates to #2865 although I'm not sure the PR changes anything with respect to this issue, unfortunately.