Imports from BWB appending bogus data to existing records

seabelis commented 7 months ago

Problem

Recently I've noticed several cases where data imported from BWB was incorrectly appended to existing records. The merged data is then matched with scanned IA items, creating incorrect metadata on the scanned items.

Evidence / Screenshot

Here's one example of many.

Relevant URL(s)

https://openlibrary.org/books/OL49827495M/Hollywood?b=2&a=1&_compare=Compare&m=diff
https://openlibrary.org/works/OL36891182W/Hollywood?_compare=Compare&b=2&a=1&m=diff

Reproducing the bug

Go to ...
Do ...

Expected behavior:
Actual behavior:

Context

Browser (Chrome, Safari, Firefox, etc):
OS (Windows, Mac, etc):
Logged in (Y/N): Y
Environment (prod, dev, local): prod

Notes from this Issue's Lead

Proposal & constraints

Related files

Stakeholders

@scottbarnes

seabelis commented 7 months ago

https://openlibrary.org/books/OL46952972M/Der_Aufstieg?b=3&a=1&_compare=Compare&m=diff
https://openlibrary.org/works/OL34629474W/Der_Aufstieg?b=2&a=1&_compare=Compare&m=diff

seabelis commented 7 months ago

https://openlibrary.org/books/OL46909482M/Mightier_Than_the_Sword?b=3&a=1&_compare=Compare&m=diff
https://openlibrary.org/works/OL34593202W/Mightier_Than_the_Sword?b=3&a=1&_compare=Compare&m=diff

mekarpeles commented 7 months ago

@scottbarnes wonders whether the import logic in our core catalog matches similar titles more eagerly than we want. Next step is to investigate based on @seabelis's examples.

hornc commented 6 months ago

Example: https://openlibrary.org/books/OL49827495M

The initial import (from promise item) added a book with title + subtitle, and ASIN only (no author, publisher, or date to disambiguate). https://openlibrary.org/books/OL49827495M/Hollywood?v=1
The second import, https://openlibrary.org/books/OL49827495M/Hollywood?_compare=Compare&b=2&a=1&m=diff, matched on title, and there were no other specific indicators on the record to show a mismatch.

Matching on title alone, ignoring the subtitle, is a deliberate feature because subtitles are sometimes not catalogued consistently. However, this strategy generally assumes there will be an author or some other field to prevent false matches.
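To make the failure mode concrete, here is a minimal sketch of subtitle-optional title matching. It is not the actual Open Library matching code; `normalize` and `titles_match` are hypothetical names, and the subtitle, author, and date in the sample records are placeholders.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def titles_match(rec_a: dict, rec_b: dict) -> bool:
    """Hypothetical matcher: the subtitle is optional on either side."""
    def full(rec: dict) -> str:
        return normalize(f"{rec.get('title', '')} {rec.get('subtitle', '')}")

    def short(rec: dict) -> str:
        return normalize(rec.get("title", ""))

    if full(rec_a) and full(rec_a) == full(rec_b):
        return True
    # Fall back to the main title only, ignoring subtitles.
    return bool(short(rec_a)) and short(rec_a) == short(rec_b)


# A promise-item style record: title + subtitle only (placeholder subtitle).
existing = {"title": "Hollywood", "subtitle": "A Placeholder Subtitle"}
# A later import for a different book that shares the main title.
incoming = {"title": "Hollywood", "authors": ["Placeholder Author"], "publish_date": "1999"}

print(titles_match(existing, incoming))
```

With nothing but a title on the existing record, the short-title fallback is the only signal available, so two unrelated books that share a main title are treated as the same edition.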

Example: https://openlibrary.org/books/OL46952972M

The initial import (again from a promise item) had title + author, but no date: https://openlibrary.org/books/OL46952972M/Der_Aufstieg?v=1
The MARC record had title, date, and publisher, but no author: https://openlibrary.org/show-records/harvard_bibliographic_metadata/ab.bib.00.20150123.full.mrc:714234488:1036

Example: https://openlibrary.org/books/OL46909482M

Initial import (another promise item) title + subtitle + ASIN only. https://openlibrary.org/books/OL46909482M/Mightier_Than_the_Sword?v=1
Subsequent import matched on title, and filled in other data. https://openlibrary.org/books/OL46909482M/Mightier_Than_the_Sword?_compare=Compare&b=3&a=2&m=diff

Initial thoughts

Generally we already prefer that book imports have at least title, author, and date, perhaps a publisher to help disambiguate. An edition record without a date isn't really specific enough to be disambiguated at all.

This problem is occurring because there are records that are effectively only title on one side of the match.

The MARC record example was an interesting case of no author in a MARC record, but it did have a date, so a date on the other side would have helped disambiguate.

I think the root cause of these problems is what Feature/make affiliate server look up non isbn 10 asins / #8903 is trying to solve or at least mitigate.

Having as much disambiguating metadata as possible in the original OL record will reduce these kinds of false matches. An unexpanded ASIN on the OL record isn't much use, as it will never match a MARC record and is unlikely even to match another ASIN on import. ISBNs, authors, and dates need to be extracted from an ASIN if that's the main concrete identifier we have.

In practice, as these examples indicate, the date seems to be the single most important metadata field for preventing these issues, among the fields that can be expected to exist. (An author would also help, but there seem to be legitimate cases where one is not available.)
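A rough sketch of what a date-aware guard on title matches could look like. This is only an illustration under assumptions, not the existing import logic: `publish_year` and `can_confirm_match` are hypothetical helpers, and the records and year are placeholders.

```python
import re
from typing import Optional


def publish_year(rec: dict) -> Optional[int]:
    """Pull a four-digit year out of a free-text publish_date, if any."""
    m = re.search(r"\b(1[5-9]\d\d|20\d\d)\b", rec.get("publish_date", "") or "")
    return int(m.group(1)) if m else None


def can_confirm_match(existing: dict, incoming: dict) -> bool:
    """Hypothetical guard applied after a title match.

    If both sides have a year, the years must agree. Otherwise fall back
    to author overlap. With neither signal, refuse the match.
    """
    year_a, year_b = publish_year(existing), publish_year(incoming)
    if year_a is not None and year_b is not None:
        return year_a == year_b
    authors_a = {a.lower() for a in existing.get("authors", [])}
    authors_b = {a.lower() for a in incoming.get("authors", [])}
    return bool(authors_a & authors_b)


# Shape of the second example: author but no date on one side,
# date but no author on the other (placeholder values).
promise_item = {"title": "Der Aufstieg", "authors": ["Placeholder Author"]}
marc_record = {"title": "Der Aufstieg", "publish_date": "1900"}

print(can_confirm_match(promise_item, marc_record))
```

Under this guard, a record with no date and no shared author cannot be confirmed as a match, so it would be imported as a separate edition rather than merged.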

hornc commented 4 months ago

The reported example:

https://openlibrary.org/books/OL49827495M/Hollywood?b=2&a=1&_compare=Compare&m=diff
https://openlibrary.org/works/OL36891182W/Hollywood?_compare=Compare&b=2&a=1&m=diff

Is a case where a promise item book was imported with only a title + subtitle and ASIN. No date, no publisher etc.

It is very difficult for any process to recognize what this edition is supposed to be.

The UI would have (or at least should have) prevented a user from adding this record in the first place.

The second example:

https://openlibrary.org/books/OL46952972M/Der_Aufstieg?b=3&a=1&_compare=Compare&m=diff
https://openlibrary.org/works/OL34629474W/Der_Aufstieg?b=2&a=1&_compare=Compare&m=diff

Is also a promise item import without a date or other distinguishing metadata, which was later falsely matched to a MARC import.

Third example:

https://openlibrary.org/books/OL46909482M/Mightier_Than_the_Sword?b=3&a=1&_compare=Compare&m=diff
https://openlibrary.org/works/OL34593202W/Mightier_Than_the_Sword?b=3&a=1&_compare=Compare&m=diff

Also a promise item with very light metadata and no usable date, which was mismatched by a later import with more metadata.

The distinguishing feature of the original records is that they barely identify what book they are.

They do have BWB barcodes, but a barcode can only link the one copy that carries it to the title (and since there is no other metadata, it is effectively just a title lookup).

Some of the examples have subtitles, which seem to be the only way to tell there was a mismatch. Subtitles are deliberately optional in the title matching code since subtitles are not always cataloged consistently. The system relies on having other, more reliable metadata to make a match.

Per my previous comment, #8903 should mitigate this issue, so closing for now. The root cause is importing ambiguous and not very useful metadata records in the first place. If this problem continues, the fix should be to prevent title-only imports completely, and to remove existing records if they cannot be matched with disambiguating metadata.
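If it came to that, a pre-import check along these lines could reject title-only records. This is a sketch only: `DISAMBIGUATING_FIELDS` and `has_disambiguating_metadata` are hypothetical, the field names are assumptions about the import record shape, and the ASIN is a placeholder. A bare ASIN is deliberately not counted, per the discussion above.

```python
# Fields assumed (for illustration) to help tell editions apart.
DISAMBIGUATING_FIELDS = ("publish_date", "authors", "publishers", "isbn_10", "isbn_13")


def has_disambiguating_metadata(rec: dict) -> bool:
    """Return True if the record has a title plus at least one useful field.

    An ASIN or barcode alone does not count, since it cannot match
    a MARC record or most other import sources.
    """
    if not rec.get("title"):
        return False
    return any(rec.get(field) for field in DISAMBIGUATING_FIELDS)


title_only = {
    "title": "Hollywood",
    "subtitle": "A Placeholder Subtitle",
    "identifiers": {"amazon": ["B000000000"]},  # placeholder ASIN
}
with_date = {"title": "Hollywood", "publish_date": "1999"}

for rec in (title_only, with_date):
    status = "import" if has_disambiguating_metadata(rec) else "reject"
    print(rec["title"], "->", status)
```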

Deleting bad dates (as proposed in #9430) could leave more of these ambiguous records.

internetarchive / openlibrary