internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

Stop creating bad author records from BWB import #7756

Open tfmorris opened 1 year ago

tfmorris commented 1 year ago

In the last 12 months, there has been a huge increase in the creation of corrupted author records, almost always from BWB "promise" or "pallet" imports. The bad records take a variety of forms: multiple author names in a single record, roles embedded in the name (e.g. editor, translator), names in all lower case, etc. Also, since BWB author data apparently never includes dates, strong identifiers, or any other type of disambiguating information, a large number of duplicate author records are being created.
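To make the failure modes above concrete, here is a minimal sketch of heuristic checks that could flag such names before an author record is created. It is not the Open Library importer's code: the function name `author_name_problems`, the `ROLE_WORDS` list, and the thresholds are all assumptions chosen for illustration. The first two test strings are the corrupted names reported in this issue.

```python
import re

# Hypothetical list of role words; a real list would need curation.
ROLE_WORDS = {"editor", "translator", "illustrator"}


def author_name_problems(name: str) -> list[str]:
    """Return heuristic warnings for a raw author name string."""
    problems = []
    text = name.strip()
    words = set(re.findall(r"[a-z]+", text.lower()))

    # A semicolon, an ampersand, or more than one comma suggests that
    # several people were packed into a single name field.
    if ";" in text or "&" in text or text.count(",") > 1:
        problems.append("multiple names")

    # Role words belong in a contributor-role field, not in the name.
    if words & ROLE_WORDS:
        problems.append("role in name")

    # A name with letters but no upper-case characters at all.
    if any(c.isalpha() for c in text) and text == text.lower():
        problems.append("all lower case")

    return problems


print(author_name_problems("Thaddeus Eddy Samuel; Surber"))
print(author_name_problems("Eddy, Samuel Surber, Thaddeus,"))
```

These are heuristics only. A legitimate name such as "Smith, John, Jr." would trip the comma check, so flagged names are better routed to review than rejected outright.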

Below are two different corrupted forms created for the same pair of authors, both of whom already have existing records in OpenLibrary. Not only do the author records exist, but this exact edition was already cataloged and scanned 15 years ago; the new imports cannot be matched to it because of the metadata corruption.

Evidence / Screenshot (if possible)

"Thaddeus Eddy Samuel; Surber" (screenshot: Screen Shot 2023-04-03 at 6 58 15 PM)

"Eddy, Samuel Surber, Thaddeus," (screenshot: Screen Shot 2023-04-03 at 6 58 26 PM)

Relevant url?

https://openlibrary.org/books/OL45868829M
https://openlibrary.org/books/OL45991226M

Proposal & Constraints

The BWB importer should be banned from creating new author records until it can do so with a quality on par with those created from MARC records.
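One way to read this proposal is as a gate in author resolution: an untrusted source may match existing author records but never create new ones. The sketch below illustrates that policy under stated assumptions. `resolve_author`, `UNTRUSTED_SOURCES`, the source labels, and the dictionary index are hypothetical and do not reflect Open Library's actual import code.

```python
# Hypothetical labels for sources not allowed to create author records.
UNTRUSTED_SOURCES = {"bwb"}


def normalize(name: str) -> str:
    """Lower-case and collapse whitespace for a naive exact-match lookup."""
    return " ".join(name.lower().split())


def resolve_author(name: str, source: str, existing: dict[str, str]):
    """Decide what to do with an incoming author name.

    `existing` maps normalized names to author keys (an assumed index).
    Returns a tuple of (action, author_key_or_None).
    """
    key = normalize(name)
    if key in existing:
        # Any source may link to an author record that already exists.
        return ("matched", existing[key])
    if source in UNTRUSTED_SOURCES:
        # Untrusted sources never create authors; queue for review instead.
        return ("needs_review", None)
    return ("create", None)


index = {"mark twain": "author-1"}  # illustrative key, not a real record
print(resolve_author("Mark  Twain", "bwb", index))
print(resolve_author("Unknown Person", "bwb", index))
print(resolve_author("Unknown Person", "marc", index))
```

Exact matching on a normalized name is deliberately naive here. It would not recover the corrupted strings shown in this issue, which is why unmatched names from an untrusted source go to review rather than becoming new records.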

New metadata sources should undergo a quality audit before being integrated into the production system.

Stakeholders

@mekarpeles

LeadSongDog commented 1 year ago

@jimchamp Why is this priority 3? These BWB and AMZ imports are clearly a disaster by any reckoning. They should be paused NOW.