internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

Don't import MARC 250$6 as part of edition name #7617

Closed tfmorris closed 1 year ago

tfmorris commented 1 year ago

The MARC subfield 250$6 is a linkage field which should not be included as part of the edition name. This is causing a large number of non-English edition names to be corrupted with strings like 880-01.

https://openlibrary.org/books/OL27062719M
https://openlibrary.org/show-records/ia:isbn_9787508617725

Similarly, the TOC subfield 505$6 is polluting tables of contents with the same kind of linkage text.

https://openlibrary.org/books/OL17217449M/Zhizn%CA%B9_%C4%97to_teatr
https://openlibrary.org/show-records/marc_miami_univ_ohio/allbibs0193.out:11791288:963

Proposal & Constraints

I don't know if it's a firm convention, but it seems that subfields which are not intended to be part of the rendered text are numeric while those that are intended to be included are alphabetic, so using get_lower_subfields() might be an appropriate approach.
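To make the idea concrete, here is a rough sketch of the filtering I have in mind. It is standalone and does not use the real Open Library MARC classes: the `(code, value)` tuple representation and the names `alphabetic_subfields` and `edition_name` are assumptions for illustration only, and the actual get_lower_subfields() may behave differently.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical representation of one MARC subfield: (code, value).
Subfield = Tuple[str, str]


def alphabetic_subfields(subfields: Iterable[Subfield]) -> Iterator[str]:
    """Yield values only for subfields whose code is a lowercase ASCII letter.

    Numeric codes such as $6 (linkage) are skipped, on the assumption that
    they are control data rather than text meant for display.
    """
    for code, value in subfields:
        if len(code) == 1 and "a" <= code <= "z":
            yield value


def edition_name(subfields: Iterable[Subfield]) -> str:
    """Join the displayable 250 subfields into a single edition name."""
    parts = (value.strip() for value in alphabetic_subfields(subfields))
    return " ".join(part for part in parts if part)


# Illustrative 250 field: $6 carries the linkage, $a the edition statement.
field_250 = [("6", "880-01"), ("a", "Di 1 ban.")]
print(edition_name(field_250))
```

The same filter would apply to 505, since the problem there is also the $6 subfield being rendered.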

There are also over 840K existing edition records whose edition names have already been corrupted this way and need to be cleaned up.
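For the cleanup, something along these lines might be a starting point. This is only a sketch: the pattern assumes the junk always appears as a leading $6 linkage value (three-digit tag, hyphen, two-digit occurrence number, optionally followed by a `/`-separated script code), which I have not verified against the affected records. `strip_linkage_prefix` is a hypothetical helper, not existing Open Library code.

```python
import re

# Assumed shape of a leaked $6 value at the start of an edition name,
# e.g. "880-01" or "880-01/$1", followed by whitespace or end of string.
LINKAGE_PREFIX = re.compile(r"^\d{3}-\d{2}(?:/\S+)?(?:\s+|$)")


def strip_linkage_prefix(name: str) -> str:
    """Remove a leading linkage value from an edition name, if present."""
    return LINKAGE_PREFIX.sub("", name, count=1)


print(strip_linkage_prefix("880-01 Di 1 ban."))
print(strip_linkage_prefix("2nd ed."))
```

Names that do not start with the assumed pattern are returned unchanged, so a dry run that only reports the records that would change should be cheap to do before writing anything back.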

Stakeholders

@hornc

tfmorris commented 1 year ago

@mekarpeles @hornc Can I assume that one of you will split out the data cleanup task into a separate issue since this one has been closed? There are close to a million editions which need to be fixed.