internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

Bad import of authors from IA item #7794

Closed tfmorris closed 1 year ago

tfmorris commented 1 year ago

The author data for this work got split into a whole bunch of nonsense authors (only the first is correct):

It's unclear if the import used the string from Internet Archive ("Lévis, François Gaston, duc de, 1720-1787; Lecestre, Léon, 1861-; Casgrain, H. R. (Henri Raymond), 1831-1904, ed; Québec (Province)") or this MARC data or something else:

100 1  $aLévis, François Gaston,$cduc de,$d1720-1787.
700 1  $aLecestre, Léon,$d1861-
700 1  $aCasgrain, H. R.$q(Henri Raymond),$d1831-1904,$eed.
710 1  $aQuébec (Province)
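To make the contrast between the two candidate sources concrete, here is a minimal sketch in Python. It is not Open Library's import code, and it makes no claim about what the 2008 import actually did. The helper names (`parse_field`, `personal_name`, `split_contributors`) are hypothetical, and treating the 100 field as the author and the 700 fields as contributors is an assumption made for illustration. It shows that splitting the archive.org string on delimiters yields fragments such as "duc de", while reading MARC subfields keeps each name intact.

```python
import re

# Illustration only: hypothetical helpers, not Open Library's importer.
IA_STRING = ("Lévis, François Gaston, duc de, 1720-1787; Lecestre, Léon, 1861-; "
             "Casgrain, H. R. (Henri Raymond), 1831-1904, ed; Québec (Province)")

MARC_LINES = [
    "100 1  $aLévis, François Gaston,$cduc de,$d1720-1787.",
    "700 1  $aLecestre, Léon,$d1861-",
    "700 1  $aCasgrain, H. R.$q(Henri Raymond),$d1831-1904,$eed.",
    "710 1  $aQuébec (Province)",
]


def parse_field(line):
    """Split one display-format MARC line into (tag, {subfield code: value})."""
    tag = line[:3]
    subfields = {}
    for code, value in re.findall(r"\$(\w)([^$]*)", line):
        cleaned = value.strip(" ,")
        if code == "d":
            # Drop the terminal full stop that follows a date range.
            cleaned = cleaned.rstrip(".")
        subfields[code] = cleaned
    return tag, subfields


def personal_name(subfields):
    """Build a name record from $a (name), $c (title) and $d (dates)."""
    name = subfields["a"]
    if "c" in subfields:
        name += " " + subfields["c"]
    record = {"name": name}
    if "d" in subfields:
        record["dates"] = subfields["d"]
    return record


def split_contributors(lines):
    """Assumption: 100 is the author, 700s are contributors, 710 is skipped."""
    author, contributors = None, []
    for line in lines:
        tag, subfields = parse_field(line)
        if tag == "100":
            author = personal_name(subfields)
        elif tag == "700":
            contributors.append(personal_name(subfields))
    return author, contributors


# Naive delimiter splitting turns titles and dates into separate "names".
fragments = re.split(r"[;,]\s*", IA_STRING)

author, contributors = split_contributors(MARC_LINES)
print(fragments)
print(author)
print(contributors)
```

Because the subfield codes already distinguish name, title, and dates, the MARC route never has to guess which commas separate people.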

Evidence / Screenshot (if possible)

Relevant url?

https://openlibrary.org/works/OL19605264W?v=1

Proposal

Only valid authors should be created.

Stakeholders

@hornc

LeadSongDog commented 1 year ago

The source https://archive.org/details/collectiondesma00progoog identifies both an OCLC number and an LCCN, yet neither was imported.

Is this systemic misbehaviour or an oddity?

hornc commented 1 year ago

This record was imported in 2008, which makes it one of the earliest imports, before standard processes were developed. The MARC import code has come a long way since then, and is integrated into the OL codebase. In 2008, imports happened on a more ad hoc basis (i.e. I'm not sure whether that import used the MARC record or just the archive.org metadata, as noted in the original report by @tfmorris).

A fresh new import of the item (to a clean local version of OL) gives one author: "Lévis, François Gaston duc de" and correctly imports the OCLC number and LCCN.

image

The fresh import also picks up table of contents and a fuller description:

image

I re-imported the item to the live site: https://openlibrary.org/books/OL20610518M/Collection_des_manuscrits_du_mar%C3%A9chal_de_L%C3%A9vis?_compare=Compare&b=7&a=6&m=diff and it only picked up the identifiers and LC classification. In other words, an initial MARC import contains substantially more metadata than a re-import over an existing poor record.

The re-import also picked up an invalid number_of_pages = 12, incorrectly inferred from the pagination. The fresh import records the pagination correctly as "12 v." (a volume count, not a page count).
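The pagination problem can be sketched as follows. This is a hedged illustration, not the logic Open Library uses: `infer_number_of_pages` is a hypothetical helper, and the rule it applies (only accept a number that is explicitly followed by "p" or "pages") is an assumption about one reasonable way to avoid reading a volume count as a page count.

```python
import re

# Hypothetical rule: a number counts as pages only if "p" or "pages" follows it.
PAGE_COUNT = re.compile(r"(\d+)\s*p(?:ages?)?\b", re.IGNORECASE)


def infer_number_of_pages(pagination):
    """Return the largest explicit page count, or None if none is stated."""
    counts = [int(n) for n in PAGE_COUNT.findall(pagination)]
    return max(counts) if counts else None


print(infer_number_of_pages("12 v."))        # no page count stated
print(infer_number_of_pages("xii, 345 p."))  # explicit page count
```

With this rule, "12 v." yields no number_of_pages at all, which is preferable to storing 12.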

hornc commented 1 year ago

Closing as fixed since the specific record has been corrected, and the process that caused this is obsolete. https://openlibrary.org/works/OL19605264W/Collection_des_manuscrits_du_mar%C3%A9chal_de_L%C3%A9vis?v=5 Current imports won't have this problem.