internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

Alternate script fields (880) not extracted from MARC imports #7264

Closed hornc closed 1 year ago

hornc commented 1 year ago

Describe the problem that you'd like solved

Example MARC record with a publisher in Hebrew (only) in an 880 field: https://openlibrary.org/show-records/harvard_bibliographic_metadata/ab.bib.13.20150123.full.mrc:49430:858

880    $6260-00$aאור יהודה :$bכנרת,$c2011.

Resulting 'publisher unknown' import record: https://openlibrary.org/books/OL43786432M/Zeh_gadol

I think the publication year 2011 (correct) was taken from the leader.

From the linked OCLC record the publisher is listed as: כנרת , which is in 880$b

OL has also missed the publication place, which is present in English in 752$d Or Yehudah (a city in Israel)

From https://www.loc.gov/marc/bibliographic/bd880.html 880 fields can exist without a corresponding linkage field:

When an associated field does not exist in the record, field 880 is constructed as if it did and a reserved occurrence number (00) is used to indicate the special situation.
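
A minimal sketch of how the `$6` linkage value could be split to detect the reserved `00` occurrence number. The `Linkage` tuple and the `parse_linkage` / `is_unlinked` helpers are hypothetical names for illustration only, not existing Open Library code, and the pattern assumes the `tag-occurrence/script/orientation` layout described in the MARC linkage documentation:

```python
import re
from typing import NamedTuple, Optional


class Linkage(NamedTuple):
    """Parsed contents of a MARC $6 linkage subfield (hypothetical helper type)."""
    tag: str                    # linking tag, e.g. "260"
    occurrence: str             # two-digit occurrence number, "00" means no linked field
    script: Optional[str]       # script identification code, e.g. "$1" for CJK
    orientation: Optional[str]  # field orientation code, e.g. "r" for right-to-left


# tag-occurrence, optionally followed by /script and /orientation
LINKAGE_RE = re.compile(r"^(\d{3})-(\d{2})(?:/([^/]+))?(?:/([^/]+))?$")


def parse_linkage(value: str) -> Linkage:
    """Split a $6 value such as '260-00' or '100-01/$1' into its parts."""
    match = LINKAGE_RE.match(value.strip())
    if match is None:
        raise ValueError(f"Unrecognised $6 linkage value: {value!r}")
    return Linkage(*match.groups())


def is_unlinked(linkage: Linkage) -> bool:
    """True when the 880 field has no associated regular field in the record."""
    return linkage.occurrence == "00"


hebrew_publisher = parse_linkage("260-00")    # from the Hebrew example above
print(hebrew_publisher.tag, is_unlinked(hebrew_publisher))
```

With this split, an importer could treat an unlinked `880 $6260-00` as if it were a regular 260 field, which is what the Hebrew publisher example needs.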

It looks like OL does not recognise these at all. There may be other issues with 880 alternate script handling, and there may be existing open MARC-related issues on the same theme.

tfmorris commented 1 year ago

I was actually just about to file a bug report on the 880 import handling. Let me know if it's close enough to cover here or deserves its own issue.

I was doing a study of language identification algorithms using OpenLibrary editions with titles and language metadata, and found that one of them had particular difficulty with transliterated titles (because it hadn't been trained on transliterated text).

In some examples the MARC record contains the native script, but only the transliterated version is imported (which was the primary cataloging because it was done in the US). It would be valuable to import as much information as possible in the original language/script, especially title and author.

Transliterated fields:

100 1  $6880-01$aPeng, Shiran,$eauthor.
245 10 $6880-02$aYong wo yi sheng, fu ni hua yang nian hua :$bZhou Xuan zhuan /$cPeng Shiran zhu.
246 30 $6880-03$aZhou Xuan zhuan
250    $6880-04$aDi 1 ban.
260    $6880-05$aNanjing shi :$bFeng huang chu ban chuan mei gu fen you xian gong si :$bJiangsu feng huang wen yi chu ban she,$c2015.

Native script fields:

880 1  $6100-01/$1$a彭轼然,$eauthor.
880 10 $6245-02/$1$a用我一生, 赴你花样年华 :$b周璇传 /$c彭轼然 著.
880 30 $6246-03/$1$a周璇传
880    $6250-04/$1$a第1版.
880    $6260-05/$1$a南京市 :$b凤凰出版传媒股份有限公司 :$b江苏凤凰文艺出版社,$c2015.
880 1  $6490-06/$1$a民国. 沉香女人系列
880 14 $6600-07/$1$a周璇,$d1918-1957.
880  0 $6830-08/$1$a民国.$p沉香女人系列.
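
The pairing between the transliterated fields and their 880 counterparts can be sketched roughly as below. This is only an illustration under stated assumptions: fields are modelled as plain `(tag, [(code, value), ...])` tuples rather than any real MARC library objects, `first_subfield` and `pair_alternate_script` are hypothetical names, and the sample mixes one field pair from this record with the unlinked Hebrew 260 from the original report.

```python
import re

# Leading "tag-occurrence" part of a $6 value; script/orientation are ignored here.
LINK_RE = re.compile(r"^(\d{3})-(\d{2})")


def first_subfield(subfields, code):
    """Return the first value for a subfield code, or None if absent."""
    return next((value for c, value in subfields if c == code), None)


def pair_alternate_script(fields):
    """Match 880 fields to the regular fields they link to.

    Returns (paired, standalone):
      paired     -> list of (tag, regular_subfields, alternate_subfields)
      standalone -> list of (tag, alternate_subfields) for 880s with occurrence
                    "00" or with no matching regular field in the record
    """
    # Index regular fields by (own tag, occurrence number from their $6880-NN).
    regular = {}
    for tag, subfields in fields:
        match = LINK_RE.match(first_subfield(subfields, "6") or "")
        if tag != "880" and match and match.group(1) == "880":
            regular[(tag, match.group(2))] = subfields

    paired, standalone = [], []
    for tag, subfields in fields:
        if tag != "880":
            continue
        match = LINK_RE.match(first_subfield(subfields, "6") or "")
        if match is None:
            continue  # 880 without a usable $6: skipped in this sketch
        key = (match.group(1), match.group(2))
        if match.group(2) != "00" and key in regular:
            paired.append((match.group(1), regular[key], subfields))
        else:
            standalone.append((match.group(1), subfields))
    return paired, standalone


sample = [
    ("246", [("6", "880-03"), ("a", "Zhou Xuan zhuan")]),
    ("880", [("6", "246-03/$1"), ("a", "周璇传")]),
    ("880", [("6", "260-00"), ("a", "אור יהודה :"), ("b", "כנרת,"), ("c", "2011.")]),
]
paired, standalone = pair_alternate_script(sample)
```

Here `paired` holds the 246 title alongside its native script form, while `standalone` holds the Hebrew 260, which an importer could then use as the only source for the publisher.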

This affects not only ideographic languages, but also Greek, Russian, Arabic, etc.

hornc commented 1 year ago

Another recent example, with Japanese script: https://openlibrary.org/books/OL45552032M/Nihon_no_chasho where only the transliterated title was imported.

I should be able to take a look at improving this in the next week.