internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

Importer doesn't handle MARC 245$n volume/part subfield #701

Closed by tfmorris 1 year ago

tfmorris commented 6 years ago

These three imported Talis MARC records all end up with the same title, without any of the disambiguating information that the original record contains:

245 10 $aGuide to open learning materials.$nBooklet no. 3 & 4,$pAccountancy and book keeping. - MARC, OpenLibrary
245 10 $aGuide to open learning materials.$nBooklet no. 7,$pMotor vehicle engineering. - MARC, OpenLibrary
245 10 $aGuide to open learning materials.$nBooklet no. 13,$pChemistry. - MARC, OpenLibrary

We should figure out a way to include this information in the OpenLibrary record so that it can be used, even if it just means appending it to the title.

mekarpeles commented 6 years ago

cc: @hornc re: MARC ImportBot

hornc commented 5 years ago

The relevant code is here: https://github.com/internetarchive/openlibrary/blob/59f190c2017ae5c9b1ecc53f01c3ecbe304212e4/openlibrary/catalog/marc/parse.py#L175

Looks like $p "Name of part/section of a work" is considered, but not $n "Number of part/section of a work"

There do not appear to be any current test MARC records with a 245$n to test against.

MARC title subfields reference: https://www.loc.gov/marc/bibliographic/bd245.html

tfmorris commented 5 years ago

Here's another example: https://openlibrary.org/books/OL26605397M

All 400+ volumes of the National Union Catalog are missing the most obvious distinguishing feature, the volume number.

This should be straightforward for a first timer who wants to focus on backend / import stuff to address. A failing test needs to be written and then the 245$p code pattern can be repeated for 245$n.

mekarpeles commented 5 years ago

@hornc Is this actually a Good First Issue??

tfmorris commented 5 years ago

I've updated @hornc's code reference to a permalink. The relevant MARC documentation is:

$a - Title
$b - Remainder of title
[...]
$n - Number of part/section of a work (R)
$p - Name of part/section of a work (R)

which is formatted for OpenLibrary as $a[ : $b][ : $p]. I propose that we extend this to $a[ : $b][ : $n][ : $p] since MARC subfields tend to be named in the logical order of presentation.
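To make the proposed format concrete, here is a minimal standalone sketch of the `$a[ : $b][ : $n][ : $p]` joining rule applied to the three Talis records from the issue description. It is not the code in `parse.py`: the `format_title` function, the `(code, value)` tuple representation of subfields, and the trailing-punctuation stripping are all assumptions made for illustration.

```python
# Hypothetical sketch, not the actual openlibrary importer code.
# Subfields are modelled as (code, value) tuples in record order.

TITLE_SUBFIELDS = ("a", "b", "n", "p")  # proposed order: $a[ : $b][ : $n][ : $p]


def format_title(subfields):
    """Join the 245 title subfields with ' : ' in the proposed order."""
    parts = []
    for code in TITLE_SUBFIELDS:
        for field_code, value in subfields:
            if field_code != code:
                continue
            # Assumption: drop trailing ISBD-style punctuation and whitespace.
            cleaned = value.strip().rstrip(" .,:;/")
            if cleaned:
                parts.append(cleaned)
    return " : ".join(parts)


# The three 245 fields quoted in the issue description.
records = [
    [("a", "Guide to open learning materials."),
     ("n", "Booklet no. 3 & 4,"),
     ("p", "Accountancy and book keeping.")],
    [("a", "Guide to open learning materials."),
     ("n", "Booklet no. 7,"),
     ("p", "Motor vehicle engineering.")],
    [("a", "Guide to open learning materials."),
     ("n", "Booklet no. 13,"),
     ("p", "Chemistry.")],
]

titles = [format_title(record) for record in records]
for title in titles:
    print(title)
```

With `$n` included, each of the three records gets a distinct title, for example `Guide to open learning materials : Booklet no. 7 : Motor vehicle engineering`. Because `$n` and `$p` are repeatable, this sketch groups all `$n` values before all `$p` values; a record that interleaves them might be better served by keeping record order instead.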

The MARC records in the original issue description (e.g. https://openlibrary.org/show-records/talis_openlibrary_contribution/talis-openlibrary-contribution.mrc:3026825531:453) can be used to create the necessary unit test.

If additional specification is needed to make this easy enough for a first timer, let's add it here.

mekarpeles commented 4 years ago

I'm going to go out on a limb and say this is probably not a good first issue.

tfmorris commented 4 years ago

What additional detail would you like to see to make it a good first issue?

tfmorris commented 1 year ago

Kind of silly that there's 5 more years of bad data when the fix is (was) so trivial, but I've put up a PR for review which resolves the issue.