internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

MARC records listed as source records not being used (or used fully?) #9831

Open tfmorris opened 2 weeks ago

tfmorris commented 2 weeks ago

Problem

When investigating edition records with no publishers for #2119, I noticed cases where source_records lists MARC records that contain a publisher in the MARC 260 field, but the publisher is not being added to the edition record.

This edition (https://openlibrary.org/books/OL10298M?m=history) was imported from a rather threadbare Scriblio MARC record (https://openlibrary.org/show-records/marc_records_scriblio_net/part29.dat:6096809:617). A later import claims to have used a much richer Columbia MARC record (https://openlibrary.org/books/OL10298M/Tiricirapuram_maka%CC%84vittuva%CC%84n%CC%B2_..._Amparp_pura%CC%84n%CC%A3am?b=5&a=4&_compare=Comparer&m=diff), yet it didn't pull in the publisher from there.

Additionally, the original record elided words from the title, so a search on the full title returns zero hits, but I'm not sure there's a good way to detect and correct for that case.

The second example, https://openlibrary.org/books/OL12026877M, also can't be found by title, in this case because it was imported from a threadbare (and incorrect) Amazon record with a typo in the title. Despite "importing" from four higher quality MARC records, all containing the correct title and a fully populated MARC 260 publisher field, neither the missing publisher field nor the incorrect title was updated.

Before trying to guess publishers based on ISBN, the high quality metadata that's already available should be fully exploited.

tfmorris commented 2 weeks ago

Should have included @hornc in the stakeholders - updated.

scottbarnes commented 2 days ago

I think the publishers field would be supplemented now, but if not, that seems like a straightforward change and I suspect may be uncontroversial.

To clarify on the specific suggestion, @tfmorris, do you mean adding publishers to existing records solely in the case when the re-import source is a MARC record (if they're not already being added more broadly)?

Additionally, with respect to changing the title of an existing record, how do others feel about this? For my part, I think I'm willing to defer to a MARC record when it comes to clobbering other records, for the title field at least. @seabelis, @hornc, @cdrini?

One approach to limit the blast radius might be to take into account the original source, when it comes to whether a MARC record should clobber a title field, though that may just make things more confusing to work with.

seabelis commented 2 days ago

It seems like there's a high error rate with matching these MARC imports to existing records. I'd not be in favour of modifying records based on them (but don't we already do that now?).

scottbarnes commented 2 days ago

#9808 may address the matching issue, but perhaps not. I am unsure of the full extent of the problem.

hornc commented 8 hours ago

I agree that #9808 should make existing record matching considerably better. It fixed a longstanding issue whereby records were frequently matched on title only (ignoring subtitle and any other metadata). These matches were made before even attempting the more sophisticated threshold matching code that exists and has tests in the codebase.

Publishers from new records should currently be added to matched existing records if the existing publishers field is blank (I had to search for the code I thought/hoped existed):

https://github.com/internetarchive/openlibrary/blob/f64cab54045351216cd22b961691d7946ecc0a14/openlibrary/catalog/add_book/__init__.py#L834-L840

Publishers were added to this list in Feb 2023, in this commit: https://github.com/internetarchive/openlibrary/commit/f6268b647eb8e783e0ef3f1203153a65e64c9c96

That is after the reported example, where the publisher wasn't added in Aug 2022: https://openlibrary.org/books/OL10298M/Tiricirapuram_maka%CC%84vittuva%CC%84n%CC%B2_S%CC%81ri%CC%84_Mi%CC%84n%CC%B2a%CC%84t%CC%A3cicuntaram_Pil%CC%A3l%CC%A3ai_avarkal%CC%A3_iyar%CC%B2r%CC%B2iya_Tiru_Amparp_pur?b=5&a=4&_compare=Comparer&m=diff

I believe the code does the correct thing now, but only since 2023, so there will be many examples where it has been missed.
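For readers unfamiliar with the behaviour described above, here is a minimal sketch of "fill a field from the new record only when the existing record has it blank". It is not the actual code at the linked lines: the function name `supplement_blank_fields`, the `SUPPLEMENT_FIELDS` list, and the sample records are all hypothetical, chosen only to illustrate the idea.

```python
from typing import Any

# Hypothetical list of fields eligible for supplementing. The real list lives in
# openlibrary/catalog/add_book/__init__.py and may differ from this one.
SUPPLEMENT_FIELDS = ["publishers", "publish_date", "number_of_pages"]


def supplement_blank_fields(
    existing: dict[str, Any],
    incoming: dict[str, Any],
    fields: list[str] = SUPPLEMENT_FIELDS,
) -> bool:
    """Copy each listed field from `incoming` into `existing`, but only
    when `existing` has no value for it. Returns True if anything changed."""
    changed = False
    for field in fields:
        # A missing key, None, "" or [] all count as blank here.
        if not existing.get(field) and incoming.get(field):
            existing[field] = incoming[field]
            changed = True
    return changed


# Made-up records: an edition with no publisher and a richer incoming record.
edition = {"title": "Exmaple Title", "publishers": []}
marc_record = {"title": "Example Title", "publishers": ["Example Press"]}

first_pass = supplement_blank_fields(edition, marc_record)
second_pass = supplement_blank_fields(edition, marc_record)
```

In this sketch the blank publishers list is filled on the first pass and nothing changes on the second. The title is deliberately left alone, since it is not in the field list; that mirrors the open question above about whether a MARC record should be allowed to clobber an existing title.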

If we wanted to populate missing publishers: