Open tfmorris opened 2 weeks ago
Should have included @hornc in the stakeholders - updated.
I think the `publishers` field would be supplemented now, but if not, that seems like a straightforward change and I suspect may be uncontroversial.
To clarify on the specific suggestion, @tfmorris, do you mean adding publishers to existing records solely in the case when the re-import source is a MARC record (if they're not already being added more broadly)?
Additionally, with respect to changing the title of an existing record, how do others feel about this? For my part, I think I'm willing to defer to a MARC record when it comes to clobbering other records, for the `title` field at least. @seabelis, @hornc, @cdrini?
One approach to limit the blast radius might be to take into account the original source, when it comes to whether a MARC record should clobber a title field, though that may just make things more confusing to work with.
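A minimal sketch of what such a source-aware rule could look like. Everything here is hypothetical and not Open Library's import code: the function names, the `LOW_TRUST_SOURCES` set, and the assumptions that each `source_records` entry starts with a source prefix before the first `:` and that the first entry is the original import.

```python
from __future__ import annotations

# Assumed set of original sources whose titles a MARC record may overwrite.
LOW_TRUST_SOURCES = {"amazon"}


def original_source(source_records: list[str]) -> str | None:
    """Return the prefix of the first (assumed original) source record."""
    if not source_records:
        return None
    return source_records[0].split(":", 1)[0]


def marc_may_replace_title(existing: dict, incoming_source: str) -> bool:
    """Allow a title overwrite only for a MARC import onto a low-trust original."""
    if not incoming_source.startswith("marc:"):
        return False
    origin = original_source(existing.get("source_records", []))
    return origin in LOW_TRUST_SOURCES
```

Under these assumptions a record first created from Amazon could have its title replaced by a later MARC import, while a record first created from MARC would be left alone. Whether that extra rule is worth the added complexity is the open question above.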
It seems like there's a high error rate with matching these MARC imports to existing records. I'd not be in favour of modifying records based on them (but don't we already do that now?).
I agree that #9808 should make existing record matching considerably better -- that fixed a longstanding issue whereby records were frequently matched on title alone (ignoring subtitle and any other metadata). These matches were made before even attempting the more sophisticated threshold matching code that exists and has tests in the codebase.
`publishers` from new records should currently be added to matched existing records if they are blank (I had to search for the code I thought/hoped existed):
`publishers` were added to this list in Feb 2023 in this commit https://github.com/internetarchive/openlibrary/commit/f6268b647eb8e783e0ef3f1203153a65e64c9c96, which is after the reported example where publisher wasn't added in Aug 2022: https://openlibrary.org/books/OL10298M/Tiricirapuram_maka%CC%84vittuva%CC%84n%CC%B2_S%CC%81ri%CC%84_Mi%CC%84n%CC%B2a%CC%84t%CC%A3cicuntaram_Pil%CC%A3l%CC%A3ai_avarkal%CC%A3_iyar%CC%B2r%CC%B2iya_Tiru_Amparp_pur?b=5&a=4&_compare=Comparer&m=diff
I believe the code does the correct thing now, but only since 2023, so there will be many examples where it has been missed.
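For readers unfamiliar with the behaviour described here, a rough sketch of "fill a field on the matched record only when it is blank" follows. This is an illustration only, not the code from the commit linked above; the function name `supplement_blank_fields` and the contents of `SUPPLEMENT_FIELDS` are assumptions.

```python
# Hypothetical field list; the real list lives in the Open Library codebase.
SUPPLEMENT_FIELDS = ("publishers", "publish_date", "number_of_pages")


def supplement_blank_fields(existing: dict, incoming: dict,
                            fields=SUPPLEMENT_FIELDS) -> list:
    """Copy listed fields from incoming to existing where existing is blank.

    Returns the names of the fields that were filled in. Fields that
    already hold a value on the existing record are never overwritten.
    """
    changed = []
    for name in fields:
        if not existing.get(name) and incoming.get(name):
            existing[name] = incoming[name]
            changed.append(name)
    return changed
```

Note that `title` is deliberately absent from the assumed list, so a sketch like this supplements a missing publisher but never clobbers an existing title.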
If we wanted to populate missing `publishers`:
`publishers` records with MARC sources
Problem
When investigating edition records with no publishers for #2119, I noticed cases where the `source_records` field lists MARC records which contain publishers in the MARC 260, but the publisher is not being added to the record.
This edition: https://openlibrary.org/books/OL10298M?m=history was imported from a rather threadbare Scriblio MARC record: https://openlibrary.org/show-records/marc_records_scriblio_net/part29.dat:6096809:617 but a later import claims to have used a much richer Columbia MARC https://openlibrary.org/books/OL10298M/Tiricirapuram_maka%CC%84vittuva%CC%84n%CC%B2_..._Amparp_pura%CC%84n%CC%A3am?b=5&a=4&_compare=Comparer&m=diff yet didn't pull in the publisher from there.
Additionally, the original record elided words from the title, so a search on the full title returns zero hits, but I'm not sure there's a good way to detect and correct for that case.
The second example: https://openlibrary.org/books/OL12026877M also can't be found by title, but in this case because it was imported from a threadbare (and incorrect) Amazon record with a typo in it. Despite "importing" from four higher quality MARC records, all containing the correct title and a fully populated MARC 260 Publisher field, neither the missing publisher field nor the incorrect title was updated.
Before trying to guess publishers based on ISBN, the high quality metadata that's already available should be fully exploited.
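As a rough illustration of how affected editions might be identified, the sketch below flags records that have no publisher but list at least one MARC source. It is not existing Open Library code: the helper names are hypothetical, and it assumes an edition is a plain dict whose `source_records` entries for MARC imports begin with `marc:`.

```python
def marc_sources(edition: dict) -> list:
    """Return the source_records entries that look like MARC imports."""
    return [s for s in edition.get("source_records", []) if s.startswith("marc:")]


def needs_publisher_backfill(edition: dict) -> bool:
    """True if the edition has no non-empty publisher but has a MARC source."""
    has_publisher = any(p.strip() for p in edition.get("publishers", []))
    return not has_publisher and bool(marc_sources(edition))
```

Each flagged edition would still need its MARC 260 re-read to recover the publisher; this only narrows down the candidates.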
Reproducing the bug
Context
Breakdown
Requirements Checklist
Related files
Stakeholders