Determine the specifics of author strong identifier matching

pidgezero-one commented 4 days ago

Question

I have an open project about importing books from Wikisource. My import script uses both the Wikidata API as well as the Wikisource API to fetch as much rich information about each book as possible.

While I was developing this script, I learned about the strong identifiers Wikidata offers for authors (like VIAF id, Bookbrainz id, etc). As a proof of concept, I updated my script to include those identifiers in the import records it outputs, and then I modified the import API pipeline to match incoming books to existing authors based on those identifiers. It works, but there's just not much existing data to match to.

Before committing to this change, we should fill out those identifiers for all of OL's existing authors so that the import pipeline can actually use them for matching authors in incoming records. As Wikidata offers that information, and we already know how to get it, we should have a script that can do that backfill.

We should discuss specifics here, such as which IDs (out of this list) should be used for import matching at all (and in which priority) and how to handle conflict resolution. (Also, how do MARC records for authors factor into this?)

Stakeholders

@RayBB @cdrini

RayBB commented 3 days ago

While the choice of identifiers is outside my expertise, let me outline the technical approaches we could take.

The core issue appears to be improving import matching accuracy by leveraging additional identifiers from Wikidata. You think the best way to do that is by importing more identifiers from Wikidata to Open Library author records.

In my opinion, the simplest way to solve your problem is to use the information currently stored in the Postgres Wikidata table to match identifiers during import. For all Open Library authors with associated Wikidata IDs, we maintain a copy of their Wikidata information in our PostgreSQL database. You could extend the import matching functionality to query these additional identifiers within the existing Wikidata entries there. This approach has several advantages: it avoids duplicating data across OL, prevents synchronization issues, reduces potential conflicts, and offers a straightforward implementation path.

However, if we decide to store these strong identifiers directly on the author records, the process can still leverage the existing Wikidata information from our PostgreSQL database to populate these fields.
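Either option needs a way to pull identifiers out of a stored Wikidata entity. Below is a minimal sketch, assuming the stored copy follows the standard Wikidata entity JSON shape (`claims` → property ID → `mainsnak` → `datavalue`); the actual shape of the rows in our table may differ. The key names in `WIKIDATA_PROPERTY_TO_KEY` and the function `extract_identifiers` are hypothetical, chosen only for illustration.

```python
# Hypothetical mapping from Wikidata property IDs to identifier key names.
# P214 = VIAF ID, P213 = ISNI, P244 = Library of Congress authority ID,
# P648 = Open Library ID. The key names on the right are assumptions.
WIKIDATA_PROPERTY_TO_KEY = {
    "P214": "viaf",
    "P213": "isni",
    "P244": "lc_naf",
    "P648": "openlibrary",
}


def extract_identifiers(entity: dict) -> dict[str, list[str]]:
    """Collect string-valued identifier claims from a Wikidata entity dict.

    Assumes the standard entity JSON layout; this is a sketch, not the
    format Open Library is confirmed to store.
    """
    found: dict[str, list[str]] = {}
    for prop, key in WIKIDATA_PROPERTY_TO_KEY.items():
        for claim in entity.get("claims", {}).get(prop, []):
            snak = claim.get("mainsnak", {})
            # Skip "novalue" / "somevalue" snaks, which carry no datavalue.
            if snak.get("snaktype") != "value":
                continue
            value = snak.get("datavalue", {}).get("value")
            if isinstance(value, str):
                found.setdefault(key, []).append(value)
    return found
```

Values are returned as lists because a Wikidata item can carry more than one claim for the same property, which is itself a useful signal for conflict handling.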

The approach should be determined through a thorough technical evaluation of both options, weighing their respective implementation challenges and implications. I would defer making a recommendation until others can chime in.

Side note: There likely exists a subset of authors whose Open Library IDs are referenced in Wikidata, but whose corresponding Wikidata IDs are not yet recorded in Open Library. While I believe a script has been developed to address this, we'd need to ask to be sure.

Anyway, I'm not deeply familiar with the import system but I'm very excited to see it improve and get better matching!

Freso commented 2 days ago

strong identifiers

FWIW, I really dislike this term, over just using “identifiers”. :) Open Library doesn’t currently have any concept of “strength” of identifiers, and I think it would be a mistake to add it.

In my (subjective, personal!) experience, no identifiers are objectively “stronger” than others. Most “strength” you associate with an identifier relies either on use case… or on your subjective experience/bias. E.g., library identifiers are, in my experience, often conflated and/or lacking a lot of entries: OCLC/VIAF/ISNI are rife with both duplicates and conflated entities and also don’t have information on a lot of items (either no reliable/useful information, or just straight no information at all). In my experience, identifiers that are community maintained/curated (like MBIDs, BBIDs, WD ids) are far more reliable, but all datasets, community or institution managed, have their holes/gaps.

if we decide to store these […] identifiers directly on the author records

My vote is for doing this. I can expand on my arguments/reasoning here or elsewhere, as appropriate. :)

how do MARC records for authors factor into this?

I’m not sure what you mean? If the MARC record has any identifiers in it, we can use those, and if it doesn’t then, well, it doesn’t.

We should discuss specifics here, such as which IDs (out of this list) should be used for import matching at all (and in which priority) and how to handle conflict resolution.

My take is that for project imports (e.g., Wikisource, LibriVox, Gutenberg, Runeberg, …), the identifiers from that project should reign supreme. This might be difficult to code, though, if not impossible. Importers could run their own preliminary matching though to seed their import data with OLIDs (see https://github.com/internetarchive/openlibrary/issues/9411), which should bypass this whole process.

My suggestion for identifier-based import logic flow:

- Super-Ideal case:
  - incoming data has an OLID ⇒ match to that OLID
- Ideal cases (external ids match a single OL entity):
  - no overlap in known sets of identifiers ⇒ no match, fall back to “normal” matching
  - all known OL ids do not match incoming equivalent ids ⇒ reject match
  - any known OL ids match any incoming recognised ids ⇒ match
- Troublesome cases (note: ideally, any of these would raise a flag somewhere that a librarian could find for further investigation):
  - incoming identifiers match multiple OL entities ⇒ pick an entity to continue with (then apply the “external ids match a single OL entity” checks below):
    - if matches > 2:
      - if a group of entities match ≥ half of the incoming ids, pick that group and restart “incoming identifiers match multiple OL entities” with just these (note: this is probably the most complex calculation here with a lot of internal edge cases, so, for simplicity, this could be dropped)
        - e.g., A matches 4, B matches 2, C–H match just 1 each ⇒ start over with just A, B
      - if entity A matches ≥ half of the known incoming identifiers, pick entity A
        - e.g., A matches on 3 identifiers and B, C, D match on 1 identifier each
        - note: ≥ should be fine, as there should be no case where A matches half and B matches the other half, since then matches > 2 would be false and the flow would be in the matches == 2 tree instead; there might be some edge cases with multiple OL entities having the same identifiers assigned though
      - else no match/fall back to non‐identifier matching
    - if matches == 2:
      - if both entities have the same number of matched identifiers, pick the oldest (lowest OLID)
        - note: this is most likely a duplicate that needs merging, and going with the lowest OLID reduces the amount of data that needs updating when the merge is done
      - else pick the entity with the most matched identifiers
  - external ids match a single OL entity:
    - half or more of incoming identifiers conflict with known OL ids ⇒ reject match (possibly fall back to normal matching?)
    - more incoming identifiers match with known OL ids than conflict ⇒ match
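To make the flow above concrete, here is a rough sketch of it as a single function. It is an illustration under simplifying assumptions, not the import pipeline's actual code: every identifier is treated as single-valued, the optional "group of entities" step is dropped (as suggested in the note), flagging for librarians is omitted, and the names `Ids`, `tally`, `olid_number`, and `match_author` are all hypothetical.

```python
from typing import Optional

Ids = dict[str, str]  # identifier type -> value, e.g. {"viaf": "123"}


def olid_number(olid: str) -> int:
    """Numeric part of an author OLID such as 'OL123A'."""
    return int(olid[2:-1])


def tally(incoming: Ids, known: Ids) -> tuple[int, int]:
    """Return (matched, conflicting) over identifier types present on both sides."""
    shared = incoming.keys() & known.keys()
    matched = sum(1 for k in shared if incoming[k] == known[k])
    return matched, len(shared) - matched


def match_author(
    incoming: Ids, authors: dict[str, Ids], olid: Optional[str] = None
) -> Optional[str]:
    """Return the matched OLID, or None to fall back to normal matching."""
    # Super-ideal case: the incoming record already carries an OLID.
    if olid is not None:
        return olid

    tallies = {key: tally(incoming, known) for key, known in authors.items()}
    candidates = [key for key, (matched, _) in tallies.items() if matched > 0]
    if not candidates:
        # No overlap, or every shared identifier conflicts.
        return None

    # Most matched identifiers wins; ties go to the lowest OLID number.
    chosen = max(candidates, key=lambda k: (tallies[k][0], -olid_number(k)))
    if len(candidates) > 2 and tallies[chosen][0] * 2 < len(incoming):
        # More than two candidates and none matches at least half.
        return None

    # Single-entity checks: reject when half or more incoming ids conflict.
    matched, conflicting = tallies[chosen]
    if conflicting * 2 >= len(incoming):
        return None
    return chosen if matched > conflicting else None
```

For example, with `authors = {"OL9A": {"viaf": "1"}, "OL3A": {"viaf": "1"}}`, an incoming record `{"viaf": "1"}` ties on one match each and resolves to `"OL3A"`, the lower OLID. Note that nothing here weights one identifier type above another; it only counts matches and conflicts.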

For merging incoming identifier sets with existing sets, I’d say

- if identifier is unset in OL, set it
- if identifier is different in OL and plural (e.g., oclc_numbers for Editions), append to OL’s list
- if identifier is different in OL and singular, keep OL’s version (but flag, if possible)
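These three merge rules could look roughly like the following. This is only a sketch: it assumes plural identifiers are stored as lists and singular ones as plain values, and the function name `merge_identifiers` and the returned list of flagged keys are hypothetical.

```python
def merge_identifiers(existing: dict, incoming: dict) -> tuple[dict, list[str]]:
    """Merge incoming identifiers into a copy of the existing set.

    Returns (merged, flagged), where flagged lists singular identifiers
    whose incoming value disagreed with the one already in OL.
    """
    # Copy lists so the caller's data is not mutated.
    merged = {k: (list(v) if isinstance(v, list) else v) for k, v in existing.items()}
    flagged: list[str] = []
    for key, value in incoming.items():
        if key not in merged or merged[key] in (None, "", []):
            # Unset in OL: set it.
            merged[key] = list(value) if isinstance(value, list) else value
        elif isinstance(merged[key], list):
            # Plural: append any values OL does not have yet.
            new_values = value if isinstance(value, list) else [value]
            for v in new_values:
                if v not in merged[key]:
                    merged[key].append(v)
        elif merged[key] != value:
            # Singular and different: keep OL's version, but flag it.
            flagged.append(key)
    return merged, flagged
```

Returning the flagged keys separately leaves it to the caller to decide where a librarian-facing flag would be recorded.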

Of note, this flow has no concept of identifier “strength” and simply tallies up and compares the amount of matching vs. non‐matching identifiers for any given item.

internetarchive / openlibrary

Determine the specifics of author strong identifier matching #10029
