Import endpoint should allow for any (known) author identifiers

Problem

A clear and concise description of what you want to happen

It should be possible to include any author identifiers known to Open Library (OL) when importing into Open Library. (Note: This is a superset of https://github.com/internetarchive/openlibrary/issues/9411, which is only concerned with Open Library identifiers. I am filing a separate issue because of the additional considerations described in a later section.)

Expected behaviour / screenshots (ex: Figma design screenshots for UI feature)

When generating a JSON blurb for import into OL, it should be possible to provide known author identifiers in it.
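
As a rough illustration, an import record carrying author identifiers might look like the sketch below. It assumes the array-of-objects shape from the schema proposal in this issue; the identifier keys `librivox` and `amazon` and their values are placeholders, not confirmed OL identifier names or real IDs.

```python
import json

# Hypothetical import record. The author-level "identifiers" field is the
# proposed addition; the author identifier values are placeholders.
record = {
    "title": "What Is Art?",
    "authors": [
        {
            "name": "Leo Tolstoy",
            "identifiers": [
                {"librivox": ["LIBRIVOX-AUTHOR-ID"]},
                {"amazon": ["AMAZON-AUTHOR-ID"]},
            ],
        }
    ],
    # Edition-level identifiers, in the same proposed shape.
    "identifiers": [
        {"standard_ebooks": ["leo-tolstoy/what-is-art/aylmer-maude"]},
    ],
}

# Serialise to the JSON blurb that would be sent to the import endpoint.
blurb = json.dumps(record, indent=2)
print(blurb)
```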

Additional Context

When importing into OL you might have a variety of identifiers available that might assist in pinpointing the correct Author (if they exist in OL). E.g., if you import from Amazon, you will have the Amazon author id in addition to the edition ASIN. If you import from LibriVox, you will have the LibriVox author id. Right now it is not possible to provide these to the import pipeline to help with identifying authors, but it could be a great help.

Proposal & Constraints

What is the proposed solution / implementation?

Change the JSON schema to allow identifiers on authors, perhaps something like:

diff --git a/olclient/schemata/import.schema.json b/olclient/schemata/import.schema.json
index 3f00e90..8467b76 100644
--- a/olclient/schemata/import.schema.json
+++ b/olclient/schemata/import.schema.json
@@ -109,19 +109,8 @@
       ]
     },
     "identifiers": {
-      "type": "object",
-      "patternProperties": {
-        "^\\w+": { "$ref": "shared_definitions.json#/string_array" }
-      },
-      "description": "Unique identifiers used by external sites to identify a book. Used by Open Library to link offsite.",
-      "examples": [
-        {
-            "standard_ebooks": ["leo-tolstoy/what-is-art/aylmer-maude"]
-        },
-        {
-            "project_gutenberg": ["64317"]
-        }
-      ]
+      "type": "array",
+      "items": { "$ref": "#/definitions/import_identifier" }
     },
     "cover": {
       "type": "string",
@@ -132,6 +121,21 @@
     }
   },
   "definitions": {
+    "import_identifier": {
+      "type": "object",
+      "patternProperties": {
+        "^\\w+": { "$ref": "shared_definitions.json#/string_array" }
+      },
+      "description": "Unique identifiers used by external sites to identify a book, author, or work. Used by Open Library to link offsite.",
+      "examples": [
+        {
+            "standard_ebooks": ["leo-tolstoy/what-is-art/aylmer-maude"]
+        },
+        {
+            "project_gutenberg": ["64317"]
+        }
+      ]
+    },
     "import_author": {
       "type": "object",
       "additionalProperties": false,
@@ -163,6 +167,10 @@
    "title": {
      "type": "string",
      "examples": ["duc d'Otrante"]
+   },
+   "identifiers": {
+     "type": "array",
+     "items": { "$ref": "#/definitions/import_identifier" }
    }
       }
     },

and then of course have the importer pipeline actually recognise the author objects’ identifiers and use them for matching against existing OL authors.
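
A minimal sketch of what identifier-based lookup could look like, assuming the array-of-objects identifier shape from the diff above. The in-memory `existing` dictionary, the author keys, and the function names are all hypothetical; the real pipeline would query Open Library's own datastore rather than a dictionary.

```python
from collections import defaultdict


def flatten_identifiers(identifiers):
    """Turn [{"librivox": ["1"]}, {"amazon": ["B0"]}] into a set of (type, value) pairs."""
    return {
        (id_type, value)
        for entry in identifiers
        for id_type, values in entry.items()
        for value in values
    }


def build_index(authors):
    """Map each (type, value) pair to the set of author keys that carry it."""
    index = defaultdict(set)
    for key, author in authors.items():
        for pair in flatten_identifiers(author.get("identifiers", [])):
            index[pair].add(key)
    return index


def candidate_authors(import_author, index):
    """Return the keys of all existing authors sharing any identifier with the import."""
    found = set()
    for pair in flatten_identifiers(import_author.get("identifiers", [])):
        found |= index.get(pair, set())
    return found


# Hypothetical existing authors with placeholder keys and identifier values.
existing = {
    "/authors/EXAMPLE_A": {"name": "Author A", "identifiers": [{"librivox": ["X1"]}]},
    "/authors/EXAMPLE_B": {"name": "Author B", "identifiers": [{"amazon": ["Z9"]}]},
}
index = build_index(existing)
incoming = {"name": "Author A", "identifiers": [{"librivox": ["X1"]}]}
print(candidate_authors(incoming, index))
```

Returning a set rather than a single key keeps the "more than one author matched" case visible to the caller instead of silently picking one.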

Is there a precedent of this approach succeeding elsewhere?

Several MusicBrainz importer scripts use identifiers from import sources to match up identifiers in MusicBrainz. E.g., a-tisket cross-references artist identifiers from iTunes, Deezer, and Spotify with ones known in MusicBrainz to ease the import into MusicBrainz by assigning artists to already existing ones. The Discogs importer userscript does the same, but also does this for Release Groups and Labels.

Granted, the import flow for MusicBrainz is quite different from Open Library, but I think it still shows how being able to look up an import source’s own identifiers can greatly help in matching against the target dataset.

Which suggestions or requirements should be considered for how feature needs to appear or be implemented?

Some considerations:

What happens when an author otherwise perfectly matches an existing author, but…
- the provided identifier isn’t/identifiers aren’t already known?
  - author gets matched regardless?
    - and the identifier is/identifiers are discarded
    - and the identifier is added to the matched author
  - a new author gets created with the identifier(s) attached?
  - import fails
- provided identifier(s) not known but author already has identifier(s) of the same type(s) (e.g., LibriVox id provided, but matched author already has a different LibriVox id)
- provided identifier(s) match(es) a different author
- provided identifier(s) match(es) different authors(!)

Identifier(s) match(es) existing Author but any other provided data (name, birth/death dates, …) do not
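
One way to keep these cases apart is to name them explicitly before deciding on a policy for each. The sketch below only classifies the situation and deliberately takes no action; the `Outcome` names, the `classify` function, and its parameters are hypothetical and not part of the existing importer.

```python
from enum import Enum


class Outcome(Enum):
    CONSISTENT = "identifier and other data agree on one author"
    IDENTIFIERS_UNKNOWN = "author matched on other data, identifiers not known"
    SAME_TYPE_CONFLICT = "matched author already has a different id of the same type"
    DIFFERENT_AUTHOR = "identifiers point at a different author"
    AMBIGUOUS = "identifiers point at several authors"
    DATA_MISMATCH = "identifiers match an author, other data does not"
    NO_MATCH = "nothing matched"


def classify(name_match, id_matches, provided_types, name_match_ids):
    """Classify an import author against existing authors.

    name_match: key of the author matched on name/dates, or None
    id_matches: set of author keys matched by any provided identifier
    provided_types: set of identifier types supplied in the import
    name_match_ids: dict of type -> set of values already on the name-matched author
    """
    if len(id_matches) > 1:
        return Outcome.AMBIGUOUS
    if id_matches:
        (only,) = id_matches
        if name_match is None:
            return Outcome.DATA_MISMATCH
        return Outcome.CONSISTENT if only == name_match else Outcome.DIFFERENT_AUTHOR
    # From here on, no provided identifier is known to any author.
    if name_match is None:
        return Outcome.NO_MATCH
    if any(name_match_ids.get(id_type) for id_type in provided_types):
        return Outcome.SAME_TYPE_CONFLICT
    return Outcome.IDENTIFIERS_UNKNOWN


print(classify("A", set(), {"librivox"}, {"librivox": {"other-id"}}))
```

Each `Outcome` can then be mapped to whichever behaviour is agreed on (match and attach, match and discard, create a new author, or fail the import).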

I’m sure there are plenty of other edge cases, but this is clearly more involved than just allowing OL ids (https://github.com/internetarchive/openlibrary/issues/9411).

