internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

Import endpoint should allow for any (known) author identifiers #9448

Open Freso opened 5 months ago

Freso commented 5 months ago

Problem

A clear and concise description of what you want to happen

It should be possible to include any identifiers known to Open Library (OL) with authors when importing into Open Library. Note: this is a superset of https://github.com/internetarchive/openlibrary/issues/9411, which is concerned only with Open Library identifiers. I am filing a separate issue because of the additional considerations involved (see the later section).

Expected behaviour / screenshots (ex: Figma design screenshots for UI feature)

When generating a JSON blurb for import into OL, it should be possible to provide known author identifiers in it.
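
As a rough illustration, a trimmed import record carrying author identifiers might look like the sketch below. It assumes the array-of-objects shape from the schema change proposed further down in this issue. The helper names, the identifier names, and all values are hypothetical placeholders, not real OL or vendor data, and the shape check only approximates the proposed `import_identifier` definition (it assumes `string_array` means a list of strings).

```python
import json
import re

# Approximates the proposed patternProperties rule: keys matching ^\w+
# must map to a list of strings. Keys that do not match are unconstrained,
# as in JSON Schema.
KEY_PATTERN = re.compile(r"^\w+")


def is_import_identifier(obj):
    """Return True if obj looks like one proposed `import_identifier` entry."""
    if not isinstance(obj, dict):
        return False
    for key, values in obj.items():
        if KEY_PATTERN.search(key):
            if not isinstance(values, list):
                return False
            if not all(isinstance(v, str) for v in values):
                return False
    return True


def author_identifiers_valid(record):
    """Check every identifier entry on every author of a record."""
    return all(
        is_import_identifier(entry)
        for author in record.get("authors", [])
        for entry in author.get("identifiers", [])
    )


# Hypothetical, trimmed record: only the fields relevant to this issue.
record = {
    "title": "An Example Title",
    "authors": [
        {
            "name": "Jane Example",
            "identifiers": [
                {"librivox": ["PLACEHOLDER-1"]},
                {"amazon": ["PLACEHOLDER-2"]},
            ],
        }
    ],
}

if __name__ == "__main__":
    print(author_identifiers_valid(record))
    print(json.dumps(record, indent=2))
```

Keeping one source per object mirrors the proposed schema; a single object with several keys would also pass this check, since the proposed definition does not forbid it.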

Additional Context

When importing into OL you might have a variety of identifiers available that might assist in pinpointing the correct Author (if they exist in OL). E.g., if you import from Amazon, you will have the Amazon author id in addition to the edition ASIN. If you import from LibriVox, you will have the LibriVox author id. Right now it is not possible to provide these to the import pipeline to help with identifying authors, but it could be a great help.

Proposal & Constraints

What is the proposed solution / implementation?

Change the JSON schema to allow identifiers for authors, perhaps something like:

diff --git a/olclient/schemata/import.schema.json b/olclient/schemata/import.schema.json
index 3f00e90..8467b76 100644
--- a/olclient/schemata/import.schema.json
+++ b/olclient/schemata/import.schema.json
@@ -109,19 +109,8 @@
       ]
     },
     "identifiers": {
-      "type": "object",
-      "patternProperties": {
-        "^\\w+": { "$ref": "shared_definitions.json#/string_array" }
-      },
-      "description": "Unique identifiers used by external sites to identify a book. Used by Open Library to link offsite.",
-      "examples": [
-        {
-            "standard_ebooks": ["leo-tolstoy/what-is-art/aylmer-maude"]
-        },
-        {
-            "project_gutenberg": ["64317"]
-        }
-      ]
+      "type": "array",
+      "items": { "$ref": "#/definitions/import_identifier" }
     },
     "cover": {
       "type": "string",
@@ -132,6 +121,21 @@
     }
   },
   "definitions": {
+    "import_identifier": {
+      "type": "object",
+      "patternProperties": {
+        "^\\w+": { "$ref": "shared_definitions.json#/string_array" }
+      },
+      "description": "Unique identifiers used by external sites to identify a book, author, or work. Used by Open Library to link offsite.",
+      "examples": [
+        {
+            "standard_ebooks": ["leo-tolstoy/what-is-art/aylmer-maude"]
+        },
+        {
+            "project_gutenberg": ["64317"]
+        }
+      ]
+    },
     "import_author": {
       "type": "object",
       "additionalProperties": false,
@@ -163,6 +167,10 @@
         "title": {
           "type": "string",
           "examples": ["duc d'Otrante"]
+        },
+        "identifiers": {
+          "type": "array",
+          "items": { "$ref": "#/definitions/import_identifier" }
         }
       }
     },

The importer pipeline would then of course also need to recognise the author objects’ identifier(s) and use them for matching against existing OL authors.
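
A minimal sketch of how such matching could work, under stated assumptions: existing authors are represented here as an in-memory dict with a `remote_ids` mapping, which is an assumed stand-in for however the pipeline actually looks up authors. The function names, author keys, and identifier values are all hypothetical. The sketch only accepts a match when the supplied identifiers point to exactly one existing author, and otherwise leaves the decision to the existing name-based logic.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical stand-in for existing OL authors; keys and values are invented.
EXISTING_AUTHORS = {
    "/authors/OL1A": {
        "name": "Jane Example",
        "remote_ids": {"librivox": "111", "amazon": "PLACEHOLDER-A"},
    },
    "/authors/OL2A": {
        "name": "Jane Example",
        "remote_ids": {"librivox": "222"},
    },
}

Index = Dict[Tuple[str, str], Set[str]]


def build_index(authors) -> Index:
    """Map (source, identifier value) to the set of author keys that carry it."""
    index: Index = {}
    for key, author in authors.items():
        for source, value in author.get("remote_ids", {}).items():
            index.setdefault((source, value), set()).add(key)
    return index


def match_author(import_author, index: Index) -> Optional[str]:
    """Return the single author key matched by identifiers, else None."""
    candidates: Set[str] = set()
    for entry in import_author.get("identifiers", []):
        for source, values in entry.items():
            for value in values:
                candidates |= index.get((source, value), set())
    if len(candidates) == 1:
        return next(iter(candidates))
    # No hit, or conflicting hits: defer to other matching logic.
    return None


if __name__ == "__main__":
    index = build_index(EXISTING_AUTHORS)
    incoming = {"name": "Jane Example", "identifiers": [{"librivox": ["222"]}]}
    print(match_author(incoming, index))
```

Here the two existing authors share a name, so name matching alone would be ambiguous, while the LibriVox identifier singles out one of them. Returning `None` on conflicting identifiers is a deliberately conservative choice for the sketch.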

Is there a precedent of this approach succeeding elsewhere?

Several MusicBrainz importer scripts use identifiers from import sources to match against entities already in MusicBrainz. E.g., a-tisket cross-references artist identifiers from iTunes, Deezer, and Spotify with the ones known to MusicBrainz, which eases the import by assigning artists to already existing entries. The Discogs importer userscript does the same, and also does this for Release Groups and Labels.

Granted, the import flow for MusicBrainz is quite different from Open Library, but I think it still shows how being able to look up an import source’s own identifiers can greatly help in matching against the target dataset.

Which suggestions or requirements should be considered for how feature needs to appear or be implemented?

Some considerations:

I’m sure there are plenty of other edge cases, but this is clearly more involved than just allowing OL ids (https://github.com/internetarchive/openlibrary/issues/9411).

tfmorris commented 5 days ago

The vast majority of strong identifiers on import come from MARC records. #7724 covers that use case.