A clear and concise description of what you want to happen
Being able to include any (known to OL) identifiers with authors when importing into Open Library. (Note: This is a superset of https://github.com/internetarchive/openlibrary/issues/9411 which is only concerned with Open Library identifiers. Making a separate issue due to additional considerations (see later section).)
When generating a JSON blurb for import into OL, it should be possible to provide known author identifiers in it.
Additional Context
When importing into OL you might have a variety of identifiers available that might assist in pinpointing the correct Author (if they exist in OL). E.g., if you import from Amazon, you will have the Amazon author id in addition to the edition ASIN. If you import from LibriVox, you will have the LibriVox author id. Right now it is not possible to provide these to the import pipeline to help with identifying authors, but it could be a great help.
Proposal & Constraints
What is the proposed solution / implementation?
Changing the JSON schema to allow identifiers for authors, perhaps something like:
diff --git a/olclient/schemata/import.schema.json b/olclient/schemata/import.schema.json
index 3f00e90..8467b76 100644
--- a/olclient/schemata/import.schema.json
+++ b/olclient/schemata/import.schema.json
@@ -109,19 +109,8 @@
       ]
     },
     "identifiers": {
-      "type": "object",
-      "patternProperties": {
-        "^\\w+": { "$ref": "shared_definitions.json#/string_array" }
-      },
-      "description": "Unique identifiers used by external sites to identify a book. Used by Open Library to link offsite.",
-      "examples": [
-        {
-          "standard_ebooks": ["leo-tolstoy/what-is-art/aylmer-maude"]
-        },
-        {
-          "project_gutenberg": ["64317"]
-        }
-      ]
+      "type": "array",
+      "items": { "$ref": "#/definitions/import_identifier" }
     },
     "cover": {
       "type": "string",
@@ -132,6 +121,21 @@
     }
   },
   "definitions": {
+    "import_identifier": {
+      "type": "object",
+      "patternProperties": {
+        "^\\w+": { "$ref": "shared_definitions.json#/string_array" }
+      },
+      "description": "Unique identifiers used by external sites to identify a book, author, or work. Used by Open Library to link offsite.",
+      "examples": [
+        {
+          "standard_ebooks": ["leo-tolstoy/what-is-art/aylmer-maude"]
+        },
+        {
+          "project_gutenberg": ["64317"]
+        }
+      ]
+    },
     "import_author": {
       "type": "object",
       "additionalProperties": false,
@@ -163,6 +167,10 @@
         "title": {
           "type": "string",
           "examples": ["duc d'Otrante"]
+        },
+        "identifiers": {
+          "type": "array",
+          "items": { "$ref": "#/definitions/import_identifier" }
         }
       }
     },
and then of course have the importer pipeline actually recognise the author objects’ identifier(s) and use them for matching against existing OL authors.
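To make the proposed shape concrete, here is a minimal sketch of an import record whose author carries identifiers, plus a small structural check that mirrors the `import_identifier` definition from the diff. The identifier values are placeholders, the key names `librivox` and `amazon` are assumptions rather than confirmed OL identifier names, and `string_array` is assumed to mean "a list of strings". The helper `is_valid_identifiers` is hypothetical and not part of any existing client.

```python
import re

# Same key pattern as the proposed "patternProperties" entry.
KEY_PATTERN = re.compile(r"^\w+")

# Hypothetical import record in the proposed shape. All identifier values
# below are placeholders, not real IDs.
record = {
    "title": "What Is Art?",
    "authors": [
        {
            "name": "Leo Tolstoy",
            "identifiers": [
                {"librivox": ["0000"]},
                {"amazon": ["B000000000"]},
            ],
        }
    ],
}


def is_valid_identifiers(value) -> bool:
    """Check the proposed shape: a list of objects mapping names to string lists."""
    if not isinstance(value, list):
        return False
    for entry in value:
        if not isinstance(entry, dict):
            return False
        for key, ids in entry.items():
            # Like patternProperties, only keys matching the pattern are constrained.
            if KEY_PATTERN.search(key):
                if not isinstance(ids, list):
                    return False
                if not all(isinstance(i, str) for i in ids):
                    return False
    return True


author = record["authors"][0]
print(is_valid_identifiers(author["identifiers"]))
```

Note that under this proposal the old object form (`{"librivox": ["0000"]}` directly under `identifiers`) would no longer validate, so existing edition-level import records would need to be migrated or both shapes accepted.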
Is there a precedent of this approach succeeding elsewhere?
Several MusicBrainz importer scripts use identifiers from import sources to match up identifiers in MusicBrainz. E.g., a-tisket cross-references artist identifiers from iTunes, Deezer, and Spotify with ones known in MusicBrainz to ease the import into MusicBrainz by assigning artists to already existing ones. The Discogs importer userscript does the same, but also does this for Release Groups and Labels.
Granted, the import flow for MusicBrainz is quite different from Open Library, but I think it still shows how being able to look up an import source’s own identifiers can greatly help in matching against the target dataset.
Which suggestions or requirements should be considered for how feature needs to appear or be implemented?
Some considerations:
- What happens when an author otherwise perfectly matches an existing author, but…
  - the provided identifier isn’t/identifiers aren’t already known?
    - author gets matched regardless?
      - and the identifier is/identifiers are discarded
      - and the identifier is added to the matched author
    - a new author gets created with the identifier(s) attached?
    - import fails
  - provided identifier(s) not known but author already has identifier(s) of the same type(s) (e.g., LibriVox id provided, but matched author already has a different LibriVox id)
  - provided identifier(s) match(es) a different author
  - provided identifier(s) match(es) different authors(!)
- Identifier(s) match(es) existing Author but any other provided data (name, birth/death dates, …) do not
And I’m sure there are plenty of other edge cases, but this is clearly more involved than just allowing OL ids (https://github.com/internetarchive/openlibrary/issues/9411).
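The edge cases listed above could be told apart before any policy is applied. The following is only a sketch under stated assumptions: `classify`, `Case`, the author keys, and the identifier values are all hypothetical, `existing` stands in for whatever lookup the importer would actually use, and `name_match` is assumed to be the key of the author already matched on name and dates (or `None`). It deliberately does not decide what the importer should do in each case, and it assumes at least one identifier was provided.

```python
from enum import Enum
from typing import Optional

Identifiers = dict[str, list[str]]  # e.g. {"librivox": ["100"]}


class Case(Enum):
    NO_MATCH = "no name match and no identifier match"
    CONFIRMED = "identifier and name point to the same author"
    UNKNOWN_IDENTIFIER = "name matches; identifier type is new for that author"
    SAME_TYPE_CONFLICT = "name matches; author has a different id of that type"
    DIFFERENT_AUTHOR = "identifier belongs to another author than the name match"
    MULTIPLE_AUTHORS = "identifiers belong to more than one author"
    IDENTIFIER_ONLY = "identifier matches an author but the other data do not"


def classify(
    provided: Identifiers,
    name_match: Optional[str],
    existing: dict[str, Identifiers],
) -> Case:
    # Authors that already hold at least one of the provided identifiers.
    owners = {
        key
        for key, known in existing.items()
        for id_type, values in provided.items()
        if set(values) & set(known.get(id_type, []))
    }
    if len(owners) > 1:
        return Case.MULTIPLE_AUTHORS
    if len(owners) == 1:
        (owner,) = owners
        if name_match is None:
            return Case.IDENTIFIER_ONLY
        return Case.CONFIRMED if owner == name_match else Case.DIFFERENT_AUTHOR
    # From here on, none of the provided identifiers is known.
    if name_match is None:
        return Case.NO_MATCH
    known = existing.get(name_match, {})
    if any(known.get(id_type) for id_type in provided):
        return Case.SAME_TYPE_CONFLICT
    return Case.UNKNOWN_IDENTIFIER


# Placeholder data: author keys and identifier values are made up.
existing = {
    "/authors/OL1A": {"librivox": ["100"]},
    "/authors/OL2A": {"amazon": ["B000"]},
    "/authors/OL3A": {},
}
print(classify({"librivox": ["999"]}, "/authors/OL1A", existing).value)
```

Separating classification from policy keeps the open questions (discard the identifier, attach it, create a new author, or fail the import) as explicit decisions per case rather than side effects of the matching code.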