Open tfmorris opened 2 months ago
Relates to #3084. That issue references editors in the title, but the discussion has examples where Amazon also calls out translator and editor roles in the same way.
I didn't really highlight the biggest problem here, which is that the two individuals Amazon lists separately, each with their own role, are being imported as a single author. I've updated the title to better reflect what's going on.
I added some more details on how this issue might be approached, and included sample response data from the Amazon Products API with which to work. This is probably not a trivial issue to tackle and would make a fairly bad first issue.
Hi @scottbarnes I would like to try and tackle this one as well. I think it might be somewhat related to one of the last issues I worked on. Thank you kindly!
Thanks for offering to work on this, @DebbieSan! If you have any questions, please ask.
@scottbarnes will do :) thank you!
Problem
This author: https://openlibrary.org/authors/OL9912016A.json was imported from Amazon in Nov 2021 with the obviously conflated name of "Rachel Kushner Suat Ertuzun."
While there are thousands upon thousands of conflated author records imported from booksellers BWB (especially) and Amazon, for a wide variety of reasons, this record actually has author and translator listed separately: https://www.amazon.com/gp/product/975072545X where it says "by Rachel Kushner (Author), Suat Ertüzün (Translator)," yet they were imported munged together.
Since this is a nicely specific example, hopefully it will be easy to fix.
Reproducing the bug
Context
Breakdown
Note: this may not be the easiest issue to work on because it isn't trivial to test, as actually running this code relies on running BookWorm/the affiliate server, which can take some work.
Before tackling this issue you'd at least want to understand how `serialize()` gets data from the Amazon Products API, how it flows through to `clean_amazon_metadata_for_load()`, and how you can hardcode your own test data into this process to test the flow between the functions, even if you don't ultimately run BookWorm itself in the affiliate-server container.
Currently, the `serialize()` function from BookWorm/the affiliate server is returning the following information for https://www.amazon.com/gp/product/975072545X:
As noted above, both author and translator have been lumped together as authors.
However, the metadata that comes back from the Amazon Products API includes both, as shown by this excerpt of the `by_line_info`:
We will need to update the `serialize()` function around line 271 or so in `openlibrary/core/vendors.py` to both stop importing translator roles as authors, and also create a new key for `translators` and extract them separately.
This will also require further changes to ensure this ends up in the correct format for import. This would likely be done in `clean_amazon_metadata_for_load()`, though it may involve other changes.
Specifically, the translator(s) should end up as contributors with a translator role. See `import_contributor` in https://github.com/internetarchive/openlibrary-client/blob/master/olclient/schemata/import.schema.json and https://openlibrary.org/books/OL24337004M.json (from https://openlibrary.org/books/OL24337004M/The_Odyssey_of_Homer).
As a test case, here's the full metadata that comes back from the Amazon API, which could be used for mocking a response from the API:
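Separately from the full payload, the role split described above might be sketched roughly as follows. This is only an illustration under stated assumptions: it treats each by-line contributor as a plain dict with `name` and `role` keys, which may not match the real API objects, and the helper names `split_by_role` and `to_import_fields` are hypothetical and do not exist in `vendors.py`. The `contributors` output shape (a `name` plus a `role`) is assumed from the `import_contributor` reference above and should be checked against the schema.

```python
from typing import Any


def split_by_role(by_line: list[dict[str, Any]]) -> dict[str, list[str]]:
    """Separate translators from everyone else in a by-line contributor list.

    Hypothetical helper: assumes each entry looks like
    {"name": "...", "role": "Author"}. Any role other than translator is
    still treated as an author here; only translators are pulled out.
    """
    authors: list[str] = []
    translators: list[str] = []
    for contributor in by_line:
        role = (contributor.get("role") or "").strip().lower()
        if role == "translator":
            translators.append(contributor["name"])
        else:
            authors.append(contributor["name"])
    return {"authors": authors, "translators": translators}


def to_import_fields(serialized: dict[str, list[str]]) -> dict[str, Any]:
    """Turn the split names into import-style fields.

    Hypothetical helper: translators become contributors with a
    "Translator" role; the "contributors" key is omitted when empty.
    """
    record: dict[str, Any] = {
        "authors": [{"name": name} for name in serialized["authors"]]
    }
    if serialized["translators"]:
        record["contributors"] = [
            {"name": name, "role": "Translator"}
            for name in serialized["translators"]
        ]
    return record


# Hand-written stand-in for the by-line data of the example product,
# using the names and roles quoted from the Amazon page in this issue.
example_by_line = [
    {"name": "Rachel Kushner", "role": "Author"},
    {"name": "Suat Ertüzün", "role": "Translator"},
]
example_record = to_import_fields(split_by_role(example_by_line))
print(example_record)
```

Keeping the split and the import formatting in two small functions mirrors the `serialize()` then `clean_amazon_metadata_for_load()` flow, so each step can be tested with hardcoded data without running the affiliate server.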
Requirements Checklist
* Update `serialize()` in `openlibrary/core/vendors.py` to no longer treat translators as authors.
* Update `clean_amazon_metadata_for_load()` to handle translators as `contributors`.
* Test `serialize` first and make sure you get the data you want (e.g. with the translator and author properly segmented). When that works, use the output data from `serialize()` as input in a separate test for `clean_amazon_metadata_for_load`. Then you can take whatever that returns and just make sure it imports properly (see import docs), and we can probably call it a day at that point.
Related files
Stakeholders
*
Instructions for Contributors