internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

Amazon author + translator imported as single conflated author #9885

Open tfmorris opened 2 months ago

tfmorris commented 2 months ago

Problem

This author: https://openlibrary.org/authors/OL9912016A.json was imported from Amazon in Nov 2021 with the obviously conflated name of "Rachel Kushner Suat Ertuzun."

While there are thousands upon thousands of conflated author records imported from booksellers BWB (especially) and Amazon, for a wide variety of reasons, the Amazon listing behind this record actually shows the author and translator separately: https://www.amazon.com/gp/product/975072545X where it says "by Rachel Kushner (Author), Suat Ertüzün (Translator)," yet they were imported munged together.

Since this is a nicely specific example, hopefully it will be easy to fix.

Reproducing the bug

  1. Go to the link above
  2. Do ...

Breakdown

Note: this may not be the easiest issue to work on because it isn't trivial to test, as actually running this code relies on running BookWorm/the affiliate server, which can take some work.

Before tackling this issue, you'd at least want to understand how serialize() gets data from the Amazon Products API, how that data flows through to clean_amazon_metadata_for_load(), and how you can hardcode your own test data into this process to test the flow between the functions, even if you don't ultimately run BookWorm itself in the affiliate-server container.

Currently, the serialize() function from BookWorm/the affiliate server is returning the following information for https://www.amazon.com/gp/product/975072545X:

{'authors': [{'name': 'Rachel Kushner'}, {'name': 'Suat Ertüzün'}],
  'cover': 'https://m.media-amazon.com/images/I/51gky1d3IWL._SL500_.jpg',
  'edition_num': None,
  'isbn_10': ['975072545X'],
  'isbn_13': ['9789750725456'],
  'number_of_pages': 440,
  'physical_format': 'paperback',
  'price': '$35.00',
  'price_amt': 3500,
  'product_group': 'Book',
  'publish_date': 'Apr 01, 2015',
  'publishers': ['Can Yayınları'],
  'source_records': ['amazon:975072545X'],
  'title': 'Kübadan Teleks',
  'url': 'https://www.amazon.com/dp/975072545X/?tag='}

As noted above, both author and translator have been lumped together as authors.

However, the metadata that comes back from the Amazon Products API lists both contributors with their individual roles, as shown by this excerpt of the by_line_info:

{'by_line_info': {'brand': None,
                  'contributors': [{'locale': 'en_US',
                                    'name': 'Rachel Kushner',
                                    'role': 'Author'},
                                   {'locale': 'en_US',
                                    'name': 'Suat Ertüzün',
                                    'role': 'Translator'}],

We will need to update the serialize() function, around line 271 or so in openlibrary/core/vendors.py, so that it stops importing translators as authors and instead extracts them separately under a new key for translators.
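A minimal sketch of the role split described above, working on plain dicts shaped like the by_line_info excerpt. The helper name split_contributors and the 'translators' key are hypothetical, chosen only for illustration; the real serialize() may read the API response differently and may want to treat other roles (editor, illustrator, etc.) in its own way.

```python
def split_contributors(contributors):
    """Separate Amazon contributors into authors and translators by role.

    Hypothetical helper: expects a list of dicts with 'name' and 'role' keys,
    shaped like the by_line_info excerpt in this issue.
    """
    authors = []
    translators = []
    for contributor in contributors or []:
        name = contributor.get("name")
        if not name:
            continue  # skip entries without a usable name
        role = (contributor.get("role") or "").strip().lower()
        if role == "author":
            authors.append({"name": name})
        elif role == "translator":
            translators.append({"name": name})
        # Any other role is ignored in this sketch (an assumption, not a decision).
    return {"authors": authors, "translators": translators}


by_line_contributors = [
    {"locale": "en_US", "name": "Rachel Kushner", "role": "Author"},
    {"locale": "en_US", "name": "Suat Ertüzün", "role": "Translator"},
]
result = split_contributors(by_line_contributors)
```

With the sample data from this issue, result keeps Rachel Kushner as the only author and puts Suat Ertüzün under the separate translators key.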

This will also require further changes to ensure this ends up in the correct format for import. This would likely be done in clean_amazon_metadata_for_load(), though it may involve other changes.

Specifically, the translator(s) should end up as contributors with a translator role. See import_contributor in https://github.com/internetarchive/openlibrary-client/blob/master/olclient/schemata/import.schema.json and https://openlibrary.org/books/OL24337004M.json (from https://openlibrary.org/books/OL24337004M/The_Odyssey_of_Homer)
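As a rough sketch of that last step, the function below moves a hypothetical 'translators' key into 'contributors' entries carrying a translator role. The function name, the 'translators' key, and the exact {'name': ..., 'role': ...} entry shape are assumptions for illustration; check the linked import schema and the actual clean_amazon_metadata_for_load() before relying on any of them.

```python
def translators_to_contributors(metadata):
    """Return a copy of metadata with translators listed under 'contributors'.

    Hypothetical helper: assumes 'translators' is a list of {'name': ...} dicts
    and that import contributors look like {'name': ..., 'role': ...}.
    """
    cleaned = {key: value for key, value in metadata.items() if key != "translators"}
    contributors = list(metadata.get("contributors", []))
    for translator in metadata.get("translators", []):
        contributors.append({"name": translator["name"], "role": "Translator"})
    if contributors:
        cleaned["contributors"] = contributors
    return cleaned


serialized = {
    "title": "Kübadan Teleks",
    "authors": [{"name": "Rachel Kushner"}],
    "translators": [{"name": "Suat Ertüzün"}],
}
import_ready = translators_to_contributors(serialized)
```

The input dict is left unmodified, which keeps the helper easy to test in isolation from BookWorm.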

As a test case, here's the full metadata that comes back from the Amazon API, which could be used for mocking a response from the API:

{'asin': '975072545X',
 'browse_node_info': None,
 'detail_page_url': 'https://www.amazon.com/dp/975072545X?tag=internetarchi-20&linkCode=ogi&th=1&psc=1',
 'images': {'primary': {'large': {'height': 500,
                                  'url': 'https://m.media-amazon.com/images/I/51gky1d3IWL._SL500_.jpg',
                                  'width': 321},
                        'medium': None, 'small': None},
            'variants': None},
 'item_info': {
   'by_line_info': {'brand': None,
                    'contributors': [{'locale': 'en_US', 'name': 'Rachel Kushner', 'role': 'Author'},
                                     {'locale': 'en_US', 'name': 'Suat Ertüzün', 'role': 'Translator'}],
                    'manufacturer': {'display_value': 'Can Yayınları', 'label': 'Manufacturer', 'locale': 'en_US'}},
   'classifications': {'binding': {'display_value': 'Paperback', 'label': 'Binding', 'locale': 'en_US'},
                       'product_group': {'display_value': 'Book', 'label': 'ProductGroup', 'locale': 'en_US'}},
   'content_info': {'edition': None,
                    'languages': {'display_values': [{'display_value': 'Turkish', 'type': 'Published'},
                                                     {'display_value': 'Turkish', 'type': 'Original Language'},
                                                     {'display_value': 'Turkish', 'type': 'Unknown'}],
                                  'label': 'Language', 'locale': 'en_US'},
                    'pages_count': {'display_value': 440, 'label': 'NumberOfPages', 'locale': 'en_US'},
                    'publication_date': {'display_value': '2015-04-01T00:00:00Z', 'label': 'PublicationDate', 'locale': 'en_US'}},
   'content_rating': None, 'external_ids': None, 'features': None,
   'manufacture_info': {'item_part_number': {'display_value': '1', 'label': 'PartNumber', 'locale': 'en_US'},
                        'model': None, 'warranty': None},
   'product_info': {'color': None, 'is_adult_product': None,
                    'item_dimensions': {'height': {'display_value': 7.6771653465, 'label': 'Height', 'locale': 'en_US', 'unit': 'inches'},
                                        'length': {'display_value': 0.393700787, 'label': 'Length', 'locale': 'en_US', 'unit': 'inches'},
                                        'weight': {'display_value': 0.7495716908, 'label': 'Weight', 'locale': 'en_US', 'unit': 'pounds'},
                                        'width': {'display_value': 4.9212598375, 'label': 'Width', 'locale': 'en_US', 'unit': 'inches'}},
                    'release_date': None, 'size': None, 'unit_count': None},
   'technical_info': None,
   'title': {'display_value': 'Kübadan Teleks', 'label': 'Title', 'locale': 'en_US'},
   'trade_in_info': None},
 'offers': {'listings': [{'availability': None, 'condition': None, 'delivery_info': None,
                          'id': 'oDg8%2Fu%2BR%2FL0WLzvFujN6xVbdeurVK9TOjcknlzQiZlCuRNpUE%2BqIJpTOjvUB2ZhJOrwvkyrZ%2FBQkPOUs9mYR6u4kbYl%2FK%2B4NUving6%2FRYRFFh5eqNUCw%2Fqk8%2F5ms3Jl%2FfP90CHh0Eaxlt1R9eXt%2FApgY%2BhG%2BSHueV%2F32lAWJ%2B2yH1ObOScrdwa3UJMG1AMHb',
                          'is_buy_box_winner': None, 'loyalty_points': None, 'merchant_info': None,
                          'price': {'amount': 35.0, 'currency': 'USD', 'display_amount': '$35.00', 'price_per_unit': None, 'savings': None},
                          'program_eligibility': None, 'promotions': None, 'saving_basis': None,
                          'violates_map': False}],
            'summaries': None},
 'parent_asin': None, 'rental_offers': None, 'score': None, 'variation_attributes': None}
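One possible way to use a dict like this as a mock is to wrap it so nested keys can be read as attributes. This is only a sketch: it assumes the code under test reads the API response through attribute access (e.g. product.item_info.by_line_info), which should be confirmed against serialize() itself. The helper name to_namespace is hypothetical, and the sample dict is a trimmed subset of the metadata in this issue.

```python
from types import SimpleNamespace


def to_namespace(value):
    """Recursively convert dicts (and lists of dicts) into SimpleNamespace objects."""
    if isinstance(value, dict):
        return SimpleNamespace(**{key: to_namespace(item) for key, item in value.items()})
    if isinstance(value, list):
        return [to_namespace(item) for item in value]
    return value


# Trimmed subset of the sample metadata, enough to exercise contributor handling.
sample = {
    "asin": "975072545X",
    "item_info": {
        "by_line_info": {
            "brand": None,
            "contributors": [
                {"locale": "en_US", "name": "Rachel Kushner", "role": "Author"},
                {"locale": "en_US", "name": "Suat Ertüzün", "role": "Translator"},
            ],
        },
        "title": {"display_value": "Kübadan Teleks", "label": "Title", "locale": "en_US"},
    },
}

product = to_namespace(sample)
roles = [contributor.role for contributor in product.item_info.by_line_info.contributors]
```

The resulting product object can then be passed to whatever function is under test in place of a live API response.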


hornc commented 1 month ago

Relates to #3084. That issue references editors in the title, but the discussion has examples where Amazon also calls out translator and editor roles in the same way.

tfmorris commented 1 month ago

I didn't really highlight the biggest problem here, which is that the two individuals Amazon lists separately, each with their own role, are being imported as a single author. I've updated the title to better reflect what's going on.

scottbarnes commented 1 month ago

I added some more details on how this issue might be approached, and included sample response data from the Amazon Products API with which to work. This is probably not a trivial issue to tackle and would make a fairly bad first issue.

DebbieSan commented 1 month ago

Hi @scottbarnes I would like to try and tackle this one as well. I think it might be somewhat related to one of the last issues I worked on. Thank you kindly!

scottbarnes commented 1 month ago

Thanks for offering to work on this, @DebbieSan! If you have any questions, please ask.

DebbieSan commented 1 month ago

@scottbarnes will do :) thank you!