internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

BWB imports have invented publication dates #7757

Closed tfmorris closed 1 year ago

tfmorris commented 1 year ago

The number of editions published on January 1st has skyrocketed recently and the apparent cause is either BetterWorldBooks (BWB) metadata or a bug in the BWB importer.

Evidence / Screenshot (if possible)

[Screenshot: Screen Shot 2023-04-03 at 6 58 15 PM]

Relevant url?

Steps to Reproduce

  1. Go to ... almost any recently imported edition
  2. Do ... check the recorded publication date against the actual date from a reliable source
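
To check a whole batch rather than one edition at a time, a rough sketch like the following could measure how many full publication dates land on January 1st. The function name and the sample dates are made up for illustration and are not Open Library code or data. If publication days were spread evenly over the year, the share would be close to 1/365.

```python
from datetime import date


def january_first_share(publish_dates):
    """Return the fraction of full dates that fall on January 1st."""
    full_dates = [d for d in publish_dates if isinstance(d, date)]
    if not full_dates:
        return 0.0
    jan_first = sum(1 for d in full_dates if (d.month, d.day) == (1, 1))
    return jan_first / len(full_dates)


# Made-up sample for illustration only, not real import data.
sample = [date(2005, 1, 1), date(1960, 1, 1), date(2005, 5, 1), date(2011, 9, 13)]
share = january_first_share(sample)
```

A share far above 1/365 across recent imports would support the suspicion that many of these dates are placeholders.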

Proposal & Constraints

Stop importing bad data from BetterWorldBooks (BWB)

Stakeholders

@mekarpeles @hornc

AGoodName244 commented 1 year ago

Hi, @jimchamp! Our team (@yujiezh9 and I) are students from a software engineering course. Both of us have experience in web application development, and we have successfully built and run this project locally. Could we be assigned to this task? Thank you!

jimchamp commented 1 year ago

@AGoodName244 this is a data issue that I don't quite have an approach for right now, but will be addressed as we revamp our import pipeline this year. Maybe you'd be interested in #7755, which has a solution outlined? If so, comment on that issue and somebody will assign you.

AGoodName244 commented 1 year ago

Thank you for your response and suggestion. We will certainly take a look at issue #7755. Before your reply, we had done some research into the data issue and found some potential clues, although we haven't solved it yet. (Our findings are based on the development version.)

  1. We examined the Amazon and Better World Books metadata extractors in openlibrary/core/vendors.py. The metadata from Amazon includes a publish_date, while the metadata from Better World Books does not.
  2. We also looked at the books related to this issue and to issue #7756. According to its Amazon page, "Northern fishes" does have a publish date of January 1st, 1960. However, "Northern fishes, with special reference to the upper Mississippi valley" was also given a January 1st publish date, which looks abnormal: the Amazon source for this book does not provide an exact publication date.
  3. In vendors.py, the serialize function parses the publication_date. However, when the date is incomplete (for example, a date with only a year), the parse function fills in the month and day with default values. We therefore assume that if the program cannot get an exact date from Amazon, the parser assigns a default month and day to the book, although we have not yet tested this.

We hope that this investigation is helpful to the task, and we remain eager to contribute to the project. Please let us know if you have any further guidance or feedback.
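
The mechanism suspected in point 3 can be shown with the Python standard library. This is a minimal sketch and not the vendors.py code: `parse_with_defaults` is a hypothetical helper, and `datetime.strptime` stands in for whatever parser the project uses. It shows that a year-only string parsed into a full date silently becomes January 1st.

```python
from datetime import datetime


def parse_with_defaults(raw: str) -> datetime:
    """Parse 'YYYY-MM-DD', 'YYYY-MM' or 'YYYY'.

    Hypothetical helper: fields missing from the input are not reported,
    strptime simply fills them with its defaults (month 1, day 1).
    """
    for fmt in ("%Y-%m-%d", "%Y-%m", "%Y"):
        try:
            return datetime.strptime(raw, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {raw!r}")


year_only = parse_with_defaults("1960")
as_text = year_only.date().isoformat()  # the precision of the input is lost here
```

After parsing, nothing distinguishes a real January 1st from a year-only source value, which is why such dates are hard to clean up downstream.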

jimchamp commented 1 year ago

@AGoodName244, the code in /core/vendors.py is unrelated to our importer code. If I recall correctly, vendors.py fetches price information for book pages, and is used to create new editions, if needed, when people visit an /isbn/{isbn} page.

The publish dates were parsed correctly from the source import data in each of the cases that you outlined:

| Record | Title | Source Data |
| --- | --- | --- |
| OL45991226M | Northern Fishes | JSON |
| OL45868829M | Northern fishes, with special reference to the upper Mississippi valley | JSON |

AGoodName244 commented 1 year ago

Thank you so much for the clarification, and apologies for any confusion caused. After examining the JSON files, we have some concerns about the data, which appears to contain some discrepancies. In the JSON file linked for record OL45868829M (Northern fishes, with special reference to the upper Mississippi valley), the book "Managerial Epidemiology" (ASIN: 076373165X) shows a publication date of 20050101, while the website displays it as May 1, 2005.

We also appreciate your suggestion to focus on issue #7755, and we will certainly look into it. We would be happy to continue contributing. Thank you again for your support.

mekarpeles commented 1 year ago

For these cases, can we add logic to the BWB importer (and explicitly only there) that ignores 01-01 and imports just the year?

If the years are wrong, then the recourse we have is for humans/librarians to fix them.

If someone wants to help, the relevant code will be in

  1. https://github.com/internetarchive/openlibrary/blob/master/scripts/promise_batch_imports.py (definitely) and
  2. https://github.com/internetarchive/openlibrary/blob/master/scripts/partner_batch_imports.py (optionally)
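
The proposal above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in either script: the function name `drop_placeholder_jan_first` is hypothetical, and the sketch assumes dates arrive as YYYY-01-01 or YYYY0101 strings.

```python
import re

# Assumed input formats for this sketch: "YYYY-01-01" or "YYYY0101".
_JAN_FIRST = re.compile(r"^(\d{4})-?01-?01$")


def drop_placeholder_jan_first(publish_date: str) -> str:
    """Return only the year when the date is January 1st, else the input unchanged."""
    match = _JAN_FIRST.match(publish_date.strip())
    return match.group(1) if match else publish_date


truncated = drop_placeholder_jan_first("20050101")   # "2005"
untouched = drop_placeholder_jan_first("2005-05-01")  # returned as-is
```

The trade-off is that a book genuinely published on January 1st also loses its month and day, but the year stays intact.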

LeadSongDog commented 1 year ago

@mekarpeles Surely that isn’t scalable. We don’t have that many human contributors. We know those -01-01 dates are nearly always bogus: what publisher works New Year’s Day? Just fix ImportBot.

seabelis commented 1 year ago

I would not object to importing years only for all cases. 01-01 is just an upstream-enforced date we are importing from elsewhere. Even when a book has an exact date specified (French publishers frequently do this), the correct date is not what we import. Even when the date is not 01-01, the imported exact dates never match the actual dates specified in the books. Mass-market paperbacks frequently do specify a year and a month. Amazon imports of these are frequently a month off (or sometimes a year off, if it happens to be Dec/Jan). I suspect many of these seemingly arbitrary dates have to do with either the date the item was added to Amazon or the date it went on sale. Neither of these is relevant to us, as we are not aiming to be a mirror of Amazon.

mekarpeles commented 1 year ago

Discussed with @judec -- likely an upstream problem with dates coming in as 01-01.

Both of these seem like useful places to investigate (BWB monthly imports and promise item imports):