internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

Improve Better World Books import quality #6555

Open seabelis opened 2 years ago

seabelis commented 2 years ago

Sometimes an edition imported from Better World Books does not exist on the BWB site or contains different details. We should make sure we are not importing incorrect records.

Evidence / Screenshot (if possible)

Relevant url?

https://openlibrary.org/books/OL34984456M/Chase

Steps to Reproduce

  1. Go to https://openlibrary.org/books/OL34984456M/Chase
  2. Look at the import history, follow the link to the BWB source record, and see that its details differ from what was imported.

Stakeholders

@mekarpeles

LeadSongDog commented 2 years ago

What’s really missing is a linked local archive of the BWB record at the time it was imported. The same could be said of any imported record: a dead link URI is not enough. There should be no real expectation of them maintaining a record once they’ve sold out their inventory, and if they closed shop tomorrow it should not break anything on OL.
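One way to picture the archived record suggested above is a small envelope stored alongside the import. This is only a sketch under stated assumptions: `snapshot_source_record`, `has_changed`, and the envelope fields are hypothetical names, not part of Open Library's import code.

```python
import hashlib
import json
from datetime import datetime, timezone


def _canonical(record: dict) -> str:
    # Serialize deterministically so equal records always hash the same.
    return json.dumps(record, sort_keys=True, ensure_ascii=False)


def snapshot_source_record(raw_record: dict, source_url: str) -> dict:
    """Wrap a vendor record in an archival envelope (hypothetical format)."""
    payload = _canonical(raw_record)
    return {
        "source_url": source_url,
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(payload.encode("utf-8")).hexdigest(),
        "payload": payload,  # the record exactly as seen at import time
    }


def has_changed(snapshot: dict, current_record: dict) -> bool:
    """Report whether the vendor's current record differs from the snapshot."""
    digest = hashlib.sha256(_canonical(current_record).encode("utf-8")).hexdigest()
    return digest != snapshot["sha256"]
```

With a snapshot like this, a later mismatch between OL and the vendor page could be traced to either a change on the vendor's side or a fault in the import, even if the source URL has gone dead.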

seabelis commented 2 years ago

An example of an import not on BWB. https://openlibrary.org/books/OL37493066M/Seal_of_Armaros

LeadSongDog commented 2 years ago

@seabelis That is a 979-series ISBN, and those seem to be a common problem at BWB. Author searches for "Diana Elizabeth", rather unsurprisingly, find several matches. A title search on "Soul sentry" finds a different work by a different author.

Attempting to add a Crystal Throne Press 2022 paperback edition 9798985300819 found by a title search for "Seal of Armaros" on AMZ also threw an error:

01CFF3BE-B7C4-4A10-BCD7-4693063772A1

64839C7E-567B-4560-B7D7-155A8CFBD87A

[revised, blushing] Looking closer, it appears I didn’t select the "ISBN" identifier drop down. Still, the diagnostic could have been more instructive…

seabelis commented 2 years ago

I think the error occurs because the edition already exists: https://openlibrary.org/books/OL36540435M/Seal_of_Armaros The error message could do a better job of communicating that.

seabelis commented 2 years ago

Another example. https://openlibrary.org/books/OL35098930M/American_Librarian?m=history

seabelis commented 2 years ago

https://openlibrary.org/books/OL36097962M/Grow?m=history

seabelis commented 2 years ago

In general, there are frequent reports by patrons that the records imported from BWB either don't exist or don't exist as imported.

LeadSongDog commented 2 years ago

@seabelis For those last two ISBNs, BWB found both for me today, although the former had a different cover title "Librarian Spy". Perhaps it is a transient problem on their end? In any case the need for an archived record (as seen when imported) still seems pertinent.

seabelis commented 2 years ago

In these cases, the books were imported prior to publication and it seems their details hadn't been finalized. As far as I'm concerned that = "not a book".

LeadSongDog commented 2 years ago

@seabelis Do you think it should be a general rule that CIP (cataloguing in publication) = not a book, or would that be a step too far? At what point in the journey from working manuscript to library shelf should the threshold fall?

seabelis commented 1 year ago

@LeadSongDog Given the number of reports I receive, anything prior to publication is not a book.
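The rule "anything prior to publication is not a book" could be expressed as a simple gate at import time. A minimal sketch, assuming the importer already has a parsed publication date; `is_published` is a hypothetical helper, not an existing Open Library function.

```python
from datetime import date
from typing import Optional


def is_published(publish_date: Optional[date], today: Optional[date] = None) -> bool:
    """Accept a record only if its publication date is known and not in the future."""
    if publish_date is None:
        # No usable date: treat the record as unconfirmed rather than guess.
        return False
    return publish_date <= (today or date.today())
```

Records that fail the check could be queued and retried after the announced publication date instead of being dropped outright.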

seabelis commented 1 year ago

Date parsing error? https://openlibrary.org/books/OL27306846M/The_Hammer_of_Thor?v=1 This was mentioned recently by @mheiman

LeadSongDog commented 1 year ago

The source record, https://www.betterworldbooks.com/product/detail/9780141342542, presents a date of "Oct. 4th, 2016" and an isbn10 of "B01E5T8RF6", which is in fact the ASIN for a now-unavailable (or possibly never-was-available) Kindle edition, https://www.amazon.com/Magnus-Chase-Hammer-Thor-Book-ebook/dp/B01E5T8RF6. That ASIN record in turn presents a date of “October 4, 2016” and publisher Puffin. Since it seems the author now has a deal with Disney/Hyperion, it could be that Puffin no longer has the right to distribute the ebook.
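An ASIN sitting in an isbn10 field is something a check-digit test would catch. The sketch below uses the standard ISBN-10 and ISBN-13 checksum rules; the function names are hypothetical, and whether the importer validates these fields today is not established in this thread.

```python
DIGITS = "0123456789"


def is_valid_isbn10(value: str) -> bool:
    """Standard ISBN-10 check: weights 10..1, sum divisible by 11."""
    value = value.replace("-", "").upper()
    if len(value) != 10:
        return False
    total = 0
    for position, char in enumerate(value):
        if char in DIGITS:
            digit = int(char)
        elif char == "X" and position == 9:
            digit = 10  # "X" is only legal as the final check character
        else:
            return False
        total += (10 - position) * digit
    return total % 11 == 0


def is_valid_isbn13(value: str) -> bool:
    """Standard ISBN-13 check: alternating weights 1 and 3, sum divisible by 10."""
    value = value.replace("-", "")
    if len(value) != 13 or any(char not in DIGITS for char in value):
        return False
    total = sum(int(char) * (3 if index % 2 else 1) for index, char in enumerate(value))
    return total % 10 == 0


print(is_valid_isbn10("B01E5T8RF6"))     # False: an ASIN, not an ISBN-10
print(is_valid_isbn13("9780141342542"))  # True
print(is_valid_isbn13("9798985300819"))  # True
```

Note that 979-prefixed ISBN-13s have no ISBN-10 equivalent, so any isbn10 value attached to such a record deserves extra suspicion.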

As ImportBot only inhaled the BWB record on September 10, 2019, it seems unlikely to have changed since, but of course we can’t be sure without an archival image of the record as inhaled (rather than just a URL).

Is it possible the date parsing failed on "Oct. 4th, 2016" simply because of the ordinal form “4th”? This form appears widely on BWB.

Sorry this is so speculative, just looking for clues/theories.
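To make the ordinal theory concrete: Python's `datetime.strptime` has no directive for suffixes like "4th", so a string such as "Oct. 4th, 2016" fails unless it is normalized first. The sketch below is an illustration only, not Open Library's actual date parser; `parse_vendor_date` is a hypothetical name, and it assumes English month names (the default C locale).

```python
import re
from datetime import date, datetime
from typing import Optional

# Matches "st", "nd", "rd", "th" only when directly preceded by a digit,
# so month names such as "August" are left untouched.
ORDINAL_SUFFIX = re.compile(r"(?<=\d)(st|nd|rd|th)\b", re.IGNORECASE)


def parse_vendor_date(text: str) -> Optional[date]:
    """Parse dates like 'Oct. 4th, 2016' or 'October 4, 2016'; None if unparseable."""
    cleaned = ORDINAL_SUFFIX.sub("", text).replace(".", "").strip()
    for fmt in ("%b %d, %Y", "%B %d, %Y"):
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    return None


print(parse_vendor_date("Oct. 4th, 2016"))   # 2016-10-04
print(parse_vendor_date("October 4, 2016"))  # 2016-10-04
```

Returning `None` for anything unrecognized, rather than a best guess, would let the importer flag the record for review instead of storing a wrong date.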

seabelis commented 1 year ago

Thank you @LeadSongDog. We have a lot of bwb imports with date parsing errors. The date format you've pointed out may be the culprit. Using this issue to keep track of examples.

tfmorris commented 1 year ago

Perhaps it would make sense to establish some minimum quality standards for metadata sources. BWB would almost certainly fail to pass any reasonable bar.

LeadSongDog commented 1 year ago

@tfmorris I think you may be preaching to the choir. Personally I would prefer to adopt only vendor edition records that can be independently confirmed, e.g. in a library catalogue, but the integrated ebook publisher-vendors (Amazon, Barnes & Noble, Sony, etc.) can’t be expected to meet that test.

seabelis commented 1 year ago

Date: https://openlibrary.org/books/OL43325428M/Point_Counter_Point_by_Aldous_Huxley_(1947-12-06)

seabelis commented 1 year ago

Presumably date parsing errors here. https://openlibrary.org/authors/OL233814A/Enid_Blyton. My IP is blocked from BWB, so I cannot check their data against ours.

seabelis commented 1 year ago

Another example of imported record not matching source. https://openlibrary.org/books/OL28637565M/Murder_Games?v=1

scottbarnes commented 3 months ago

This should be revisited after implementing the solution in #9440, as it may improve the situation outlined here.