internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

BWB import provenance footer is broken #7755

Closed tfmorris closed 1 year ago

tfmorris commented 1 year ago

Editions imported from Better World Books (BWB) contain two links in their footer, neither of which works, making it very difficult to tell whether bad source data or a broken importer is responsible for the mountains of garbage data being imported.

Evidence / Screenshot (if possible)

Screen Shot 2023-04-03 at 6 47 39 PM

Screen Shot 2023-04-03 at 6 48 09 PM

Relevant url?

https://openlibrary.org/books/OL45991226M

Steps to Reproduce

  1. Go to the URL above
  2. Click on the links at the bottom of the page

Details

Proposal & Constraints

Related files

Stakeholders

@hornc ? or perhaps @mekarpeles

hornc commented 1 year ago

This is a @mekarpeles one. MARC imports are my area, but I am not familiar with the new import queue code and the current BWB imports.

Reliable links to the source data are needed to help improve the import process when data quality issues are discovered. I believe there is an open issue about adding test cases for BWB / import queue imports; having visibility into real examples will help with choosing relevant cases.

jimchamp commented 1 year ago

When looking at your screenshots, we noticed that promise: is in the IA URL. That prefix should probably be removed from the anchor link.

hornc commented 1 year ago

@jimchamp there should also be no mention of MARC records for the promise: prefix items. The wording and link should be appropriate for the source data used.

jimchamp commented 1 year ago

It looks like this can be fixed pretty easily, and only requires changing some templates.

First, a new condition should be added to the get_source_record function. If the item starts with the string "promise:", the source_name should be something like "Promise item", and the url should be "//archive.org/details/" plus a substring of item that excludes the "promise:" prefix.

Once get_source_record has been updated, adding the new source_name to this condition should fix everything.
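The change described above might look roughly like this. This is a sketch with hypothetical structure; the real get_source_record is a helper in the Open Library codebase and has more branches:

```python
def get_source_record(item: str) -> tuple[str, str]:
    """Map a source_records entry to a (source_name, url) pair.

    Hypothetical sketch of the new condition only; the actual helper
    also handles Amazon, BWB, Internet Archive, and MARC entries.
    """
    if item.startswith("promise:"):
        # Promise items live in an archive.org item named after the
        # import batch, e.g. "promise:bwb_daily_pallets_2020-11-13".
        batch = item[len("promise:"):]
        return "Promise item", "//archive.org/details/" + batch
    # ... existing branches for amazon, bwb, ia, marc, etc. ...
    raise ValueError(f"unrecognized source record: {item}")
```

For example, get_source_record("promise:bwb_daily_pallets_2020-11-13") would yield ("Promise item", "//archive.org/details/bwb_daily_pallets_2020-11-13").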

AGoodName244 commented 1 year ago

Hi @mekarpeles ! Our team (@yujiezh9 and I) are students from a software engineering course. Both of us have experience in web application development and have successfully built and run this project locally. Could we be assigned to this task?

AGoodName244 commented 1 year ago

We have modified the code as instructed, and the page now displays the collection records as shown in the attached images.

However, we are unsure about what the second anchor link should be if it is a Promise item. According to line 17 of comment.html, the second link should have a record type. The previous code specified that the record type should be Amazon, BWB, Internet Archive, or MARC. Since we now have a new source name, 'Promise item', should we create a new record type or map it to an existing one? If we use an existing record type, should we still display the second anchor link? Or which link should we use here to display the detailed information about the book? We would appreciate any guidance on this matter.

jimchamp commented 1 year ago

@hornc and/or @mekarpeles, the second link that @AGoodName244 is referring to goes to the record's /show-record/{record_id} page. Is there any need for a /show-records page for promise item imports? I'm thinking that we can omit the second link for promise item imports, but I'm not sure if that would cause any negative consequences.

If we omit the second link, the comment will be something like: Imported from <a href="https://archive.org/details/{import_batch_name}">Promise item</a>
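The footer logic being proposed could be sketched as follows. The function name and exact markup are hypothetical; the real rendering lives in the edition-footer templates, and the second-link path follows the /show-records/{record_id} convention mentioned above:

```python
def import_footer_html(source_name, url, record_id=None):
    """Build the provenance footer comment for an imported edition.

    When there is no persisted individual record (as is currently the
    case for promise items), record_id is None and the second anchor
    link to /show-records/... is simply omitted.
    """
    html = f'Imported from <a href="{url}">{source_name}</a>'
    if record_id is not None:
        html += f' (<a href="/show-records/{record_id}">record</a>)'
    return html
```

With record_id=None this produces exactly the one-link form quoted above, e.g. Imported from &lt;a href="https://archive.org/details/{import_batch_name}"&gt;Promise item&lt;/a&gt;.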

AGoodName244 commented 1 year ago

Based on the discussion, we suggest adding a conditional branch to display a custom link for 'Promise items'. This will let us display the necessary information for now. If there are further requirements for 'Promise items' in the future, we can come back and add a new link, such as a 'Promise record', as needed.

tfmorris commented 1 year ago

I'm thinking that we can omit the second link for promise item imports, but I'm not sure if that would cause any negative consequences.

@jimchamp The second link, to the individual metadata record, is the most important one. The first link goes to the collection of records, which might contain many thousands of records. The individual record is key to following the metadata provenance (i.e., where poor-quality data is coming from).

jimchamp commented 1 year ago

As far as I know, Open Library is not persisting the individual records for promise items. I'm also noticing that the second links for Amazon and BWB imports ultimately resolve to product pages, which is also problematic.

I'll make sure that we're somehow persisting the raw import data as we improve our import pipeline this year.

tfmorris commented 1 year ago

Here's the metadata record, such as it is, for that volume:

curl -Lo - https://archive.org/download/bwb_daily_pallets_2020-11-13/DailyPallets__2020-11-13.json | jq -r .[17597]

{
  "BookBarcode": "221-AAF-376",
  "PackedLocation": "Mishawaka",
  "Sort": "Never Seen",
  "PalletBarcode": "IA-NS-0000401",
  "BookSKUB": "",
  "BookSKU": "221-AAF-376",
  "ISBN": "B0026RGAWC",
  "ASIN": "B0026RGAWC",
  "ProductJSON": {
    "ISBN": "B0026RGAWC",
    "ASIN": "B0026RGAWC",
    "Title": "Northern Fishes",
    "MasterProductId": "23423092",
    "BookId": "89341386",
    "Author": "Samuel; Surber, Thaddeus Eddy",
    "Publisher": "Charles T. Branford Company",
    "PublicationDate": "19600101"
  }
}

Obviously for a production solution, you'd want to save the byte offset and length as is done for the MARC records, then use HTTP byte range requests to fetch the data.
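A minimal sketch of that byte-range approach, assuming each record's byte offset and length were captured at import time (function names are hypothetical):

```python
import json
import urllib.request


def range_header(offset, length):
    """HTTP Range header value for `length` bytes starting at `offset`."""
    return f"bytes={offset}-{offset + length - 1}"


def fetch_record(url, offset, length):
    """Fetch a single JSON record out of a large batch file with an
    HTTP byte-range request, instead of downloading the whole file
    and indexing into it as the curl | jq example above does."""
    req = urllib.request.Request(
        url, headers={"Range": range_header(offset, length)}
    )
    # A server that honors the Range header responds 206 Partial Content
    # with only the requested slice.
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

archive.org download URLs honor Range requests, so a stored (offset, length) pair would give 1-click access to the exact record without fetching the whole pallet file.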

As you can see, in this case the corrupted author (i.e., "Samuel; Surber, Thaddeus Eddy") and the bad publication date (19600101) are present in the original metadata provided by BWB.

tfmorris commented 1 year ago

@jimchamp I'm not sure if you saw my comment above with the recommended solution:

for a production solution, you'd want to save the byte offset and length as is done for the MARC records, then use HTTP byte range requests to fetch the data.

Removing the link isn't really what I would consider a "fix." The point of the feature is to provide users with 1-click access to the source metadata that was used to create the record.