Closed tfmorris closed 1 year ago
This is a @mekarpeles one. MARC imports are my area, but I am not familiar with the new import queue code and the current BWB imports.
Reliable links to the source data are needed to help improve the import process when data quality issues are discovered. I believe there is an open issue about adding test cases for BWB / import queue imports; having visibility into real examples will help with choosing relevant cases.
When looking at your screenshots, we noticed that "promise:" is in the IA URL. This should probably be removed from the anchor link.
@jimchamp there should also be no mention of MARC records for items with the "promise:" prefix. The wording and link should be appropriate for the source data used.
It looks like this can be fixed pretty easily, and only requires changing some templates.
First, a new condition should be added to the get_source_record function. If the item starts with the string "promise:", the source_name should be something like "Promise item", and the url should be "//archive.org/details/" plus a substring of item that excludes the "promise:" prefix.
Once get_source_record has been updated, adding the new source_name to this condition should fix everything.
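The change described above could look roughly like the following. This is a minimal sketch, not the actual Open Library code: the real signature and return type of get_source_record are not shown in this thread, so get_source_record_sketch is a hypothetical stand-in that returns a (source_name, url) pair, and the fallback branch is only a placeholder for the existing cases.

```python
from __future__ import annotations

PROMISE_PREFIX = "promise:"


def get_source_record_sketch(item: str) -> tuple[str, str]:
    """Hypothetical stand-in for get_source_record; returns (source_name, url)."""
    if item.startswith(PROMISE_PREFIX):
        # Drop the "promise:" prefix so it does not leak into the IA URL.
        identifier = item[len(PROMISE_PREFIX):]
        return "Promise item", "//archive.org/details/" + identifier
    # Placeholder for the existing branches (Amazon, bwb, Internet Archive, MARC).
    return "Unknown", ""


source_name, url = get_source_record_sketch("promise:bwb_daily_pallets_2020-11-13")
print(source_name, url)
```

The template condition mentioned above would then need to recognize "Promise item" alongside the existing source names.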
Hi @mekarpeles! Our team (@yujiezh9 and I) are students in a software engineering course. We both have experience in web application development and have successfully built and run this project locally. Could we be assigned to this task?
We have modified the code as instructed, and the page now displays the collection records as shown in the attached images. However, we are unsure what the second anchor link should be for a Promise item. According to line 17 of comment.html, the second link should have a record type. The previous code specified that the record type should be Amazon, bwb, Internet Archive, or MARC. Since we now have a new source name, 'Promise item', should we create a new record type or map it to an existing one? If we use an existing record type, should we still display the second anchor link? If so, which link should we use to display the detailed information about the book? We would appreciate any guidance on this matter.
@hornc and/or @mekarpeles, the second link that @AGoodName244 is referring to goes to the record's /show-record/{record_id} page. Is there any need for a /show-records page for promise item imports? I'm thinking that we can omit the second link for promise item imports, but I'm not sure if that would cause any negative consequences.
If we omit the second link, the comment will be something like:
Imported from <a href="https://archive.org/details/{import_batch_name}">Promise item</a>
Based on the discussion, we suggest writing a condition branch to display a custom link for 'Promise items'. This will enable us to display the necessary information for now. If there are any further requirements for 'Promise items' in the future, we can come back and add a new link such as a 'Promise record' as needed.
I'm thinking that we can omit the second link for promise item imports, but I'm not sure if that would cause any negative consequences.
@jimchamp The second link, to the individual metadata record, is the most important one. The first link goes to the collection of records, which might have many thousands of records in it. The individual record is key to following the metadata provenance (i.e. finding where poor quality data is coming from).
As far as I know, Open Library is not persisting the individual records for promise items. I'm also noticing that the second links for Amazon and BWB imports ultimately resolve to product pages, which is also problematic.
I'll make sure that we're somehow persisting the raw import data as we improve our import pipeline this year.
Here's the metadata record, such as it is, for that volume:
curl -Lo - https://archive.org/download/bwb_daily_pallets_2020-11-13/DailyPallets__2020-11-13.json | jq -r '.[17597]'
{
  "BookBarcode": "221-AAF-376",
  "PackedLocation": "Mishawaka",
  "Sort": "Never Seen",
  "PalletBarcode": "IA-NS-0000401",
  "BookSKUB": "",
  "BookSKU": "221-AAF-376",
  "ISBN": "B0026RGAWC",
  "ASIN": "B0026RGAWC",
  "ProductJSON": {
    "ISBN": "B0026RGAWC",
    "ASIN": "B0026RGAWC",
    "Title": "Northern Fishes",
    "MasterProductId": "23423092",
    "BookId": "89341386",
    "Author": "Samuel; Surber, Thaddeus Eddy",
    "Publisher": "Charles T. Branford Company",
    "PublicationDate": "19600101"
  }
}
Obviously for a production solution, you'd want to save the byte offset and length as is done for the MARC records, then use HTTP byte range requests to fetch the data.
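To make the byte-range idea concrete, here is a minimal sketch using only the Python standard library. It assumes an import step has already stored each record's byte offset and length; the function names are hypothetical and nothing here is taken from the Open Library codebase. The network call is defined but not invoked, and the last few lines show the same offset/length lookup against an in-memory blob.

```python
from __future__ import annotations

import json
import urllib.request


def range_header(offset: int, length: int) -> dict[str, str]:
    """Build an HTTP Range header; the end position is inclusive."""
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


def fetch_record(url: str, offset: int, length: int) -> dict:
    """Fetch only the bytes of one JSON record (performs a network request)."""
    request = urllib.request.Request(url, headers=range_header(offset, length))
    with urllib.request.urlopen(request) as response:
        # A server that honors the Range header answers 206 Partial Content;
        # a 200 would mean the whole file came back instead.
        if response.status != 206:
            raise RuntimeError("server did not honor the Range request")
        return json.loads(response.read())


# Offline illustration: locate one record inside a small in-memory JSON array
# and read it back using only its stored offset and length.
records = [{"BookBarcode": "A"}, {"BookBarcode": "B"}]
chunks = [json.dumps(r).encode() for r in records]
blob = b"[" + b",".join(chunks) + b"]"
offset = blob.index(chunks[1])
length = len(chunks[1])
print(json.loads(blob[offset:offset + length]))
```

The offsets would have to be computed against the exact bytes of the file served from archive.org, since any re-serialization of the JSON would shift them.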
As you can see, in this case the corrupted author (i.e. "Samuel; Surber, Thaddeus Eddy") and the bad publication date (19600101) are both in the original metadata provided by BWB.
@jimchamp I'm not sure if you saw my comment above with the recommended solution:
for a production solution, you'd want to save the byte offset and length as is done for the MARC records, then use HTTP byte range requests to fetch the data.
Removing the link isn't really what I would consider a "fix." The point of the feature is to provide users with 1-click access to the source metadata that was used to create the record.
Editions imported from BetterWorldBooks (BWB) contain two links in their footer, neither of which works. This makes it very difficult to tell whether bad source data or a broken importer is causing the mountains of garbage data being imported.
Evidence / Screenshot (if possible)
Relevant url?
https://openlibrary.org/books/OL45991226M
Stakeholders
@hornc ? or perhaps @mekarpeles