Closed adolski closed 9 months ago
The current link you have is broken unfortunately.
I found this API with Hathitrust, but it doesn't seem to have book cover images as an available endpoint.
An alternative one we could use is: https://openlibrary.org/dev/docs/api/covers
Here is an example (using Postman) of an API call on one of the books we have, using OCLC number as the query param. It allows for various sizes (S,M,L). The one I used for this example is medium sized.
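For reference, the same kind of URL can be built in Ruby. This is only a sketch: the helper name `openlibrary_cover_url` and the OCLC number in the usage line are made up for illustration, and the URL pattern is the one used later in this thread.

```ruby
# Hypothetical helper (not part of Book Tracker): builds a Covers API URL
# from an OCLC number and one of the three documented sizes.
COVER_SIZES = %w[S M L].freeze

def openlibrary_cover_url(oclc_number, size: 'M')
  size = size.to_s.upcase
  unless COVER_SIZES.include?(size)
    raise ArgumentError, "size must be one of #{COVER_SIZES.join(', ')}"
  end

  "https://covers.openlibrary.org/b/oclc/#{oclc_number}-#{size}.jpg"
end

# Placeholder OCLC number, medium size as in the Postman example.
puts openlibrary_cover_url(1234567, size: 'M')
```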
Thank you for the breakdown of how to approach this!
Also, I was a little confused by what they meant with "Do not crawl our API". Isn't consuming an API already different than crawling/web-scraping data based on html structure?
When you have some kind of result set, and are iterating through it in order to do something on/with each item, you could be said to be crawling through it. It's an informal term without a precise definition, but I think they just want to prevent users from spamming their API with tons of similar requests.
Yeah, I guess OpenLibrary would be the way to go.
This issue has several different aspects. I guess it could work something like:
1. Add an S3 bucket to store the images in
   - Add it to terraform scripts
   - Add it to Book Tracker configuration
2. Add a cover filename column to the books table
3. Add a rake task to iterate through all books in the database that have a null cover filename and copy a cover (if available) into the bucket (we want the large one)
   - OpenLibrary says not to crawl their API, so we should use the bulk download on archive.org that they suggest instead
4. Add a button in the UI to launch an ECS task to run the rake task
5. Expose the images on show-book pages and in book JSON representations
I deleted the new S3 bucket I created, so we can just use the existing book-tracker demo and prod buckets to store the images in.
@adolski Some clarifying questions as I work through this issue -
2. Add a cover filename column to the books table
I interpreted this as running a migration to add a column for cover filename (as a string) and adding an index on books. Is that correct? And do I need to update the actual UI of the books index?
Here's the migration:
class AddCoverFilenameToBooks < ActiveRecord::Migration[7.1]
  def change
    add_column :books, :cover_filename, :string
    add_index :books, :cover_filename
  end
end
3. Add a rake task to iterate through all books...
Should this rake task live inside the existing books.rake file, or should it be inside its own rake file, like -/tasks/download_covers.rake?
The index isn't necessary because there won't be a need to find books based on their cover filename. But the column is correct.
I think it would be useful to update the UI of the show-book page to display the cover in an <img> tag, for books that have one.
I think it can go in books.rake. Normally rake files are nouns and the tasks are verbs.
I'm unsure of how exactly the ecs task to run this rake task should be set up at the controller level.
Here is how I've set up things so far:
Inside books.rake:
desc 'Iterates through books, requests each cover from the OpenLibrary API, and stores it in the S3 bucket.'
task :download_book_covers, [:task_id] => :environment do
  require 'net/http'
  storage = Configuration.instance.storage[:books]
  s3 = Aws::S3::Client.new(
    access_key_id: storage[:access_key_id],
    secret_access_key: storage[:secret_access_key],
    region: storage[:region]
  )
  # Iterate through the books in batches, requesting each cover by OCLC number.
  Book.find_each do |book|
    uri = URI("http://covers.openlibrary.org/b/oclc/#{book.oclc_number}-L.jpg")
    response = Net::HTTP.get_response(uri)
    # Skip anything that is not a successful response, so error pages are not stored as covers.
    next unless response.is_a?(Net::HTTPSuccess)
    # Upload the response body (the image) to the configured S3 bucket.
    s3.put_object(
      bucket: storage[:bucket],
      key: "book_covers/#{book.oclc_number}.jpg",
      body: response.body
    )
  end
end
Inside /views/tasks/index.html.haml:
%table.table
  %td Download Book Covers
  %td
  %td
    = form_tag(download_path, method: 'post') do
      = submit_tag('Download', class: 'btn btn-primary btn-sm')
routes:
match 'download', to: 'tasks#download', via: :post
Inside the Tasks Controller I have a 'download' method/action to launch an ECS task to run the rake task. I'm looking through the official AWS docs on the ECS API. Do I essentially need to call on something similar to this?
resp = client.run_task({
  cluster: "default",
  task_definition: "sleep360:1",
})
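One possible shape for that call, sketched as a plain parameter builder so it can be read without the AWS SDK loaded. Everything named here is an assumption for illustration: the helper `rake_run_task_params`, the cluster, the task definition, and the container name are placeholders, and a Fargate setup would likely also need `launch_type` and `network_configuration`.

```ruby
# Hypothetical helper: builds the parameter hash for ECS run_task, overriding
# the container command so the task runs a rake task instead of the web server.
def rake_run_task_params(rake_task, cluster:, task_definition:, container_name:)
  {
    cluster: cluster,
    task_definition: task_definition,
    overrides: {
      container_overrides: [
        { name: container_name, command: ['bundle', 'exec', 'rake', rake_task] }
      ]
    }
  }
end

# Placeholder values; substitute whatever the real deployment uses.
params = rake_run_task_params(
  'download_book_covers',
  cluster: 'example-cluster',
  task_definition: 'example-task-definition',
  container_name: 'example-container'
)

# With the aws-sdk-ecs gem loaded, the controller action could then call:
#   Aws::ECS::Client.new.run_task(params)
puts params.inspect
```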
Your download_book_covers task is basically the right idea. But I've looked into the OpenLibrary covers some more. Further down in their documentation they say that their API is rate-limited to 100 requests every 5 minutes. For our 853,309 books, this will result in a minimum of (((853309/100)*5)/60)/24 = 29.6 days to get everything.
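The same estimate, spelled out step by step in Ruby. The method name `minimum_days` is just for illustration; the inputs are the book count and rate limit quoted above.

```ruby
BOOK_COUNT = 853_309
REQUESTS_PER_WINDOW = 100 # rate limit: 100 requests...
WINDOW_MINUTES = 5        # ...every 5 minutes

# Lower bound on elapsed days if every request uses the full rate limit.
def minimum_days(book_count, per_window: REQUESTS_PER_WINDOW, window_minutes: WINDOW_MINUTES)
  windows = book_count.to_f / per_window # number of 5-minute windows needed
  minutes = windows * window_minutes     # total minutes
  minutes / 60 / 24                      # minutes -> hours -> days
end

puts minimum_days(BOOK_COUNT).round(1) # prints 29.6
```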
If we were to instead use the zip files on archive.org, there are 390 GB of them that we would have to download, which I guess we could still do, but...
In light of that, I think there is a better way to get these covers, which is to do it on-demand at the point that they are being viewed.
On the show-book page, there could be an img element for the book cover like so: <img src="http://covers.openlibrary.org/b/oclc/#{book.oclc_number}-L.jpg" alt="Book cover">
Building that out more: not all books will have covers, and we don't want to display a broken image for those, so we could use the suggestion in the Cover Size & API Access section to only render an img if a cover is present.
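A rough sketch of what "only render an img if a cover is present" could look like on the server side. The helper `cover_img_tag` and the `HEAD_STATUS` lambda are hypothetical. It relies on the `?default=false` parameter, which, as I read the OpenLibrary covers documentation, makes the API return a 404 instead of a blank placeholder when no cover exists. It also assumes a redirect means a cover exists. Note that each check is itself a request that counts against the rate limit.

```ruby
require 'net/http'
require 'uri'

# Default lookup: issues a HEAD request and returns the integer HTTP status.
HEAD_STATUS = lambda do |uri|
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.head(uri.request_uri).code.to_i
  end
end

# Hypothetical helper: returns an <img> tag string, or nil when no cover is found.
# `status_for` is injectable so the lookup can be stubbed out.
def cover_img_tag(oclc_number, status_for: HEAD_STATUS)
  base = "https://covers.openlibrary.org/b/oclc/#{oclc_number}-L.jpg"
  status = status_for.call(URI("#{base}?default=false"))
  # Assumption: 2xx and 3xx both indicate that a cover exists.
  return nil unless (200..399).cover?(status)

  %(<img src="#{base}" alt="Book cover">)
end

# Stubbed usage (no network): a 404 means nothing is rendered.
puts cover_img_tag(42, status_for: ->(_uri) { 404 }).inspect
```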
The ultimate goal of this issue is to get covers showing up in the Search Gateway. But if we are loading them this way, then all the work will be on that side, and there isn't much more to do on the Book Tracker side.
Great catch, I hadn't noticed the caveat on the rate limit.
Based on what you laid out, here is an example of what a show-book page might look like, for a cover image that exists.
And here is the page when no cover image exists. There is no broken image link or any other error, so I don't know that we even need to add anything to only render an img if a cover exists.
I verified the cover image exists for the first one, and not for the second, by appending .json to the url and making the request in Postman.
If we go with this approach, could I pretty much roll back the previous migration for adding a cover filename, and remove the rake task?
Yes, roll back all that stuff, and let's just go with this. Regarding the location of the image on the page, I'd say just put it somewhere where it looks good and makes sense. It can be smaller or larger too--whatever works.
There are web services that, when given a book identifier, will return a book thumbnail image. One of these is the HathiTrust data API. There may be others as well. We should explore our options and then choose the one that will give us the best coverage.