Book thumbnail images - Githubissues

adolski commented 1 year ago

There are web services that, when given a book identifier, will return a book thumbnail image. One of these is the HathiTrust data API. There may be others as well. We should explore our options and then choose the one that will give us the best coverage.

gaurijo commented 10 months ago

Unfortunately, the current link you have is broken.

I found this API with HathiTrust, but it doesn't seem to offer book cover images through any of its endpoints.

An alternative one we could use is: https://openlibrary.org/dev/docs/api/covers

Here is an example (using Postman) of an API call for one of our books, using its OCLC number as the identifier. The API offers several sizes (S, M, L); this example uses the medium size.
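As a rough illustration of the URL pattern behind that call, here is a minimal sketch. The helper name `cover_url` and the OCLC number `12345` are placeholders, not anything from Book Tracker; the `/b/oclc/` path and the S/M/L size suffixes are the ones used elsewhere in this thread.

```ruby
# Hypothetical helper (not part of Book Tracker): builds an Open Library
# cover URL for a given OCLC number and size.
COVER_SIZES = %w[S M L].freeze

def cover_url(oclc_number, size = 'M')
  unless COVER_SIZES.include?(size)
    raise ArgumentError, "size must be one of #{COVER_SIZES.join(', ')}"
  end

  "http://covers.openlibrary.org/b/oclc/#{oclc_number}-#{size}.jpg"
end

# '12345' is a made-up OCLC number used only for illustration.
puts cover_url('12345', 'M')
# prints http://covers.openlibrary.org/b/oclc/12345-M.jpg
```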

adolski commented 10 months ago

Yeah, I guess OpenLibrary would be the way to go.

This issue has several different aspects. I guess it could work something like:

1. Add an S3 bucket to store the images in
   - Add it to Terraform scripts
   - Add it to Book Tracker configuration
2. Add a cover filename column to the books table
3. Add a rake task to iterate through all books in the database that have a null cover filename and copy a cover (if available) into the bucket (we want the large one)
   - OpenLibrary says not to crawl their API, so we should use the bulk download on archive.org that they suggest instead
4. Add a button in the UI to launch an ECS task to run the rake task
5. Expose the images on show-book pages and in book JSON representations

gaurijo commented 10 months ago

Thank you for the breakdown of how to approach this!

Also, I was a little confused by what they meant by "Do not crawl our API". Isn't consuming an API already different from crawling or web-scraping data based on HTML structure?

adolski commented 10 months ago

When you have some kind of result set, and are iterating through it in order to do something on/with each item, you could be said to be crawling through it. It's an informal term without a precise definition, but I think they just want to prevent users from spamming their API with tons of similar requests.

gaurijo commented 10 months ago

I deleted the new S3 bucket I created, so we can just use the existing book-tracker demo and prod buckets to store the images in.

gaurijo commented 10 months ago

@adolski Some clarifying questions as I work through this issue -

2. Add a cover filename column to the books table

I interpreted this as running a migration to add a column for cover filename (as a string) and adding an index on books. Is that correct? And do I need to update the actual UI of the books index?

Here's the migration:

class AddCoverFilenameToBooks < ActiveRecord::Migration[7.1]
  def change
    add_column :books, :cover_filename, :string
    add_index :books, :cover_filename
  end
end

3. Add a rake task to iterate through all books...

Should this rake task live inside the existing books.rake file, or should it be inside its own rake file, like -/tasks/download_covers.rake?

adolski commented 10 months ago

I interpreted this as running a migration to add a column for cover filename (as a string) and adding an index on books. Is that correct? And do I need to update the actual UI of the books index?

The index isn't necessary because there won't be a need to find books based on their cover filename. But the column is correct.

I think it would be useful to update the UI of the show-book page to display the cover in an <img> tag, for books that have one.

3. Add a rake task to iterate through all books...

Should this rake task live inside the existing books.rake file, or should it be inside its own rake file, like -/tasks/download_covers.rake?

I think it can go in books.rake. Normally rake files are named with nouns and the tasks with verbs.

gaurijo commented 10 months ago

I'm unsure exactly how the ECS task that runs this rake task should be set up at the controller level.

Here is how I've set up things so far:

Inside books.rake:

desc 'Iterates through books, requests each cover from the Open Library API, and stores the image in the S3 bucket.'
task :download_book_covers, [:task_id] => :environment do
  require 'net/http'

  storage = Configuration.instance.storage[:books]
  s3 = Aws::S3::Client.new(
    access_key_id: storage[:access_key_id],
    secret_access_key: storage[:secret_access_key],
    region: storage[:region]
  )

  # Iterate through the books in batches rather than loading them all at once.
  Book.find_each do |book|
    next if book.oclc_number.blank?

    # Request the large cover from Open Library by OCLC number.
    uri = URI("http://covers.openlibrary.org/b/oclc/#{book.oclc_number}-L.jpg")
    response = Net::HTTP.get_response(uri)
    # Only store successful responses; get_response does not follow redirects.
    next unless response.is_a?(Net::HTTPSuccess)

    # Upload the response body (the image) to the S3 bucket.
    s3.put_object(
      bucket: storage[:bucket],
      key: "book_covers/#{book.oclc_number}.jpg",
      body: response.body
    )
  end
end

inside /views/tasks/index.html.haml:

%table.table
  %tr
    %td Download Book Covers
    %td
    %td
      = form_tag(download_path, method: 'post') do
        = submit_tag('Download', class: 'btn btn-primary btn-sm')

routes:

match 'download', to: 'tasks#download', via: :post

Inside the Tasks Controller I have a 'download' method/action to launch an ECS task to run the rake task. I'm looking through the official AWS docs on the ECS API. Do I essentially need to call on something similar to this?

resp = client.run_task({
  cluster: "default", 
  task_definition: "sleep360:1", 
})
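For what it's worth, a call along those lines would usually also need to override the container command so that it runs the rake task. The sketch below only builds the parameter hash for `Aws::ECS::Client#run_task`. The helper name `rake_run_task_params`, the container name `app`, and the `books:` namespace are all assumptions for illustration, not taken from the Book Tracker setup.

```ruby
# Builds the parameter hash for Aws::ECS::Client#run_task so that the
# container runs a rake task instead of its default command.
# All values passed in below are placeholders.
def rake_run_task_params(cluster:, task_definition:, container_name:, rake_task:)
  {
    cluster: cluster,
    task_definition: task_definition,
    overrides: {
      container_overrides: [
        { name: container_name, command: ['bundle', 'exec', 'rake', rake_task] }
      ]
    }
  }
end

params = rake_run_task_params(
  cluster: 'default',
  task_definition: 'sleep360:1',
  container_name: 'app',                  # hypothetical container name
  rake_task: 'books:download_book_covers' # assumes the task sits in a `books` namespace
)

# With the aws-sdk-ecs gem loaded, this hash could be passed along as:
#   Aws::ECS::Client.new.run_task(params)
p params[:overrides][:container_overrides].first[:command]
```

A real task definition may also need a launch type and network configuration, depending on how the cluster is set up.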

adolski commented 10 months ago

Your download_book_covers task is basically the right idea. But, I've looked into the OpenLibrary covers some more. Further down in their documentation they say that their API is rate-limited to 100 requests every 5 minutes. For our 853,309 books, this will result in a minimum of (((853309/100)*5)/60)/24 = 29.6 days to get everything.
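The arithmetic can be checked with a few lines of Ruby. This is just a sketch of the estimate above; `minimum_days` is a made-up helper name, and the figures are the ones quoted in this comment.

```ruby
# Figures from the comment above: 853,309 books, 100 requests per 5 minutes.
BOOK_COUNT = 853_309
REQUESTS_PER_WINDOW = 100
WINDOW_MINUTES = 5

# Minimum number of days needed to make one request per book at the
# given rate limit.
def minimum_days(book_count, requests_per_window, window_minutes)
  windows = book_count.to_f / requests_per_window
  minutes = windows * window_minutes
  minutes / 60 / 24
end

puts minimum_days(BOOK_COUNT, REQUESTS_PER_WINDOW, WINDOW_MINUTES).round(1)
# prints 29.6
```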

If we were to instead use the zip files on archive.org, there are 390 GB of them that we would have to download, which I guess we could still do, but...

In light of that, I think there is a better way to get these covers, which is to do it on-demand at the point that they are being viewed.

On the show-book page, there could be an img element for the book cover like so: <img src="http://covers.openlibrary.org/b/oclc/#{book.oclc_number}-L.jpg" alt="Book cover">

Building that out more, and noting that not all books will have covers, and we don't want to display a broken image for those, we could use the suggestion in the Cover Size & API Access section to only render an img if a cover is present.
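One way to sketch a server-side "is there a cover?" check is shown below. It assumes the `?default=false` option from the Open Library covers documentation, which as I understand it makes the API answer 404 for a missing cover rather than returning a blank placeholder. The helper name `cover_available?` and the injectable `fetch` argument are hypothetical, added only so the logic can be tried without network access.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: returns true when Open Library appears to have a
# cover for the given OCLC number.
# Assumption: with `?default=false`, a missing cover yields a 404 instead
# of a blank placeholder image.
def cover_available?(oclc_number, fetch: ->(uri) { Net::HTTP.get_response(uri) })
  uri = URI("http://covers.openlibrary.org/b/oclc/#{oclc_number}-L.jpg?default=false")
  response = fetch.call(uri)
  # A found cover may be served directly (2xx) or through a redirect (3xx).
  response.code.to_i.between?(200, 399)
end

# Example with a stubbed response, so no request is actually made.
stub_response = Struct.new(:code)
puts cover_available?('12345', fetch: ->(_uri) { stub_response.new('404') })
# prints false
```

Note that a check like this costs one extra request per page view, which counts against the same rate limit, so letting the browser load the image directly may still be the simpler option.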

The ultimate goal of this issue is to get covers showing up in the Search Gateway. But if we are loading them this way, then all the work will be on that side, and there isn't much more to do on the Book Tracker side.

gaurijo commented 10 months ago

Great catch, I hadn't noticed the caveat on the rate limit.

Based on what you laid out, here is an example of what a show-book page might look like, for a cover image that exists.

And here is if no cover image exists - there is no broken image link or any other error, so I don't know that we need to even add anything to only render an img if a cover exists.

I verified that the cover image exists for the first one, and not for the second, by appending .json to the URL and making the request in Postman.

If we go with this approach, could I pretty much roll back the previous migration for adding a cover filename, and remove the rake task?

adolski commented 10 months ago

Yes, roll back all that stuff, and let's just go with this. Regarding the location of the image on the page, I'd say just put it somewhere where it looks good and makes sense. It can be smaller or larger too--whatever works.

medusa-project / book-tracker

Book thumbnail images #11