medusa-project / book-tracker

Medusa Book Tracker

Automate the Google Books check #5

Closed adolski closed 8 months ago

adolski commented 1 year ago

In the past, Google Books' administrative console used IP address whitelisting for access control, and we had the Book Tracker server's IP address whitelisted so that it could programmatically fetch a list of our books from Google without any authentication. Later, Google Books changed to use the main Google authentication system, so I had to revise this system to require users to manually download the inventory file from Google and then upload it to the Book Tracker. We would like to re-automate this process, if possible.

Note that the part of Google Books we need to interact with is not available via the main Google API, which uses OAuth. So, this may require using some kind of tool like Selenium that simulates a browser.

(Looping in @henryborchers)

gaurijo commented 11 months ago

@adolski Should this issue be worked on concurrently with #6?

adolski commented 11 months ago

@gaurijo This is more of a prerequisite for #6.

gaurijo commented 11 months ago

This is my high-level understanding of this issue (please let me know if I'm missing important context or made a wrong assumption):

Essentially, we want to eliminate the need for the user to download the Google Books text file and re-upload it to the Book Tracker. That can hopefully be achieved with an automation tool like Selenium, which mimics how the process works in the browser. (I've used Selenium once on a side project, so I'm familiar with it at a baseline level.)

I'm not able to access the Google Books path right now. Could I get permissions granted for that?

adolski commented 11 months ago

I can't get in either, apparently. I think I last used the Google Books backend about five years ago. I don't even remember offhand how to gain access. I will look into this some more and keep you posted.

But your understanding is basically correct: we want the Book Tracker to be able to download the inventory file itself without a human having to log in and download it and then upload it to the Book Tracker. We want the Google check to work like the HathiTrust check in that regard, not requiring any user intervention. Selenium might be a way to enable the Book Tracker to "click through" the Google login process, or there might be a better way.

I think I have already tried Google's OAuth API and I remember that it wasn't enabled for the Google Books backend. (GB has a special, non-public backend that only partner institutions have access to.) But that was a long time ago and things might have changed since then.

adolski commented 11 months ago

The Book Tracker is going to use its own Google account. I've added the credentials to Box: https://uofi.box.com/s/9ag9oh2pax6kun21ukwqgs2mb27kmnf6

After you log into Google using this account, you should be able to access the book inventory URL: https://books.google.com/libraries/UIUC/_all_books

On this page, there is a link above the table to a "text-only version". That is what we want the Book Tracker to download.

The challenge is to programmatically get through the login process.

Part of the solution will involve adding the Google credentials to the Book Tracker's configuration. I guess it would look something like:

google:
  user:     <email>
  password: <password>

This will have to be done in config/credentials/demo.yml.enc, production.yml.enc, and template.yml (where the user/password are left blank--we don't want to commit plaintext passwords to version control!)

To edit the demo & production config, use bin/rails credentials:edit -e <environment>.

Once the credentials are in the configuration, you can access them from Ruby code like: Configuration.instance.google[:user]
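
As a minimal sketch of that lookup, the snippet below parses a YAML fragment shaped like the one above and fails fast when either value is blank (as it would be in template.yml). It uses plain YAML and a hypothetical `google_credentials` helper rather than the Book Tracker's own Configuration class, and the sample values are placeholders.

```ruby
require "yaml"

# Hypothetical helper (not part of the Book Tracker): extracts the Google
# credentials from a parsed config hash and raises if either one is blank.
def google_credentials(config)
  google   = config.fetch(:google) { raise KeyError, "missing google section" }
  user     = google[:user].to_s.strip
  password = google[:password].to_s.strip
  raise ArgumentError, "google.user is blank" if user.empty?
  raise ArgumentError, "google.password is blank" if password.empty?
  { user: user, password: password }
end

# Placeholder values; the real ones live in the encrypted credentials files.
SAMPLE_CONFIG = <<~YAML
  google:
    user:     someone@example.org
    password: not-a-real-password
YAML

config = YAML.safe_load(SAMPLE_CONFIG, symbolize_names: true)
creds  = google_credentials(config)
puts creds[:user]
```

Failing early like this would make a missing or blank credential show up as a clear configuration error instead of a failed login attempt later on.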

gaurijo commented 11 months ago

Confirming I'm able to access the book inventory link now!

I'll look into some more automation tools (besides Selenium) and see what makes the most sense for getting through the login process.

gaurijo commented 11 months ago

Also, is there a way for me to run the Google Books check in development? I can access the book inventory link, and I downloaded the inventory as a text file. Then I tried running bin/rails books:check_google in my terminal and got this error: ArgumentError: missing required parameter params[:key]

In the Google model code, I see that it needs an inventory_key in order to initialize, which is the key (a string) of the text file saved in an S3 bucket. Is there a way I can do this locally for testing?

adolski commented 11 months ago

You can upload the _all_books.txt file to your book-tracker-temp bucket in Minio. You can use its web interface or else a command like:

AWS_ACCESS_KEY_ID=minioadmin AWS_SECRET_ACCESS_KEY=minioadmin aws s3 cp --endpoint-url http://localhost:9000 /path/to/_all_books.txt s3://book-tracker-temp/_all_books.txt

Once it's in the bucket, invoke the task like: bin/rails "books:check_google[_all_books.txt]"
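
If it helps to script those two steps, here is a rough Ruby sketch that only builds the commands; it does not replace the aws CLI. The helper names `minio_upload_command` and `check_google_command` are hypothetical, and the defaults just mirror the local Minio values mentioned above.

```ruby
require "shellwords"

# Hypothetical helper: builds the environment and argument list for the
# `aws s3 cp` upload. By default the object key is the file's base name.
def minio_upload_command(local_path, key: File.basename(local_path),
                         bucket: "book-tracker-temp",
                         endpoint: "http://localhost:9000")
  env  = { "AWS_ACCESS_KEY_ID"     => "minioadmin",
           "AWS_SECRET_ACCESS_KEY" => "minioadmin" }
  argv = ["aws", "s3", "cp", "--endpoint-url", endpoint,
          local_path, "s3://#{bucket}/#{key}"]
  [env, argv]
end

# Hypothetical helper: builds the argument list for the check task. The
# object key goes inside the brackets as the task argument.
def check_google_command(key)
  ["bin/rails", "books:check_google[#{key}]"]
end

env, argv = minio_upload_command("/path/to/_all_books.txt")
puts argv.shelljoin
puts check_google_command("_all_books.txt").shelljoin

# To actually run them (requires the aws CLI and a running Minio):
#   system(env, *argv, exception: true)
#   system(*check_google_command("_all_books.txt"), exception: true)
```

Passing the arguments to `system` as a list avoids shell quoting problems with the square brackets in the task name.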

gaurijo commented 10 months ago

In development.yml, we already have credentials for accessing the Google inventory URL:

google_username: medusauiuc@google.com 
google_oauth_password: <password>

However, the password is different from what is in Box right now (and I think you mentioned that Google OAuth was tried previously).

Should the above be replaced with what's in Box, or should it be a separate configuration?

adolski commented 10 months ago

google_oauth_password is a Google app password that I added back when I was exploring logging in via OAuth. I probably left it in the .yml file by mistake, and it shouldn't be there, unless you can figure out how to get into the Google Books backend using OAuth.

gaurijo commented 10 months ago

(Tagging @srbbins as we were just chatting about this earlier)

I looked into this some more and, unfortunately, based on what I'm reading in the docs for the latest version of the Google Books API, there is currently no way to programmatically fetch a partner institution's digital collections. The API still only supports searching for and retrieving metadata on a book, searching based on a matched query, and managing user-specific bookshelves.

Maybe there's a different Google Books API that supports this, but if so, it is not documented publicly. If that seems unlikely, I think using a browser automation tool is the best way to proceed.
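
For reference, here is a small sketch of what the public API does cover: building a volume search request against the documented `volumes` endpoint. The `volume_search_uri` helper is hypothetical and the query value is a placeholder. Nothing here reaches the partner inventory, which is the gap described above.

```ruby
require "uri"

# Public Google Books API endpoint for searching volumes.
BOOKS_API_BASE = "https://www.googleapis.com/books/v1/volumes"

# Hypothetical helper: builds a search URI for the public volumes endpoint.
# `q` is the search query and `maxResults` limits the page size.
def volume_search_uri(query, max_results: 10)
  uri = URI(BOOKS_API_BASE)
  uri.query = URI.encode_www_form(q: query, maxResults: max_results)
  uri
end

# Placeholder query; this only builds the URI and makes no network request.
puts volume_search_uri("isbn:0000000000")
```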

gaurijo commented 10 months ago

Quick update:

It looks like Google has a lot of blocks in place to prevent logging in via automation tools, so I'm attempting to get things set up using OAuth instead. I'll keep you posted at standup next week.

gaurijo commented 10 months ago

@adolski I've made some decent progress getting OAuth set up, but I'm running into an "Invalid request" error that I can't seem to figure out. I've made a working PR that documents all the steps I took and has more context, for when you have a chance to look it over.
