medusa-project / book-tracker

Medusa Book Tracker

HathiTrust check is failing #9

Closed adolski closed 1 year ago

adolski commented 1 year ago

The HathiTrust check has started failing with the following message:

HathiTrust check failed: undefined method `[]' for nil:NilClass

No modifications have been made to Book Tracker since this failure started happening, so something probably changed on the HathiTrust side.

(Reported by @henryborchers)

adolski commented 1 year ago

I was browsing the backup of Book Tracker issues from JIRA and it turns out that this happened before, on 8/27/2015:

The HTML on the HathiTrust website page that lists the HathiFiles changed, breaking the code that extracts the name of the latest HathiFile.

That's probably what happened here.

gaurijo commented 1 year ago

@adolski Looking at the HathiTrust website, there are files that start with hathi_upd_ as well as hathi_full_.

The code that scrapes the file list seems to look only for files starting with hathi_full_, so I think that could be what's breaking:

node = page.css('div#content-area table.sticky-enabled a').
        select{ |h| h.text.start_with?('hathi_full_') }.sort{ |x,y| x.text <=> y.text }.reverse[0]

I know the checks/tasks only work in production, so is there a way I can test this check locally?

adolski commented 1 year ago

You can run the check locally using the bin/rails books:check_hathitrust command.

The hathi_full_ file is the one we want. I think what happened is that HathiTrust recently redesigned their website, changing the HTML structure, which broke the CSS query that you're seeing there. (I'm attaching a screenshot showing that the div#content-area is no longer there.) So the goal is ultimately to find the new URL of the hathi_full_ file.

Screenshot 2023-09-07 at 1 42 32 PM
gaurijo commented 1 year ago

I'd like to replace the current CSS query with the selector I get from the browser's "Copy selector" option when I right-click the link:

page.css('#content > div.twocol > div.twocol-main > div > div.btable-wrapper > table > tbody > tr:nth-child(72) > td:nth-child(1) > a')

This still gives the same error for the check, and I'm not entirely sure if it's the syntax (using '>') and/or the actual elements/tags.

The tr:nth-child(72) > td:nth-child(1) > a chain also seems like it's not needed because of the .select method that comes right after this line of code:

select{|h| h.text.start_with?('hathi_full_')}.
                  sort{ |x,y| x.text <=> y.text }.reverse[0]
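To make the filter-and-sort step concrete, here is a minimal sketch that works on plain strings instead of Nokogiri nodes. The file names and the latest_full_file helper are made up for illustration and are not part of Book Tracker. It assumes the names embed a sortable date, so the name that sorts last is the newest.

```ruby
# Sample link texts; these file names are invented for illustration only.
link_texts = [
  'hathi_upd_20230905.txt.gz',
  'hathi_full_20230801.txt.gz',
  'hathi_full_20230901.txt.gz',
  'hathi_field_list.txt'
]

# Hypothetical helper: keep only the full dumps and return the name that
# sorts last. Returns nil when nothing matches.
def latest_full_file(names)
  names.select { |n| n.start_with?('hathi_full_') }.max
end

puts latest_full_file(link_texts)
```

Note that the helper returns nil when no name matches, just as .reverse[0] does on an empty array. If the CSS query matches no links, any later [] call on that result would raise the same kind of "undefined method `[]' for nil:NilClass" error reported above.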

Does this line of thinking seem on track or am I missing something more obvious?

adolski commented 1 year ago

I'm looking at the Hathitrust.get_hathifile() method and the first thing that jumps out at me is that it's trying to access https://www.hathitrust.org/hathifiles, but when I open that in a browser, it redirects to https://www.hathitrust.org/member-libraries/resources-for-librarians/data-resources/hathifiles/. And it turns out that Net::HTTP doesn't automatically follow redirects. So, that's the first problem. Some possible solutions are:

  1. Change the URL in the code--but there is the risk that the URL will change again, and https://www.hathitrust.org/hathifiles seems more permanent, assuming that it will always redirect to the right URL
  2. Add some code to tell Net::HTTP to load the URL in the Location response header if it encounters a redirect (3xx) status code
  3. Switch to a different HTTP client library that does follow redirects--this would probably be overkill but also a good learning experience
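Option 2 could look roughly like the sketch below. The method name, the limit parameter, and the injectable fetcher are all assumptions made for illustration; this is not the helper that ended up in Book Tracker.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: fetches a URI and follows up to `limit` redirects
# by loading the URL given in the Location response header.
# `fetcher` defaults to Net::HTTP.get_response and can be swapped out so
# the logic can be exercised without network access.
def get_following_redirects(uri, limit: 5, fetcher: Net::HTTP.method(:get_response))
  raise ArgumentError, 'too many redirects' if limit.negative?

  response = fetcher.call(uri)
  location = response['location']

  # Anything that is not a 3xx with a Location header is returned as-is.
  return response unless response.is_a?(Net::HTTPRedirection) && location

  # Location may be relative, so resolve it against the current URI.
  next_uri = URI.join(uri.to_s, location)
  get_following_redirects(next_uri, limit: limit - 1, fetcher: fetcher)
end
```

Calling get_following_redirects(URI('https://www.hathitrust.org/hathifiles')) should then return the response for the final page rather than the 3xx response. The limit guards against redirect loops.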

The second problem is the CSS query. As written now, it's supposed to return all the anchor elements in the table. One way to test queries easily is to open the JavaScript console in your browser (with the Hathifiles page loaded) and run this at the prompt:

document.body.querySelectorAll('my query');

The way to tell when you've got it is when it returns a NodeList full of <a> elements.

This link might help: CSS selectors

gaurijo commented 1 year ago

Re: the lack of following redirects, I added a helper method to tell Net::HTTP to load based on the Location response header if there is a redirect.

Thank you for sharing the resources for the CSS selectors; I'll take a look through those and use the JavaScript console in my browser.

gaurijo commented 1 year ago

@adolski I'm hitting a wall and wanted to give a quick update. Using the JavaScript console in my browser, I found a query that returns a NodeList full of <a> elements:

const elements = document.body.querySelectorAll('.btable-wrapper table.btable tbody tr td a')

Screenshot for reference:

Image

However, I'm having trouble applying this to extract the same elements with Nokogiri. Looking into it, I know it should return a Nokogiri::XML::NodeSet object, but I'm getting stuck on manipulating this to extract the same elements I got in my browser console.

elements = page.css('.btable-wrapper table.btable tbody tr td a')

puts "Node list: #{elements}"

My puts statement in this case just returns a blank space.

adolski commented 1 year ago

That's weird. When I run the same code, my puts prints out a whole bunch of Nokogiri::XML::Elements, suggesting that the CSS query is working.

My only idea would be to double-check that the response.body passed to Nokogiri::HTML() contains the string contents of the Hathifiles page.
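One way to make that check explicit is a small guard that runs before the body is handed to Nokogiri::HTML(). This is only a sketch: html_body! is a hypothetical name, and it assumes nothing about the response beyond the code and body methods that Net::HTTP responses provide.

```ruby
# Hypothetical guard (not Book Tracker code): returns the body only for a
# 2xx response with non-empty content, and raises a descriptive error
# otherwise instead of letting an empty page reach the parser.
def html_body!(response)
  unless response.code.to_s.start_with?('2')
    raise "unexpected HTTP status #{response.code} (redirect not followed?)"
  end

  body = response.body.to_s
  raise 'response body is empty' if body.strip.empty?

  body
end
```

With a guard like this, an unfollowed redirect would fail with a message naming the status code rather than with a nil error further down.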

gaurijo commented 1 year ago

Oh, yup! My response.body inside Nokogiri::HTML() is an empty string for some reason. I wonder if it's because of how I set up the helper method to handle redirects.

When I check in the Rails console, I see that Net::HTTP.get_response(uri) returns #<Net::HTTPFound 302 Found readbody=true>