adolski closed this issue 1 year ago
I was browsing the backup of Book Tracker issues from JIRA and it turns out that this happened before, on 8/27/2015:
The HTML on the HathiTrust website page that lists the HathiFiles changed, breaking the code that extracts the name of the latest HathiFile.
That's probably what happened here.
@adolski Looking at the HathiTrust website, there are files that start with hathi_upd_ as well as hathi_full_. The code for scraping data from the files seems to look only for files starting with hathi_full_, so I think that could be what's breaking:
node = page.css('div#content-area table.sticky-enabled a').
  select{ |h| h.text.start_with?('hathi_full_') }.
  sort{ |x,y| x.text <=> y.text }.reverse[0]
I know the checks/tasks only work in production, so is there a way I can test this check locally?
You can run the check locally using the bin/rails books:check_hathitrust command.
The hathi_full_ file is the one we want. I think what happened is that HathiTrust recently redesigned their website, changing the HTML structure, which broke the CSS query that you're seeing there. (I'm attaching a screenshot showing that the div#content-area is no longer there.) So the goal is ultimately to find the new URL of the hathi_full_ file.
I'd like to replace the current CSS query with the one the browser gives me when I right-click the link element and choose Copy selector:
page.css('#content > div.twocol > div.twocol-main > div > div.btable-wrapper > table > tbody > tr:nth-child(72) > td:nth-child(1) > a')
This still gives the same error for the check, and I'm not entirely sure whether the problem is the syntax (using '>') and/or the actual elements/tags.
The tr:nth-child(72) > td:nth-child(1) > a chain also seems unnecessary because of the .select method that comes right after this line of code:
select{|h| h.text.start_with?('hathi_full_')}.
sort{ |x,y| x.text <=> y.text }.reverse[0]
Does this line of thinking seem on track or am I missing something more obvious?
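To make the select/sort step above concrete, here is a minimal sketch that works on plain link texts instead of Nokogiri nodes. The method name latest_full_hathifile and the sample file names are made up for illustration. It assumes the names carry a date stamp that sorts lexicographically, so taking the maximum is equivalent to .sort{...}.reverse[0].

```ruby
# Hypothetical helper, not part of Book Tracker: given the text of every
# link in the table, pick the newest hathi_full_ file name.
def latest_full_hathifile(link_texts)
  link_texts
    .select { |text| text.start_with?('hathi_full_') } # ignore hathi_upd_ files
    .max                                               # same as sort.reverse[0]
end

# Illustrative names only; the real listing may look different.
names = [
  'hathi_upd_20240102.txt.gz',
  'hathi_full_20231201.txt.gz',
  'hathi_full_20240101.txt.gz'
]

puts latest_full_hathifile(names).inspect
puts latest_full_hathifile([]).inspect
```

Note that the result is nil when nothing matches. If the CSS query returns no links, the original chain would also end in nil, which seems consistent with the "undefined method `[]' for nil:NilClass" error, although that is only a guess from the message.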
I'm looking at the Hathitrust.get_hathifile() method and the first thing that jumps out at me is that it's trying to access https://www.hathitrust.org/hathifiles, but when I open that in a browser, it redirects to https://www.hathitrust.org/member-libraries/resources-for-librarians/data-resources/hathifiles/. And it turns out that Net::HTTP doesn't automatically follow redirects. So, that's the first problem. Some possible solutions are:
- Point the code at the new URL (although https://www.hathitrust.org/hathifiles seems more permanent, assuming that it will always redirect to the right URL)
- Have the code follow the Location response header if it encounters a redirect (3xx) status code

The second problem is the CSS query. As written now, it's supposed to return all the anchor elements in the table. One way to test queries easily is to open the JavaScript console in your browser (with the Hathifiles page loaded) and run this at the prompt:
document.body.querySelectorAll('my query');
The way to tell when you've got it is when it returns a NodeList full of <a> elements.
This link might help: CSS selectors
Re: the lack of following redirects, I added a helper method to tell Net::HTTP to load based on the Location response header if there is a redirect.
Thank you for sharing the resources for the CSS selectors; I'll take a look through those and use the JavaScript console in my browser.
@adolski I'm hitting a wall and wanted to give a quick update. I found the correct query that returns a NodeList full of <a> elements, using the JavaScript console in my browser:
const elements = document.body.querySelectorAll('.btable-wrapper table.btable tbody tr td a')
Screenshot for reference:
However, I'm having trouble applying this to extract the same elements with Nokogiri. Looking into it, I know it should return a Nokogiri::XML::NodeSet object, but I'm getting stuck on manipulating this to extract the same elements I got in my console/browser.
elements = page.css('.btable-wrapper table.btable tbody tr td a')
puts "Node list: #{elements}"
My puts statement in this case just returns a blank space.
That's weird. When I run the same code, my puts prints out a whole bunch of Nokogiri::XML::Elements, suggesting that the CSS query is working.
My only idea would be to double-check that the response.body passed to Nokogiri::HTML() contains the string contents of the Hathifiles page.
Oh, yup! My response.body inside Nokogiri::HTML() is an empty string for some reason. I wonder if it's because of how I set up the helper method to handle redirects.
When I check in the Rails console, I see that Net::HTTP.get_response(uri) returns #<Net::HTTPFound 302 Found readbody=true>
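A 302 response like that normally carries no page content, which would explain the empty body if the response being parsed is the redirect itself rather than the page it points to. Below is a minimal sketch of a redirect-following helper. The name fetch_following_redirects, the limit of 5, and the injectable fetcher parameter are assumptions for illustration, not the helper actually written for Book Tracker.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: request `url` and keep following the Location header
# until a non-redirect response arrives. `fetcher` defaults to
# Net::HTTP.get_response and can be swapped out to test without a network.
def fetch_following_redirects(url, limit: 5, fetcher: Net::HTTP.method(:get_response))
  uri = URI(url)
  limit.times do
    response = fetcher.call(uri)
    # Anything that is not a 3xx response is returned to the caller as-is.
    return response unless response.is_a?(Net::HTTPRedirection)

    location = response['location']
    raise "Redirect without a Location header from #{uri}" if location.nil?

    # Location may be relative, so resolve it against the current URI.
    uri = URI.join(uri.to_s, location)
  end
  raise "Gave up after #{limit} redirects for #{url}"
end
```

With something like this, the body to hand to Nokogiri::HTML() would be fetch_following_redirects('https://www.hathitrust.org/hathifiles').body, so the parser sees the final page rather than the 302.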
The HathiTrust check has started failing with the following message:
HathiTrust check failed: undefined method `[]' for nil:NilClass
No modifications have been made to Book Tracker since this failure started happening, so something probably changed on the HathiTrust side.
(Reported by @henryborchers)