Closed wevanthomas closed 7 years ago
I can see the problem - goodreads have changed their DOM so that our matching doesn't work in an annoyingly trivial way :) This might impact us in multiple places, so I'll leave it a while to see if @szarroug3 is around and has a preference on how to fix this.
Full details below, @wevanthomas you may want to ignore this.
Specifically, we've gone from
<div class="mainContent">
to
<div class="mainContent ">
The test at goodreads.py:525
is currently a bit fragile - as it assumes the entire class will be mainContent
, but extra classes could be added legitimately (mainContent somethingElse
) , or as we've seen they might just add a trailing whitespace to a single element list.
It would be better to look for elements where one of the class attributes is exactly mainContent
but AFAIK there isn't a native xquery way to do this. The best I can find is to change the first part of the test to change from this:
xpath('//div[@class="mainContent"]
to this:
xpath('//div[contains(concat(" ", @class, " "), " mainContent ")]
so the old line 525 now looks like this:
return self._page_source.xpath('//div[contains(concat(" ", @class, " "), " mainContent ")]//div[@id="imagecol"]//img[@id="coverImage"]')[0].get('src')
This (messier but more accurate) test will match a class that looks like any of these:
That seems to get things working, though not checked other things aren't silently failing. I can see at least one example where we are already matching for unexpected whitespace (line 313), so probably want to make this fix in a few places.
I've actually come across this problem myself. I fixed it by removing trailing whitespaces after any " so "mainContent " would be changed to "mainContent".. i didn't think about 'mainContent somethingElse' or 'another mainContent'. I've merged into my the parsing_fix branch.. you can see it in #92. How likely is it that you think that 'mainContent somethingElse' or 'another mainContent' would be valid. I actually don't know html that well.
Hi! I'm not convinced by the fix in #92, it seems as likely to break something as fix something :) Won't this strip spaces in text content, for example? What about html that uses single instead of double quotes?
In my experience of scraping webpages: anything can change at any time and in ways that make no sense. So I think it is entirely possible they'll add a second class to the list, so as it isn't hard to deal with it we may as well fix it now.
Of course, they can break things in far larger ways, but such is the joy of scraping!
Hey! it's been forever! also, yeah i didn't think it through.. good catch.. do you want to make the changes and just submit them to the branch? i don't have any changes that aren't already there.
I've not going time tonight, got a meeting later I need to get ready for. If you want to fix it quickly you could just back out your second commit, make the one line change I mentioned above, and then we can fix all the other places at a later date (ie. where it could be broken in future but currently is) - that would certainly get the released version fixed quickest.
Okay will do in a few hours
On Mon, Feb 6, 2017, 11:00 AM Anthony Toole notifications@github.com wrote:
I've not going time tonight, got a meeting later I need to get ready for. If you want to fix it quickly you could just back out your second commit, make the one line change I mentioned above, and then we can fix all the other places at a later date (ie. where it could be broken in future but currently is) - that would certainly get the released version fixed quickest.
— You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub https://github.com/szarroug3/X-Ray_Calibre_Plugin/issues/93#issuecomment-277744388, or mute the thread https://github.com/notifications/unsubscribe-auth/AIEjUXVuS29MNxTV9ICWhIjXQo9RSpCrks5rZ1HFgaJpZM4L30TP .
Done, @stoduk can you review my pull request?
Will take a look this evening - that ok?
On 6 February 2017 at 20:45, Samreen Zarroug notifications@github.com wrote:
Done, @stoduk https://github.com/stoduk can you review my pull request?
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/szarroug3/X-Ray_Calibre_Plugin/issues/93#issuecomment-277807755, or mute the thread https://github.com/notifications/unsubscribe-auth/AFCvIij5ifQWjStL2NvxSU5M6hKy3_H_ks5rZ4Z3gaJpZM4L30TP .
Sure
On Tue, Feb 7, 2017, 6:11 AM Anthony Toole notifications@github.com wrote:
Will take a look this evening - that ok?
On 6 February 2017 at 20:45, Samreen Zarroug notifications@github.com wrote:
Done, @stoduk https://github.com/stoduk can you review my pull request?
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub < https://github.com/szarroug3/X-Ray_Calibre_Plugin/issues/93#issuecomment-277807755 , or mute the thread < https://github.com/notifications/unsubscribe-auth/AFCvIij5ifQWjStL2NvxSU5M6hKy3_H_ks5rZ4Z3gaJpZM4L30TP
.
— You are receiving this because you were assigned.
Reply to this email directly, view it on GitHub https://github.com/szarroug3/X-Ray_Calibre_Plugin/issues/93#issuecomment-277981083, or mute the thread https://github.com/notifications/unsubscribe-auth/AIEjUbFkYA6kzCiYinjq_EYCw-sISqU2ks5raF-EgaJpZM4L30TP .
Starting job: Creating Files
02-06-2017 00:16:31 Initializing... 02-06-2017 00:16:31 Harry Potter and the Chamber of Secrets - J. K. Rowling 02-06-2017 00:16:31 Parsing Goodreads data... Job: "Creating Files" failed with error: Traceback (most recent call last): File "site-packages\calibre\gui2\threaded_jobs.py", line 84, in start_work File "calibre_plugins.xray_creator.lib.xray_creator", line 284, in create_files_event File "calibre_plugins.xray_creator.lib.book", line 210, in create_files_event File "calibre_plugins.xray_creator.lib.book", line 371, in _parse_goodreads_data File "calibre_plugins.xray_creator.lib.goodreads_parser", line 58, in parse File "calibre_plugins.xray_creator.lib.goodreads_parser", line 93, in _get_non_xray File "calibre_plugins.xray_creator.lib.goodreads_parser", line 525, in _get_book_image_url IndexError: list index out of range
Called with args: (,) {u'notifications': , u'abort': , u'log': }