Images broken - Githubissues

lukasthomas commented 4 years ago

I have the latest version and for some books the downloaded images are broken.

Example: 9783110495362 or 9783110434422

TheSnoozer commented 4 years ago

Hello, thanks for reporting this issue. I can indeed reproduce it with book-id 9783110495362. The issue is quite subtle, but essentially the image crawler attempts to download the image from the wrong location:

[03/May/2020 01:26:42] Created: unit01.xhtml
[03/May/2020 01:26:43] Crawler: found a new image at https://learning.oreilly.com/library/view/rechnerarchitektur-2nd-edition/images/page005_1.jpg
[03/May/2020 01:26:43] Crawler: found a new image at https://learning.oreilly.com/library/view/rechnerarchitektur-2nd-edition/images/page005_2.jpg

One can observe that the downloader actually creates a file with the matching file name in the Images folder; however, all files there are only 953 bytes and are in fact PNG files (not JPEGs, as the file extension suggests).

The reason seems to be simple: the correct URL is

https://learning.oreilly.com/library/view/rechnerarchitektur-2nd-edition/9783110495362/images/page005_1.jpg

(notice that the book-id is missing from the link we are trying to download). Fetching the URL as provided by the crawler redirects to the default cover image (https://learning.oreilly.com/static/images/default_cover.c956912c958b.png), which is a 953-byte PNG file. So the crawler downloads the default cover over and over again.
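To spot this kind of silent failure, one could compare the file extension with the actual leading bytes of the download. The following is only a minimal sketch and not part of safaribooks; the helper names `sniff_image_type` and `extension_matches_content` are hypothetical.

```python
# Hypothetical helpers (not from safaribooks) that detect a mismatch
# between a file's extension and its real image format.

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file
JPEG_SIGNATURE = b"\xff\xd8\xff"      # first 3 bytes of every JPEG file


def sniff_image_type(data: bytes) -> str:
    """Guess the image type from the leading bytes, ignoring the file name."""
    if data.startswith(PNG_SIGNATURE):
        return "png"
    if data.startswith(JPEG_SIGNATURE):
        return "jpeg"
    return "unknown"


def extension_matches_content(filename: str, data: bytes) -> bool:
    """Return True when a .jpg/.jpeg/.png name agrees with the actual bytes."""
    ext = filename.rsplit(".", 1)[-1].lower()
    expected = {"jpg": "jpeg", "jpeg": "jpeg", "png": "png"}.get(ext)
    return expected is not None and sniff_image_type(data) == expected
```

With such a check, a file named page005_1.jpg that starts with the PNG signature would be flagged as a mismatch, which is what the redirected default cover looks like on disk.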

lukasthomas commented 4 years ago

I tried this:

link = urljoin(self.base_url + self.book_id + "/", link)

and now it seems to be working!
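A small standalone sketch of why this change helps, using only `urllib.parse.urljoin` from the standard library. The values of `base_url` and `link` below are assumptions chosen to reproduce the URLs from the log above; they are not taken from the safaribooks source.

```python
from urllib.parse import urljoin

# Assumed values, picked to match the URLs quoted in this thread.
base_url = "https://learning.oreilly.com/library/view/rechnerarchitektur-2nd-edition/"
book_id = "9783110495362"
link = "images/page005_1.jpg"  # relative image link as found in the chapter

# Joining against the base URL alone loses the book-id.
broken = urljoin(base_url, link)

# Appending the book-id and a trailing slash keeps it in the path.
fixed = urljoin(base_url + book_id + "/", link)

print(broken)
print(fixed)
```

The trailing "/" matters: without it, `urljoin` treats the book-id as a file name and replaces it with the relative link, which would produce the broken URL again. Note also that links starting with "/" or a full scheme ignore the base path entirely, so this sketch only covers relative image links.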

lorenzodifuccia / safaribooks

Images broken #212