lukasthomas closed this issue 4 years ago
Hello, thanks for reporting this issue. I can indeed reproduce this with book ID 9783110495362. The issue is quite subtle, but essentially the image crawler attempts to download the image from the wrong location:
[03/May/2020 01:26:42] Created: unit01.xhtml
[03/May/2020 01:26:43] Crawler: found a new image at https://learning.oreilly.com/library/view/rechnerarchitektur-2nd-edition/images/page005_1.jpg
[03/May/2020 01:26:43] Crawler: found a new image at https://learning.oreilly.com/library/view/rechnerarchitektur-2nd-edition/images/page005_2.jpg
One can observe that the downloader does create a file with the matching name in the Images folder; however, every file there is only 953 bytes and is in fact a PNG file (not a JPEG, as the file extension suggests).
The reason seems to be simple: the correct URL is
https://learning.oreilly.com/library/view/rechnerarchitektur-2nd-edition/9783110495362/images/page005_1.jpg
(notice that the book ID is missing from the link we are trying to download). Fetching the URL as provided by the crawler redirects to the default cover image (https://learning.oreilly.com/static/images/default_cover.c956912c958b.png), which is a 953-byte PNG file. So the crawler downloads the default cover over and over again.
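For anyone who wants to check whether their own download is affected, here is a minimal sketch that flags files named `.jpg` whose contents are actually PNG data. The 8-byte PNG signature is standard; the helper names `is_png` and `is_suspect_image` are made up for this example and are not part of the downloader.

```python
# Standard 8-byte signature that starts every PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def is_png(data: bytes) -> bool:
    """Return True if the raw bytes start with the PNG signature."""
    return data.startswith(PNG_SIGNATURE)


def is_suspect_image(filename: str, data: bytes) -> bool:
    """Return True if a file named .jpg/.jpeg actually holds PNG data.

    Hypothetical helper: in this issue, such a mismatch indicates that
    the default cover was saved instead of the real image.
    """
    has_jpeg_name = filename.lower().endswith((".jpg", ".jpeg"))
    return has_jpeg_name and is_png(data)
```

Reading each file in the Images folder in binary mode and passing its name and contents to `is_suspect_image` should list the broken downloads.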
I tried this:
link = urljoin(self.base_url + self.book_id + "/", link)
and now it seems to be working!
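To illustrate why that change helps, here is a small standalone sketch of how `urljoin` resolves the image link. It assumes that `self.base_url` ends with the title slug and a trailing slash and that the crawler sees a relative link such as `images/page005_1.jpg`; both are guesses based on the log output above, not something I checked in the code.

```python
from urllib.parse import urljoin

# Assumed values, reconstructed from the log lines in this thread.
base_url = "https://learning.oreilly.com/library/view/rechnerarchitektur-2nd-edition/"
book_id = "9783110495362"
link = "images/page005_1.jpg"

# Joining against the slug-only base leaves out the book ID.
broken = urljoin(base_url, link)

# Appending the book ID plus a trailing slash makes it part of the base path.
fixed = urljoin(base_url + book_id + "/", link)

print(broken)
print(fixed)
```

The trailing `"/"` matters: without it, `urljoin` treats the book ID as a file name and replaces it with the first segment of the relative link.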
I have the latest version and for some books the downloaded images are broken.
Example: 9783110495362 or 9783110434422