ivanpagac closed this issue 1 year ago.
Have you tried with the latest version?
Yes, I cloned it this morning; same result. Log attached:
[21/May/2020 07:11:01] ** Welcome to SafariBooks! **
[21/May/2020 07:11:01] Logging into Safari Books Online...
[21/May/2020 07:11:07] Successfully authenticated.
[21/May/2020 07:11:07] Retrieving book info...
[21/May/2020 07:11:07] Title: Node.js: Tools & Skills, 2nd Edition
[21/May/2020 07:11:07] Authors: Manjunath M, Jay Raj, Nilson Jacques, Michael Wanyoike, James Hibbard
[21/May/2020 07:11:07] Identifier: 9781098122836
[21/May/2020 07:11:07] ISBN: 9781925836394
[21/May/2020 07:11:07] Publishers: SitePoint
[21/May/2020 07:11:07] Rights: Copyright © SitePoint
[21/May/2020 07:11:07] Description: While there have been quite a few attempts to get JavaScript working as a server-side language, Node.js (frequently just called Node) has been the first environment that's gained any traction. It's now used by companies such as Netflix, Uber and Paypal to power their web apps. Node allows for blazingly fast performance; thanks to its event loop model, common tasks like network connection and database I/O can be executed very quickly indeed. In this book, we'll take a look at a selection of the re...
[21/May/2020 07:11:07] Release Date: 2020-04-24
[21/May/2020 07:11:07] URL: https://learning.oreilly.com/library/view/nodejs-tools/9781098122836/
[21/May/2020 07:11:07] Retrieving book chapters...
[21/May/2020 07:11:08] Output directory: /*************/Books/Node.js Tools _ Skills (9781098122836)
[21/May/2020 07:11:08] Downloading book contents... (9 chapters)
[21/May/2020 07:11:08] Crawler: found a new CSS at https://learning.oreilly.com/library/css/nodejs-tools/9781098122836/Styles/page_styles.css
[21/May/2020 07:11:08] Crawler: found a new CSS at https://learning.oreilly.com/library/css/nodejs-tools/9781098122836/Styles/stylesheet.css
[21/May/2020 07:11:08] Crawler: found a new CSS at https://learning.oreilly.com/static/CACHE/css/output.8054605313ed.css
[21/May/2020 07:11:08] Created: node13-frontmatter.xhtml
[21/May/2020 07:11:09] Created: node13-preface.xhtml
[21/May/2020 07:11:09] Created: node13-ch1.xhtml
[21/May/2020 07:11:09] Created: node13-ch2.xhtml
[21/May/2020 07:11:10] Created: node13-ch3.xhtml
[21/May/2020 07:11:11] Created: node13-ch4.xhtml
[21/May/2020 07:11:11] Parser: book content's corrupted or not present: node13-ch5.html (Chapter 5: Top 5 Developer-friendly Node.js API Frameworks)
[21/May/2020 07:11:11] Last request done:
URL: https://learning.oreilly.com/api/v1/book/9781098122836/chapter-content/Text/node13-ch5.html
DATA: None
OTHERS: {}
Hi @lorenzodifuccia, thanks for the tool.
I have the same problem with another book, Fluent Python.
Here is the log:
[10/Sep/2020 14:06:53] ** Welcome to SafariBooks! **
[10/Sep/2020 14:06:53] Logging into Safari Books Online...
[10/Sep/2020 14:06:58] Successfully authenticated.
[10/Sep/2020 14:06:58] Retrieving book info...
[10/Sep/2020 14:06:58] Title: Fluent Python, 2nd Edition
[10/Sep/2020 14:06:58] Authors: Luciano Ramalho
[10/Sep/2020 14:06:58] Identifier: 9781492056348
[10/Sep/2020 14:06:58] ISBN: 9781492056355
[10/Sep/2020 14:06:58] Publishers: O'Reilly Media, Inc.
[10/Sep/2020 14:06:58] Rights: Copyright © 2021 Luciano Ramalho
[10/Sep/2020 14:06:58] Description: Python’s simplicity lets you become productive quickly, but often this means you aren’t using everything it has to offer. With the updated edition of this hands-on guide, you’ll learn how to write effective, modern Python 3 code by leveraging its best ideas.Don’t waste time bending Python to fit patterns you learned in other languages. Discover and apply idiomatic Python 3 features beyond your past experience. Author Luciano Ramalho guides you through Python’s core language features and librarie...
[10/Sep/2020 14:06:58] Release Date: 2021-07-25
[10/Sep/2020 14:06:58] URL: https://learning.oreilly.com/library/view/fluent-python-2nd/9781492056348/
[10/Sep/2020 14:06:58] Retrieving book chapters...
[10/Sep/2020 14:07:01] Output directory:
/Users/leninluque/safaribooks/Books/Fluent Python 2nd Edition (9781492056348)
[10/Sep/2020 14:07:01] Downloading book contents... (23 chapters)
[10/Sep/2020 14:07:01] Crawler: found a new CSS at https://learning.oreilly.com/library/css/fluent-python-2nd/9781492056348/epub.css
[10/Sep/2020 14:07:01] Crawler: found a new CSS at https://learning.oreilly.com/static/CACHE/css/output.731fc84c4f9a.css
[10/Sep/2020 14:07:01] Created: cover.xhtml
[10/Sep/2020 14:07:01] Created: toc01.xhtml
[10/Sep/2020 14:07:01] Created: titlepage01.xhtml
[10/Sep/2020 14:07:02] Created: copyright-page01.xhtml
[10/Sep/2020 14:07:02] Created: dedication01.xhtml
[10/Sep/2020 14:07:02] Created: preface01.xhtml
[10/Sep/2020 14:07:02] Created: part01.xhtml
[10/Sep/2020 14:07:03] Created: ch01.xhtml
[10/Sep/2020 14:07:03] Created: part02.xhtml
[10/Sep/2020 14:07:03] Created: ch02.xhtml
[10/Sep/2020 14:07:04] Created: ch03.xhtml
[10/Sep/2020 14:07:04] Parser: book content's corrupted or not present: ch04.html (4. Text versus Bytes)
[10/Sep/2020 14:07:04] Last request done:
URL: https://learning.oreilly.com/api/v1/book/9781492056348/chapter-content/ch04.html
DATA: None
OTHERS: {}
200
Connection: keep-alive
Content-Length: 59142
Server: openresty/1.17.8.2
Content-Type: text/html; charset=utf-8
Allow: GET, HEAD, OPTIONS
X-Frame-Options: SAMEORIGIN
ETag: W/"853ff7c0c7c3aa72e3486ea1898ec20e"
Content-Language: en-US
strict-transport-security: "max-age=31536000; includeSubDomains"
x-content-type-options: nosniff
x-xss-protection: 1; mode=block
Content-Encoding: gzip
Cache-Control: s-maxage=31536000
Accept-Ranges: bytes
Date: Thu, 10 Sep 2020 17:07:04 GMT
Via: 1.1 varnish
X-Client-IP: 190.162.8.22
X-Served-By: cache-scl19422-SCL
X-Cache: MISS
X-Cache-Hits: 0
X-Timer: S1599757624.246264,VS0,VE311
Vary: Accept-Encoding
If you have any idea what's happening, maybe I can help you fix it.
I think the page has a lot of images and icons; maybe that's where the problem is.
https://learning.oreilly.com/library/view/advanced-engineering-mathematics/9781284105971/ Also fails to download
Same problem here with Fluent Python 2nd Ed.
Please upgrade lxml to the latest version.
In my case, lxml <= 4.4.2 can't parse HTML content that contains mathematical Unicode characters (https://stackoverflow.com/questions/69334692/lxml-can-not-parse-html-fragment-contains-certain-unicode-character ).
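As a quick sanity check before retrying the download, you can verify which lxml version is installed (a minimal sketch, not part of safaribooks itself):

```python
from lxml import etree

# lxml exposes its version as a tuple of ints, e.g. (4, 9, 3, 0).
print(etree.LXML_VERSION)

# Versions <= 4.4.2 reportedly fail on HTML fragments containing certain
# mathematical Unicode characters, so upgrade if you are at or below that.
if etree.LXML_VERSION <= (4, 4, 2, 0):
    print("Please upgrade: pip install --upgrade lxml")
```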
A dead-ugly workaround is to download the failing file again and then parse it in a slightly different way.
Funnily enough, the object returned by the parser has the wrong type, Element,
and must be converted to an HtmlElement
to match the expectations of the code using it later on. For this I apply fromstring
and tostring
conversions, which is certainly not an efficient approach, but my lxml-fu is simply too weak. In my case this code executes rarely enough and is fast enough that I don't care.
Because the whole thing is so cheesy and I don't even understand the root cause, I don't plan to create an MR. So the next best thing is to provide the patch below. To apply it, store the patch in a file and run git apply <patch file>
in the safaribooks git repo. If the patch fails to apply, consider checking out version af22b43c1 (or a sufficiently compatible version) and trying again.
Limitation: because I use the path /tmp,
the hack will only work on *nix-based systems (incl. Macs), because I didn't bother to use StringIO
or at least the pythonic tempfile module.
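For what it's worth, the /tmp limitation could be avoided with an in-memory buffer. A rough sketch of that variant (the helper name is mine, not from the patch), assuming the raw chapter HTML is already available as a string:

```python
import io

from lxml import etree, html


def reparse_content(html_text):
    # Parse the raw HTML from an in-memory buffer instead of a file
    # under /tmp, so the workaround also runs on Windows.
    tree = etree.parse(io.StringIO(html_text), etree.HTMLParser())
    nodes = tree.xpath("//div[@id='sbo-rt-content']")
    # Same kludge as in the patch: the plain Element is converted into
    # an HtmlElement via a tostring/fromstring round trip.
    return [html.fromstring(etree.tostring(n)) for n in nodes]
```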
diff --git a/safaribooks.py b/safaribooks.py
index 1d23bee..461e2ef 100755
--- a/safaribooks.py
+++ b/safaribooks.py
@@ -605,6 +605,16 @@ class SafariBooks:
return root
+ def download_html_to_file(self, url, file_name):
+ response = self.requests_provider(url)
+ if response == 0 or response.status_code != 200:
+ self.display.exit(
+ "Crawler: error trying to retrieve this page: %s (%s)\n From: %s" %
+ (self.filename, self.chapter_title, url)
+ )
+ with open(file_name, 'w') as file:
+ file.write(response.text)
+
@staticmethod
def url_is_absolute(url):
return bool(urlparse(url).netloc)
@@ -652,17 +662,27 @@ class SafariBooks:
return None
- def parse_html(self, root, first_page=False):
+ def parse_html(self, root, url, first_page=False):
if random() > 0.8:
if len(root.xpath("//div[@class='controls']/a/text()")):
self.display.exit(self.display.api_error(" "))
book_content = root.xpath("//div[@id='sbo-rt-content']")
if not len(book_content):
- self.display.exit(
- "Parser: book content's corrupted or not present: %s (%s)" %
- (self.filename, self.chapter_title)
- )
+ filename = '/tmp/ch.html'
+ self.download_html_to_file(url, filename)
+ parser = etree.HTMLParser()
+ tree = etree.parse(filename, parser)
+ book_content = tree.xpath("//div[@id='sbo-rt-content']")
+ if not len(book_content):
+ self.display.exit(
+ "Parser: book content's corrupted or not present: %s (%s)" %
+ (self.filename, self.chapter_title)
+ )
+ # KLUDGE(KNR): When parsing this way the resulting object has type Element
+ # instead of HtmlElement. So perform a crude conversion into the right type.
+ from lxml.html import fromstring, tostring
+ book_content[0] = html.fromstring(tostring(book_content[0]))
page_css = ""
if len(self.chapter_stylesheets):
@@ -846,7 +867,10 @@ class SafariBooks:
self.display.book_ad_info = 2
else:
- self.save_page_html(self.parse_html(self.get_html(next_chapter["content"]), first_page))
+ chapter_ = next_chapter["content"]
+ html_ = self.get_html(chapter_)
+ parsed_page_ = self.parse_html(html_, chapter_, first_page)
+ self.save_page_html(parsed_page_)
self.display.state(len_books, len_books - len(self.chapters_queue))
This solved it for me.
This works for me. Just one fix to avoid encoding issue:
parser = etree.HTMLParser(encoding='utf8')
You also need to add from_encoding for BeautifulSoup:
tsoup = bs(txt, 'html.parser', from_encoding='utf8')
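A minimal standalone sketch of both encoding fixes together (the byte string is a made-up example; requires lxml and beautifulsoup4):

```python
from bs4 import BeautifulSoup
from lxml import etree

# UTF-8 bytes containing a mathematical character (U+211D, ℝ).
raw = b'<div id="sbo-rt-content">\xe2\x84\x9d</div>'

# lxml: declare the encoding on the parser so multi-byte characters
# are decoded correctly instead of producing mojibake.
parser = etree.HTMLParser(encoding='utf8')
tree = etree.fromstring(raw, parser)
print(tree.findtext('.//div'))  # prints: ℝ

# BeautifulSoup: pass from_encoding when feeding raw bytes.
soup = BeautifulSoup(raw, 'html.parser', from_encoding='utf8')
print(soup.get_text())  # prints: ℝ
```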
Parser: book content's corrupted or not present: node13-ch5.html (Chapter 5: Top 5 Developer-friendly Node.js API Frameworks)
However, I can browse the page in the browser without any problem:
https://learning.oreilly.com/library/view/nodejs-tools/9781098122836/Text/node13-ch5.html