lorenzodifuccia / safaribooks

Download and generate EPUB of your favorite books from O'Reilly Learning (aka Safari Books Online) library.

Parser: book content's corrupted or not present: 9781098122836 #208

Closed ivanpagac closed 1 year ago

ivanpagac commented 4 years ago

[#] Parser: book content's corrupted or not present: node13-ch5.html (Chapter 5: Top 5 Developer-friendly Node.js API Frameworks)

However, I can browse the page in a browser without any problem:

https://learning.oreilly.com/library/view/nodejs-tools/9781098122836/Text/node13-ch5.html

lorenzodifuccia commented 4 years ago

Have you tried with the latest version?

ivanpagac commented 4 years ago

Yes, I cloned it again this morning and got the same result. Log attached:

[21/May/2020 07:11:01] ** Welcome to SafariBooks! **
[21/May/2020 07:11:01] Logging into Safari Books Online...
[21/May/2020 07:11:07] Successfully authenticated.
[21/May/2020 07:11:07] Retrieving book info...
[21/May/2020 07:11:07] Title: Node.js: Tools & Skills, 2nd Edition
[21/May/2020 07:11:07] Authors: Manjunath M, Jay Raj, Nilson Jacques, Michael Wanyoike, James Hibbard
[21/May/2020 07:11:07] Identifier: 9781098122836
[21/May/2020 07:11:07] ISBN: 9781925836394
[21/May/2020 07:11:07] Publishers: SitePoint
[21/May/2020 07:11:07] Rights: Copyright © SitePoint
[21/May/2020 07:11:07] Description: While there have been quite a few attempts to get JavaScript working as a server-side language, Node.js (frequently just called Node) has been the first environment that's gained any traction. It's now used by companies such as Netflix, Uber and Paypal to power their web apps. Node allows for blazingly fast performance; thanks to its event loop model, common tasks like network connection and database I/O can be executed very quickly indeed.In this book, we'll take a look at a selection of the re...
[21/May/2020 07:11:07] Release Date: 2020-04-24
[21/May/2020 07:11:07] URL: https://learning.oreilly.com/library/view/nodejs-tools/9781098122836/
[21/May/2020 07:11:07] Retrieving book chapters...
[21/May/2020 07:11:08] Output directory: /*************/Books/Node.js Tools _ Skills (9781098122836)
[21/May/2020 07:11:08] Downloading book contents... (9 chapters)
[21/May/2020 07:11:08] Crawler: found a new CSS at https://learning.oreilly.com/library/css/nodejs-tools/9781098122836/Styles/page_styles.css
[21/May/2020 07:11:08] Crawler: found a new CSS at https://learning.oreilly.com/library/css/nodejs-tools/9781098122836/Styles/stylesheet.css
[21/May/2020 07:11:08] Crawler: found a new CSS at https://learning.oreilly.com/static/CACHE/css/output.8054605313ed.css
[21/May/2020 07:11:08] Created: node13-frontmatter.xhtml
[21/May/2020 07:11:09] Created: node13-preface.xhtml
[21/May/2020 07:11:09] Created: node13-ch1.xhtml
[21/May/2020 07:11:09] Created: node13-ch2.xhtml
[21/May/2020 07:11:10] Created: node13-ch3.xhtml
[21/May/2020 07:11:11] Created: node13-ch4.xhtml
[21/May/2020 07:11:11] Parser: book content's corrupted or not present: node13-ch5.html (Chapter 5: Top 5 Developer-friendly Node.js API Frameworks)
[21/May/2020 07:11:11] Last request done:
    URL: https://learning.oreilly.com/api/v1/book/9781098122836/chapter-content/Text/node13-ch5.html
    DATA: None
    OTHERS: {}
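
For what it's worth, one way to check whether the API endpoint itself returns the chapter (rather than the parser being at fault) is to fetch it directly with the session cookies safaribooks saves to cookies.json. This is only a hypothetical diagnostic sketch, assuming cookies.json holds the plain name-to-value mapping the tool writes out:

import json
import requests

# Hypothetical diagnostic, not part of safaribooks: fetch the failing chapter
# with the cookies the tool stored in cookies.json and see what comes back.
with open("cookies.json") as f:
    cookies = json.load(f)

url = ("https://learning.oreilly.com/api/v1/book/9781098122836/"
       "chapter-content/Text/node13-ch5.html")
response = requests.get(url, cookies=cookies)
print(response.status_code)               # 200 means the API served the page
print("sbo-rt-content" in response.text)  # True means the content div is there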

xleninx commented 4 years ago

Hi @lorenzodifuccia thanks for the tool.

I have the same problem with another book, Fluent Python.

Here is the log:

[10/Sep/2020 14:06:53] ** Welcome to SafariBooks! **
[10/Sep/2020 14:06:53] Logging into Safari Books Online...
[10/Sep/2020 14:06:58] Successfully authenticated.
[10/Sep/2020 14:06:58] Retrieving book info...
[10/Sep/2020 14:06:58] Title: Fluent Python, 2nd Edition
[10/Sep/2020 14:06:58] Authors: Luciano Ramalho
[10/Sep/2020 14:06:58] Identifier: 9781492056348
[10/Sep/2020 14:06:58] ISBN: 9781492056355
[10/Sep/2020 14:06:58] Publishers: O'Reilly Media, Inc.
[10/Sep/2020 14:06:58] Rights: Copyright © 2021 Luciano Ramalho
[10/Sep/2020 14:06:58] Description: Python’s simplicity lets you become productive quickly, but often this means you aren’t using everything it has to offer. With the updated edition of this hands-on guide, you’ll learn how to write effective, modern Python 3 code by leveraging its best ideas.Don’t waste time bending Python to fit patterns you learned in other languages. Discover and apply idiomatic Python 3 features beyond your past experience. Author Luciano Ramalho guides you through Python’s core language features and librarie...
[10/Sep/2020 14:06:58] Release Date: 2021-07-25
[10/Sep/2020 14:06:58] URL: https://learning.oreilly.com/library/view/fluent-python-2nd/9781492056348/
[10/Sep/2020 14:06:58] Retrieving book chapters...
[10/Sep/2020 14:07:01] Output directory:
    /Users/leninluque/safaribooks/Books/Fluent Python 2nd Edition (9781492056348)
[10/Sep/2020 14:07:01] Downloading book contents... (23 chapters)
[10/Sep/2020 14:07:01] Crawler: found a new CSS at https://learning.oreilly.com/library/css/fluent-python-2nd/9781492056348/epub.css
[10/Sep/2020 14:07:01] Crawler: found a new CSS at https://learning.oreilly.com/static/CACHE/css/output.731fc84c4f9a.css
[10/Sep/2020 14:07:01] Created: cover.xhtml
[10/Sep/2020 14:07:01] Created: toc01.xhtml
[10/Sep/2020 14:07:01] Created: titlepage01.xhtml
[10/Sep/2020 14:07:02] Created: copyright-page01.xhtml
[10/Sep/2020 14:07:02] Created: dedication01.xhtml
[10/Sep/2020 14:07:02] Created: preface01.xhtml
[10/Sep/2020 14:07:02] Created: part01.xhtml
[10/Sep/2020 14:07:03] Created: ch01.xhtml
[10/Sep/2020 14:07:03] Created: part02.xhtml
[10/Sep/2020 14:07:03] Created: ch02.xhtml
[10/Sep/2020 14:07:04] Created: ch03.xhtml
[10/Sep/2020 14:07:04] Parser: book content's corrupted or not present: ch04.html (4. Text versus Bytes)
[10/Sep/2020 14:07:04] Last request done:
    URL: https://learning.oreilly.com/api/v1/book/9781492056348/chapter-content/ch04.html
    DATA: None
    OTHERS: {}

    200
    Connection: keep-alive
    Content-Length: 59142
    Server: openresty/1.17.8.2
    Content-Type: text/html; charset=utf-8
    Allow: GET, HEAD, OPTIONS
    X-Frame-Options: SAMEORIGIN
    ETag: W/"853ff7c0c7c3aa72e3486ea1898ec20e"
    Content-Language: en-US
    strict-transport-security: "max-age=31536000; includeSubDomains"
    x-content-type-options: nosniff
    x-xss-protection: 1; mode=block
    Content-Encoding: gzip
    Cache-Control: s-maxage=31536000
    Accept-Ranges: bytes
    Date: Thu, 10 Sep 2020 17:07:04 GMT
    Via: 1.1 varnish
    X-Client-IP: 190.162.8.22
    X-Served-By: cache-scl19422-SCL
    X-Cache: MISS
    X-Cache-Hits: 0
    X-Timer: S1599757624.246264,VS0,VE311
    Vary: Accept-Encoding

If you have any idea of what's happening, maybe I can help you fix it.

I think the page has a lot of images and icons; maybe that's where the problem is.

Bomberdash commented 3 years ago

https://learning.oreilly.com/library/view/advanced-engineering-mathematics/9781284105971/ also fails to download.

abreumatheus commented 3 years ago

Same problem here with Fluent Python 2nd Ed.

glasslion commented 3 years ago

Please upgrade lxml to the latest version.

In my case, lxml<=4.4.2 can't parse HTML content that contains mathematical Unicode characters (https://stackoverflow.com/questions/69334692/lxml-can-not-parse-html-fragment-contains-certain-unicode-character).
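
For reference, a minimal repro sketch of that behaviour (the HTML fragment below is made up, not taken from the book): on an old lxml/libxml2 build the target div reportedly gets dropped or mangled, while a recent lxml parses it fine.

from lxml import html

# Fragment containing an astral-plane (mathematical double-struck) character.
fragment = "<div id='sbo-rt-content'><p>Set symbol: \U0001d54a</p></div>"
root = html.fromstring(fragment)

# On lxml <= 4.4.2 this lookup reportedly comes back empty or corrupted
# (see the StackOverflow question above); on a current lxml it finds the div.
print(root.xpath("//div[@id='sbo-rt-content']"))
print(html.tostring(root, encoding="unicode"))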

rknuus commented 2 years ago

A dead-ugly workaround is to download the failing file again and then use a slightly different way to parse it.

Funnily enough, the object returned by the parser has the wrong type (Element) and must be converted to an HtmlElement to match the expectations of the code that uses it later on. For this I apply fromstring and tostring conversions, which is certainly not an efficient approach, but my lxml-fu is simply too weak. In my case this code executes rarely enough, and is fast enough, that I don't care.

Because the whole thing is so cheesy and I don't even understand the root cause, I don't plan to create an MR. So the next best thing is to provide the patch below. To apply it, store the patch in a file and run git apply <patch file> in the safaribooks git repo. If the patch fails to apply, consider checking out version af22b43c1 (or a sufficiently compatible one) and trying again.

Limitation: because I use the path /tmp, the hack will only work on *nix-based systems (incl. Macs), since I didn't bother to use StringIO or at least Python's tempfile module.

diff --git a/safaribooks.py b/safaribooks.py
index 1d23bee..461e2ef 100755
--- a/safaribooks.py
+++ b/safaribooks.py
@@ -605,6 +605,16 @@ class SafariBooks:

         return root

+    def download_html_to_file(self, url, file_name):
+        response = self.requests_provider(url)
+        if response == 0 or response.status_code != 200:
+            self.display.exit(
+                "Crawler: error trying to retrieve this page: %s (%s)\n    From: %s" %
+                (self.filename, self.chapter_title, url)
+            )
+        with open(file_name, 'w') as file:
+            file.write(response.text)
+
     @staticmethod
     def url_is_absolute(url):
         return bool(urlparse(url).netloc)
@@ -652,17 +662,27 @@ class SafariBooks:

         return None

-    def parse_html(self, root, first_page=False):
+    def parse_html(self, root, url, first_page=False):
         if random() > 0.8:
             if len(root.xpath("//div[@class='controls']/a/text()")):
                 self.display.exit(self.display.api_error(" "))

         book_content = root.xpath("//div[@id='sbo-rt-content']")
         if not len(book_content):
-            self.display.exit(
-                "Parser: book content's corrupted or not present: %s (%s)" %
-                (self.filename, self.chapter_title)
-            )
+            filename = '/tmp/ch.html'
+            self.download_html_to_file(url, filename)
+            parser = etree.HTMLParser()
+            tree = etree.parse(filename, parser)
+            book_content = tree.xpath("//div[@id='sbo-rt-content']")
+            if not len(book_content):
+                self.display.exit(
+                    "Parser: book content's corrupted or not present: %s (%s)" %
+                    (self.filename, self.chapter_title)
+                )
+            # KLUDGE(KNR): When parsing this way the resulting object has type Element
+            # instead of HtmlElement. So perform a crude conversion into the right type.
+            from lxml.html import fromstring, tostring
+            book_content[0] = html.fromstring(tostring(book_content[0]))

         page_css = ""
         if len(self.chapter_stylesheets):
@@ -846,7 +867,10 @@ class SafariBooks:
                     self.display.book_ad_info = 2

             else:
-                self.save_page_html(self.parse_html(self.get_html(next_chapter["content"]), first_page))
+                chapter_ = next_chapter["content"]
+                html_ = self.get_html(chapter_)
+                parsed_page_ = self.parse_html(html_, chapter_, first_page)
+                self.save_page_html(parsed_page_)

             self.display.state(len_books, len_books - len(self.chapters_queue))
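
Side note on the /tmp limitation mentioned above: the same fallback could also be done without any temporary file by parsing the re-downloaded HTML in memory. A sketch only (the method name parse_chapter_fallback is made up; it reuses the same helpers as the patch):

from lxml import etree, html

def parse_chapter_fallback(self, url):
    # Re-download the chapter and parse it in memory, so no /tmp
    # (or any temporary file) is needed; works on Windows too.
    response = self.requests_provider(url)
    if response == 0 or response.status_code != 200:
        self.display.exit(
            "Crawler: error trying to retrieve this page: %s (%s)\n    From: %s" %
            (self.filename, self.chapter_title, url)
        )

    root = etree.fromstring(response.text.encode("utf8"),
                            etree.HTMLParser(encoding="utf8"))
    book_content = root.xpath("//div[@id='sbo-rt-content']")

    # Convert Element -> HtmlElement, as the downstream code expects.
    return [html.fromstring(etree.tostring(el)) for el in book_content]
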
jvmachadorj commented 2 years ago

chapter_ = next_chapter["content"]
html_ = self.get_html(chapter_)
parsed_page_ = self.parse_html(html_, chapter_, first_page)
self.save_page_html(parsed_page_)

This solved it for me.

astkaasa commented 2 years ago

> A dead-ugly workaround is to download the failing file again and then use a slightly different way to parse it. […] (quoting rknuus's patch above)

This works for me. Just one fix to avoid an encoding issue:

parser = etree.HTMLParser(encoding='utf8')

astkaasa commented 2 years ago

This works for me. Just one fix to avoid an encoding issue:

parser = etree.HTMLParser(encoding='utf8')

You also need to add from_encoding for BeautifulSoup:

tsoup = bs(txt, 'html.parser', from_encoding='utf8')