lorenzodifuccia / safaribooks

Download and generate EPUB of your favorite books from O'Reilly Learning (aka Safari Books Online) library.
Do What The F*ck You Want To Public License

Crawler: error trying to parse this page: c02.xhtml #335

Open ivanpagac opened 1 year ago

ivanpagac commented 1 year ago
[13/Jan/2023 12:17:49] ** Welcome to SafariBooks! **
[13/Jan/2023 12:17:49] Logging into Safari Books Online...
[13/Jan/2023 12:17:52] Successfully authenticated.
[13/Jan/2023 12:17:52] Retrieving book info...
[13/Jan/2023 12:17:52] Title: The Rust Programming Language, 2nd Edition
[13/Jan/2023 12:17:52] Authors: Steve Klabnik, Carol Nichols
[13/Jan/2023 12:17:52] Identifier: 9781098156817
[13/Jan/2023 12:17:52] ISBN: 9781098156800
[13/Jan/2023 12:17:52] Publishers: No Starch Press
[13/Jan/2023 12:17:52] Rights: 
[13/Jan/2023 12:17:52] Description: The Rust Programming Language, 2nd Edition is the official guide to Rust 2021: an open source systems programming language that will help you write faster, more reliable software. Rust provides control of low-level details along with high-level ergonomics, allowing you to improve productivity and eliminate the hassle traditionally associated with low-level languages.Klabnik and Nichols, alumni of the Rust Core Team, share their knowledge to help you get the most out of Rust's features so that ...
[13/Jan/2023 12:17:52] Release Date: 2023-02-28
[13/Jan/2023 12:17:52] URL: https://learning.oreilly.com/library/view/the-rust-programming/9781098156817/
[13/Jan/2023 12:17:52] Retrieving book chapters...
[13/Jan/2023 12:17:54] Output directory:
    /Users/ivan/Projects/safaribooks/Books/The Rust Programming Language 2nd Edition (9781098156817)
[13/Jan/2023 12:17:54] Book directory already exists: /Users/ivan/Projects/safaribooks/Books/The Rust Programming Language 2nd Edition (9781098156817)
[13/Jan/2023 12:17:54] CSSs directory already exists: /Users/ivan/Projects/safaribooks/Books/The Rust Programming Language 2nd Edition (9781098156817)/OEBPS/Styles
[13/Jan/2023 12:17:54] Images directory already exists: /Users/ivan/Projects/safaribooks/Books/The Rust Programming Language 2nd Edition (9781098156817)/OEBPS/Images
[13/Jan/2023 12:17:54] Downloading book contents... (35 chapters)
[13/Jan/2023 12:17:54] File `cover.xhtml` already exists.
    If you want to download again all the book,
    please delete the output directory '/Users/ivan/Projects/safaribooks/Books/The Rust Programming Language 2nd Edition (9781098156817)' and restart the program.
[13/Jan/2023 12:17:54] Document is empty
[13/Jan/2023 12:17:54] Crawler: error trying to parse this page: c02.xhtml (Chapter 2: Programming a Guessing Game)
    From: https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781098156817/files/c02.xhtml
[13/Jan/2023 12:17:54] Last request done:
    URL: https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781098156817/files/c02.xhtml
    DATA: None
    OTHERS: {}

    200

However, the URL itself is correct: I can display the page in the browser.

https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781098156817/files/c02.xhtml

azec-pdx commented 1 year ago

@lorenzodifuccia and @ivanpagac,

I believe later versions of Python introduced a set of problems that lxml hasn't addressed yet. I am watching the following:

  1. https://bugs.launchpad.net/lxml/+bug/1949271
  2. https://github.com/Donohue/medium-to-jekyll/pull/4/files
  3. https://github.com/Donohue/medium-to-jekyll/issues/3

Independently of that upstream lxml problem, I found an issue in this project's handling of emojis and other special Unicode characters when it asks lxml to parse the document. It occurs even on the Python versions with which lxml otherwise behaves well.

I have addressed the issue in https://github.com/azec-pdx/safaribooks/tree/apiv2 . I confirmed positive results by testing on the books with IDs 9781098156817 and 9781617297274, both of which contain emojis and other offending characters. However, I could only get the parsing right with Python 3.9.16; with Python 3.9.10 it is still broken (I believe because of the additional issue linked above).
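
For readers hitting the same error, here is a minimal sketch of one commonly suggested workaround: handing lxml UTF-8 bytes with an explicit parser encoding instead of a Python `str`. It is not the change made in the apiv2 branch, which is not reproduced here. The helper names are hypothetical, and treating characters outside the Basic Multilingual Plane as the "offending" ones is an assumption made only for diagnosis.

```python
import re

# Assumption: the "offending" characters are those outside the Basic
# Multilingual Plane (most emoji live there). This is a diagnostic guess,
# not something confirmed by the thread.
_NON_BMP = re.compile("[\U00010000-\U0010FFFF]")


def find_non_bmp(text):
    """Return (offset, character, code point) for each non-BMP character."""
    return [
        (m.start(), m.group(), f"U+{ord(m.group()):04X}")
        for m in _NON_BMP.finditer(text)
    ]


def parse_as_utf8_bytes(text):
    """Parse HTML by passing lxml encoded bytes instead of a str.

    Hypothetical helper; requires the third-party lxml package.
    """
    # Imported lazily so the scan above still works without lxml installed.
    from lxml import html

    parser = html.HTMLParser(encoding="utf-8")
    return html.fromstring(text.encode("utf-8"), parser=parser)


if __name__ == "__main__":
    sample = "<p>Hello \U0001F600 Rust</p>"
    # Lists where candidate problem characters sit in the chapter text.
    print(find_non_bmp(sample))
```

Running `find_non_bmp` over a saved chapter file may help confirm whether a failing page actually contains such characters before trying a different parsing path.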

azec-pdx commented 1 year ago

I've seen lxml behave differently on the same Python version between macOS on an Apple M1 chip and macOS on an Intel chip. On M1 macOS, it errors as described above, and my branch now handles that; on Intel macOS, it never errors out.
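
Since the behavior seems to depend on the CPU architecture and the Python patch version, it may help to include those details when reporting. The following is a small sketch using only the standard library; the function name `environment_report` is hypothetical.

```python
import platform
from importlib import metadata


def environment_report():
    """Collect the environment details that differed between reports here."""
    try:
        lxml_version = metadata.version("lxml")
    except metadata.PackageNotFoundError:
        lxml_version = "not installed"
    return {
        "python": platform.python_version(),  # full version, including the patch number
        "machine": platform.machine(),        # typically "arm64" on Apple silicon, "x86_64" on Intel
        "system": platform.system(),          # "Darwin" on macOS
        "lxml": lxml_version,
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```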

jrwagz commented 1 year ago

@azec-pdx, I'm using an M-series macOS device, and the code on your branch (commit https://github.com/azec-pdx/safaribooks/commit/a2be61e7b968bcdfc5a537d0492e4e7e9c04e8f6) got me around this same problem. Thank you!

trsudarshan commented 1 year ago

@azec-pdx thank you. Is there a version of lxml (keeping Python fixed at 3.9.x) that avoids this error? If so, pinning that lxml version in requirements.txt may let users work around the problem locally until a formal PR resolving it gets merged.

dreampuf commented 1 year ago

#347 fixed this issue.