Open ivanpagac opened 1 year ago
@lorenzodifuccia and @ivanpagac,
I believe later versions of Python introduced a set of problems that lxml hasn't addressed yet. I am watching the following:
Regardless of this external change in lxml, I found an issue in this project with how it handles emojis and other special Unicode characters when asking lxml to parse the document, even on the versions of Python with which lxml behaves well.
I have addressed the issue in https://github.com/azec-pdx/safaribooks/tree/apiv2 .
I was able to confirm positive results by testing on books with IDs 9781098156817 and 9781617297274, which both have some emojis and other offending characters. However, I was only able to get the parsing right with Python 3.9.16; with Python 3.9.10 it is still broken (I believe because of the additional issue linked above).
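For anyone who wants to experiment locally, here is a minimal sketch of one common way to keep emojis intact when handing a page to lxml: pass UTF-8 bytes and declare the encoding on the parser, rather than passing an already decoded string. This is not necessarily what the apiv2 branch does. The helper names `as_utf8_bytes` and `parse_page` are hypothetical, and the sketch assumes the pages are UTF-8 and that the project parses them with `lxml.html`.

```python
def as_utf8_bytes(markup):
    """Return markup as bytes; str input is encoded as UTF-8.

    Bytes input is returned unchanged and is assumed to be UTF-8 already.
    """
    if isinstance(markup, bytes):
        return markup
    return markup.encode("utf-8")


def parse_page(markup):
    """Parse a page with lxml from bytes, with the encoding stated explicitly.

    Hypothetical helper for illustration, not the project's actual code.
    """
    from lxml import html  # third-party; imported lazily so the helper above works without it

    parser = html.HTMLParser(encoding="utf-8")
    return html.fromstring(as_utf8_bytes(markup), parser=parser)
```

With lxml installed, `parse_page("<html><body><p>Deploy \U0001F680 done</p></body></html>")` should return the root element, and comparing its text content against the input is a quick way to see whether a given Python and lxml combination mangles the emoji.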
I've seen different lxml behavior on the same Python version between macOS on an Apple M1 chip and macOS on an Intel chip. On M1 macOS, it errors as described above and my branch now handles that, but on Intel macOS it never errors out.
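Since the behavior seems to depend on the Python version and the CPU architecture, it may help if everyone reporting results includes the same environment details. A small sketch, assuming nothing beyond the standard library plus the version constants that `lxml.etree` exposes; the function name `environment_report` is hypothetical.

```python
import platform
import sys


def environment_report():
    """Collect the environment details that appear to matter in this thread."""
    report = {
        "python": platform.python_version(),
        # Typically "arm64" on Apple silicon and "x86_64" on Intel Macs
        # (or under Rosetta).
        "machine": platform.machine(),
        "platform": sys.platform,
    }
    try:
        from lxml import etree
    except ImportError:
        # lxml is not installed in this interpreter.
        report["lxml"] = None
    else:
        report["lxml"] = ".".join(map(str, etree.LXML_VERSION))
        # libxml2 version loaded at runtime vs. the one lxml was built against.
        report["libxml2_runtime"] = ".".join(map(str, etree.LIBXML_VERSION))
        report["libxml2_compiled"] = ".".join(map(str, etree.LIBXML_COMPILED_VERSION))
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Pasting this output next to a success or failure report would make it easier to tell whether the difference comes from Python, lxml, or the underlying libxml2 build.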
@azec-pdx, I'm using an M-series macOS device, and with the code on your branch (commit https://github.com/azec-pdx/safaribooks/commit/a2be61e7b968bcdfc5a537d0492e4e7e9c04e8f6) I was able to get around this same problem. Thank you!
@azec-pdx thank you. Is there a version of lxml (keeping Python fixed at 3.9.x) where this error can be avoided? If so, pinning that version of lxml in requirements.txt may let users work around this problem locally until a formal PR resolving it gets merged.
However, the URL itself is correct; I can display the page in the browser:
https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781098156817/files/c02.xhtml