Closed AtropsCooper closed 1 year ago
This is not an effective fix as it introduces new problems. Not all requests return HTML documents, so this approach may lead to errors or exceptions.
This error typically indicates that the HTML document content obtained from resp.text is empty or incomplete, causing the fromstring() method to fail to parse it as a valid document object and raise a parsing error.
The possible cause could be that the response returned from the server is empty or incomplete. When encountering this error, one should try to resend the request to obtain a complete response.
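The retry suggestion above could be sketched roughly as follows. This is only an illustration, not code from this project: fetch_non_empty and its fetch argument are hypothetical names, and fetch stands in for whatever callable sends the request and returns the response text.

```python
import time
from typing import Callable


def fetch_non_empty(fetch: Callable[[], str], attempts: int = 3, delay: float = 0.0) -> str:
    """Call `fetch` until it returns non-blank text, at most `attempts` times.

    `fetch` is a hypothetical callable that performs the request and
    returns the response body as a string.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        text = fetch()
        # Treat whitespace-only bodies as empty as well.
        if text and text.strip():
            return text
        # Wait before the next try, but not after the last one.
        if attempt < attempts and delay > 0:
            time.sleep(delay)
    raise RuntimeError(f"response body was empty after {attempts} attempts")
```

Note that this only helps when the body really is empty or truncated; it does nothing for a complete document that the parser still rejects.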
@NekoAria Thank you for your reply. It's really a problem when the response is not an HTML document.
> The possible cause could be that the response returned from the server is empty or incomplete. When encountering this error, one should try to resend the request to obtain a complete response.
But as I stated before, this error occurs even with a non-empty document that can be displayed in a browser correctly.
This happens because resp.text cannot be parsed by a utf-8 parser.
I'm not sure why resp.text is not utf-8 encoded, but I can tell you that I'm using Poetry (version 1.4.0) on macOS 13.3 (22E252). I was on commit e53a6b3 when I hit this problem.
Let me show an example error report with the Yandex module below. (This also happens with other modules)
Maybe type-checking before encoding the request is a good choice.
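One way such a type check might look is to inspect the Content-Type header before handing the body to the HTML parser. This is a minimal sketch under assumptions: is_html_content_type and HTML_TYPES are hypothetical names, and the header value is assumed to be available as a plain string (for example from the response headers).

```python
from typing import Optional

# Media types treated as HTML here; this set is an assumption for illustration.
HTML_TYPES = {"text/html", "application/xhtml+xml"}


def is_html_content_type(content_type: Optional[str]) -> bool:
    """Return True if a Content-Type header value names an HTML media type."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in HTML_TYPES
```

A caller could then skip the encoding and parsing step whenever this returns False, which would avoid the problem with non-HTML responses.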
It looks like the version of the lxml library in your system environment is not up to date.
You can check it by using pip show lxml.
Please upgrade the lxml library to the latest version by using pip install -U lxml, and then try running demo_yandex.py again.
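If it helps when reporting back, the installed version and platform can also be collected from inside the same Python environment that runs the demo. This is a rough sketch; environment_report is a hypothetical helper name, and it only uses the standard library.

```python
import platform
from importlib import metadata
from typing import Dict, Optional


def environment_report(package: str = "lxml") -> Dict[str, Optional[str]]:
    """Collect the installed version of `package` and basic platform details."""
    try:
        version: Optional[str] = metadata.version(package)
    except metadata.PackageNotFoundError:
        # The package is not installed in this environment.
        version = None
    return {
        "package": package,
        "version": version,
        "python": platform.python_version(),
        "system": platform.system(),    # e.g. "Darwin" on macOS
        "machine": platform.machine(),  # e.g. "arm64" or "x86_64"
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Running it through Poetry's environment (rather than the system Python) makes sure the reported version is the one the demo actually imports.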
By the way, if you want to develop a reverse image search bot for QQ, you can take a look at my project YetAnotherPicSearch, which is based on nonebot and this project.
My lxml version is 4.9.2, which is the latest version as of now.
I don't think this is a dependency error, since all dependencies are installed by Poetry from pyproject.toml.
Thank you for your suggestion of YetAnotherPicSearch. It will be a nice reference for me.
I think this should be an lxml issue, most likely related to your macOS, but I'm not sure.
You can try recompiling lxml locally using the following command to see if the problem is resolved:
pip uninstall lxml
ARCHFLAGS="-arch arm64" pip install lxml --compile --no-cache-dir
Thank you for your reply.
I'm now sure that the problem is related to macOS, since the same code works well on Debian 11 and Windows 10.
However, even the recompiled lxml package doesn't work on macOS, even though the system default encoding is utf-8.
The encoding of the response from httpx is also utf-8.
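For anyone who wants to compare their own setup, the interpreter-side encodings can be checked with a few standard-library calls. This is only a sketch (encoding_report is a hypothetical name); the response-side encoding has to be read separately from the httpx response object, via its encoding attribute.

```python
import locale
import sys
from typing import Dict


def encoding_report() -> Dict[str, str]:
    """Return the encodings Python uses by default in this environment."""
    return {
        # Default str/bytes conversion encoding of the interpreter.
        "default": sys.getdefaultencoding(),
        # Encoding used for file names.
        "filesystem": sys.getfilesystemencoding(),
        # Locale-preferred encoding, without changing the current locale.
        "locale": locale.getpreferredencoding(False),
    }


if __name__ == "__main__":
    for name, value in encoding_report().items():
        print(f"{name}: {value}")
```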
The exception below would be raised under certain circumstances even with a non-empty document.
lxml.etree.ParserError: Document is empty
This can be fixed by simply introducing encode('utf-8')
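As a rough illustration of the idea, the text can be encoded to UTF-8 bytes before it reaches the parser, so the parser receives bytes rather than a str. The sketch below is not the actual patch: parse_as_utf8_bytes is a hypothetical helper, and the parser is passed in as an argument (in this project it would presumably be lxml's fromstring) so the example does not depend on lxml being installed.

```python
from typing import Any, Callable


def parse_as_utf8_bytes(text: str, parser: Callable[[bytes], Any]) -> Any:
    """Encode `text` as UTF-8 and hand the bytes to `parser`.

    `parser` is any callable that accepts bytes, for example
    lxml.html.fromstring (an assumption about how this would be wired up).
    """
    if not text.strip():
        # A genuinely empty body cannot be fixed by encoding it.
        raise ValueError("response text is empty; nothing to parse")
    return parser(text.encode("utf-8"))
```

Hypothetical usage would be parse_as_utf8_bytes(resp.text, fromstring). As noted in the discussion, this should only be applied to responses that are actually HTML.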