kitUIN / PicImageSearch

Aggregator for Reverse Image Search API
https://pic-image-search.kituin.fun/
MIT License
385 stars, 46 forks

fix(network): `resp.text` utf-8 encoding #60

Closed. AtropsCooper closed this 1 year ago

AtropsCooper commented 1 year ago

The exception below would be raised under certain circumstances even with a non-empty document.

lxml.etree.ParserError: Document is empty

This can be fixed by simply calling encode('utf-8') on resp.text before parsing.

NekoAria commented 1 year ago

This is not an effective fix as it introduces new problems. Not all requests return HTML documents, so this approach may lead to errors or exceptions.

This error typically indicates that the HTML document content obtained from resp.text is empty or incomplete, causing the fromstring() method to fail in parsing it as a valid document object and leading to a parsing error.

The possible cause could be that the response returned from the server is empty or incomplete. When encountering this error, one should try to resend the request to obtain a complete response.
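The retry advice above can be sketched as follows. This is a minimal illustration and not code from PicImageSearch: `fetch_text` is a hypothetical async callable standing in for whatever performs the request, and the attempt count and delay are arbitrary assumptions.

```python
import asyncio
from typing import Awaitable, Callable


async def fetch_non_empty(
    fetch_text: Callable[[], Awaitable[str]],  # hypothetical: performs one request
    attempts: int = 3,  # assumed retry budget
    delay: float = 0.0,  # assumed pause between attempts, in seconds
) -> str:
    """Call fetch_text until it returns a non-blank body or attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        text = await fetch_text()
        if text.strip():
            return text
        if attempt < attempts:
            await asyncio.sleep(delay)
    raise RuntimeError(f"empty response after {attempts} attempts")
```

Note that a retry like this only helps when the body really is empty; it would not change the outcome for a non-empty body that the parser still rejects.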

AtropsCooper commented 1 year ago

@NekoAria Thank you for your reply. It's really a problem when the response is not an HTML document.

> The possible cause could be that the response returned from the server is empty or incomplete. When encountering this error, one should try to resend the request to obtain a complete response.

But as I stated before, this error occurs even with a non-empty document that displays correctly in a browser. It happens because resp.text cannot be parsed by a UTF-8 parser. I'm not sure why resp.text is not UTF-8 encoded, but I can tell you that I'm using Poetry (version 1.4.0) on macOS 13.3 (22E252). I was on commit e53a6b3 when I met this problem.

Let me show an example error report with the Yandex module below. (This also happens with other modules)

Error report:

```
File "test.py", line 19, in test
    resp = await yandex.search(url=url)
                                   └ 'https://raw.githubusercontent.com/kitUIN/PicImageSearch/main/demo/images/test06.jpg'

File "/PicImageSearch/yandex.py", line 60, in search
    return YandexResponse(resp_text, resp_url)
                                     └ 'https://yandex.com/images/search?rpt=imageview&url=https%3A%2F%2Fraw.githubusercontent.com%2FkitUIN%2FPicImageSearch%2Fmain%...
```

resp.text when this occurs (OneDrive URL): https://1drv.ms/t/s!AieNcmyB2ZXtgbIyC4Zy0M6uGWifsw?e=s0T8xk

Maybe type-checking the response before encoding it is a good choice.
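One way to read this suggestion is to encode the body only when the response declares itself as HTML. The sketch below uses only the standard library; `to_parser_input` and the `HTML_TYPES` set are hypothetical names chosen for illustration and are not part of PicImageSearch.

```python
from typing import Optional, Union

# Assumed set of media types that should reach an HTML parser as bytes.
HTML_TYPES = {"text/html", "application/xhtml+xml"}


def to_parser_input(text: str, content_type: Optional[str]) -> Union[str, bytes]:
    """Return UTF-8 bytes for HTML responses; leave other bodies untouched."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    media_type = (content_type or "").split(";", 1)[0].strip().lower()
    if media_type in HTML_TYPES:
        return text.encode("utf-8")
    return text
```

With a check like this, a JSON response would keep its `str` form, while an HTML body would be handed to the parser as bytes.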

NekoAria commented 1 year ago

It looks like the version of the lxml library in your system environment is not up to date. You can check it by using `pip show lxml`.

Please upgrade the lxml library to the latest version by using `pip install -U lxml`, and then try running demo_yandex.py again.
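The same version check can also be done from inside the Python environment that actually runs the code, which helps when several environments coexist (for example a Poetry virtualenv next to the system Python). This is a small sketch using the standard library's `importlib.metadata`; `installed_version` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("lxml:", installed_version("lxml"))
```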

By the way, if you want to develop a reverse image search bot for QQ, you can take a look at my project YetAnotherPicSearch, which is based on nonebot and this project.

AtropsCooper commented 1 year ago

My lxml version is 4.9.2, the latest version as of now. I don't think this is a dependency error, since all dependencies are installed by Poetry from pyproject.toml.

Thank you for your suggestion of YetAnotherPicSearch. It will be a nice reference for me.

NekoAria commented 1 year ago

I think this should be an lxml issue, most likely related to your macOS, but I'm not sure.

You can try recompiling lxml locally using the following command to see if the problem is resolved:

pip uninstall lxml
ARCHFLAGS="-arch arm64" pip install lxml --compile --no-cache-dir

AtropsCooper commented 1 year ago

Thank you for your reply.

I'm sure now that the problem is related to macOS, since the same code works well on Debian 11 and Windows 10.

However, even the recompiled lxml package doesn't work on macOS, although the system default encoding is UTF-8.

The encoding of the response from httpx is also utf-8.
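For reference, the environment encodings mentioned above can be read directly from the interpreter. The sketch below is only one assumed way to collect them; `encoding_report` is a hypothetical helper, and it covers the Python side only, not the response.

```python
import locale
import sys


def encoding_report() -> dict:
    """Collect the encodings Python reports for the current environment."""
    return {
        "default": sys.getdefaultencoding(),  # implicit str <-> bytes conversions
        "filesystem": sys.getfilesystemencoding(),  # file names
        "locale": locale.getpreferredencoding(False),  # locale-dependent text I/O
    }


if __name__ == "__main__":
    for name, value in encoding_report().items():
        print(f"{name}: {value}")
```

The response side would be checked separately, for example through the `encoding` attribute of the httpx response object.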