[BUG] Garbled text (mojibake) on Chinese websites

lexiforest / curl_cffi

Python binding for the curl-impersonate fork via cffi. An HTTP client that can impersonate browser TLS/JA3/HTTP2 fingerprints.

https://curl-cffi.readthedocs.io/

MIT License

[BUG] Garbled text (mojibake) on Chinese websites #336

Closed by zyoung1212 4 months ago

zyoung1212 commented 4 months ago

curl_cffi version 0.6.4

Is there a general solution for garbled text on Chinese websites, i.e. one that does not require specifying the encoding manually?

zyoung1212 commented 4 months ago

My current workaround is to run requests' encoding detection first, but that means sending the request twice:

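import requests
from curl_cffi import requests as cffi_requests
from fastapi import HTTPException
from fastapi.responses import HTMLResponse

# `app` and `UrlRequest` (with `url` and `proxy` fields) are defined elsewhere.
@app.post("/fetch_html", response_class=HTMLResponse)
def fetch_html(request: UrlRequest):
    try:
        # First request: only used to let requests guess the encoding
        request_res = requests.get(request.url, proxies={"http": request.proxy})
        new_encoding = request_res.apparent_encoding
        # Second request: the actual fetch through curl_cffi
        response = cffi_requests.get(str(request.url), proxy=request.proxy, impersonate="chrome120")
        # Check the response status code
        if response.status_code != 200:
            raise HTTPException(status_code=response.status_code, detail="Failed to fetch the URL")
        response.encoding = new_encoding
        # Return the HTML content
        return HTMLResponse(content=response.text, status_code=200)
    except HTTPException:
        # Re-raise so the upstream status code is not rewritten to 500
        raise
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"An error occurred: {str(e)}")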

zyoung1212 commented 4 months ago

Could requests' apparent_encoding be ported over to curl_cffi?

perklet commented 4 months ago

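requests uses the chardet library to detect the encoding automatically, but in my experience that library sometimes hangs and can be very slow. You could also use cchardet, which is faster, but it does not seem to support Python 3.10+. I'm not inclined to add either of them as a dependency.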

>>> from curl_cffi import requests
>>> r = requests.get("https://news.c<DELETEME>readers.net/china/2024/06/19/2743922.html")
>>> import chardet
>>> chardet.detect(r.content)
{'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}
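Note that the detector reports GB2312 here while the page itself declares gbk. GB2312 is a subset of GBK, which in turn is a subset of GB18030, so decoding with the narrower label can fail on characters that only exist in the larger sets. A minimal sketch of one way to handle that, assuming it is acceptable to widen any GB-family label to gb18030 before decoding; `decode_detected` is a hypothetical helper and not part of curl_cffi, requests, or chardet:

```python
# Hypothetical helper, not an API of curl_cffi, requests, or chardet.
_GB_FAMILY = {"gb2312", "gbk", "gb18030"}


def decode_detected(content: bytes, detected: str) -> str:
    """Decode `content`, widening any GB-family label to gb18030."""
    name = detected.lower().replace("-", "").replace("_", "")
    encoding = "gb18030" if name in _GB_FAMILY else detected
    # errors="replace" substitutes U+FFFD for undecodable bytes instead of raising
    return content.decode(encoding, errors="replace")


# "镕" exists in GBK but not in GB2312, so a strict gb2312 decode would fail here.
raw = "朱镕基".encode("gbk")
print(decode_detected(raw, "GB2312"))
```

With a curl_cffi response this would be called as `decode_detected(r.content, detected_name)`, where `detected_name` comes from whichever detector is in use.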

As for this particular site, its encoding information is in the HTML head, so adding detection for that is something we could consider.

<meta http-equiv="Content-Type" content="text/html; charset=gbk" />
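Reading that declaration does not need a third-party dependency. A minimal sketch using only the standard library, assuming a regular expression over the first few kilobytes is good enough (it is not a full HTML parser); `sniff_meta_charset` and the 4096-byte limit are hypothetical choices for illustration, not curl_cffi behavior:

```python
import re
from typing import Optional

# Matches the charset in both <meta charset="gbk"> and
# <meta http-equiv="Content-Type" content="text/html; charset=gbk" />.
_META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)",
    re.IGNORECASE,
)


def sniff_meta_charset(content: bytes, limit: int = 4096) -> Optional[str]:
    """Return the lowercased charset declared in a <meta> tag, or None."""
    match = _META_CHARSET.search(content[:limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


sample = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=gbk" /></head></html>'
print(sniff_meta_charset(sample))
```

The result could then be assigned to `response.encoding` before reading `response.text`, the same way the workaround earlier in this thread does.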

PS: the site you are working with is quite risky. If you are inside the Great Firewall, please look after your own safety.

zyoung1212 commented 4 months ago

Thank you very much! Following your reminder, I have removed the link.

zyoung1212 commented 4 months ago

cchardet

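About the meta tag: I remember running into a Chinese website a long time ago whose meta tag declared gbk but whose content actually had to be decoded as utf-8 to display correctly. I have forgotten which site it was. That is why I have not considered this approach for now.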

perklet commented 4 months ago

Then that is a problem with the website itself. It is like signaling left and then turning right.
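For callers who still want to tolerate a mislabeled page like that, one option is to check whether the bytes are valid UTF-8 before trusting the declared charset, since GBK-encoded Chinese text almost never passes a strict UTF-8 decode. A minimal sketch under that assumption; `decode_html` is a hypothetical helper, and the gb18030 fallback is a choice aimed at Chinese sites rather than anything curl_cffi does:

```python
from typing import Optional


def decode_html(content: bytes, declared: Optional[str] = None) -> str:
    """Prefer UTF-8 when the bytes validate as UTF-8, else use the declared charset."""
    try:
        # Strict decode: raises if the bytes are not well-formed UTF-8
        return content.decode("utf-8")
    except UnicodeDecodeError:
        pass
    try:
        return content.decode(declared or "gb18030", errors="replace")
    except LookupError:
        # The declared charset name is unknown to Python's codec registry
        return content.decode("gb18030", errors="replace")


# Page declares gbk but is really UTF-8: the UTF-8 check wins.
print(decode_html("中文网站".encode("utf-8"), declared="gbk"))
# Page declares gbk and really is GBK: the declared charset is used.
print(decode_html("中文网站".encode("gbk"), declared="gbk"))
```

Checking UTF-8 first matters because decoding UTF-8 bytes as GBK may succeed without an error and silently produce garbled text, so trying the declared charset first would not catch the mismatch.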