mendableai / firecrawl

🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.
https://firecrawl.dev
GNU Affero General Public License v3.0

[BUG] The encoding is not correct for some Chinese language sites #547

Open Z-ZHHH opened 3 months ago

Z-ZHHH commented 3 months ago

Describe the Bug

The encoding is not correct for some sites, e.g., http://finance.people.com.cn/n1/2024/0816/c1004-40300019.html

To Reproduce

import requests

url = "http://localhost:3002/v0/scrape"

payload = {
    "url": "http://finance.people.com.cn/n1/2024/0816/c1004-40300019.html",
    "pageOptions": {
        "headers": {},
        "includeHtml": True,
        "includeRawHtml": True,
        "waitFor": 1230
    },
    "extractorOptions": {
        "mode": "markdown",
    },
    "timeout": 1230
}
headers = {
    "Authorization": "Bearer <token>",
    "Content-Type": "application/json"
}

# Send the scrape request to the local Firecrawl instance and print the raw response body.
response = requests.post(url, json=payload, headers=headers)
print(response.text)

Behavior

The Chinese text in the result is not decoded correctly: AIGC��Ȼ���Դ�Ϊ����ѵ��һ���������������ݵķ����

ArthasWhite commented 3 weeks ago

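This is an encoding problem. Some Chinese web pages are encoded in GB2312. firecrawl fetches the page with axios and then uses the string axios returns directly (axios decodes it as UTF-8 automatically), so the text comes out garbled. If you have a way to determine the encoding from the bytes, this can be solved; I have not found a good way yet. If the encoding can be determined, then in apps/api/src/scraper/WebScraper/scrapers/fetch.ts, add responseType: "arraybuffer" to the options of the await axios.get call so that you get the raw bytes, work out which encoding should be used (this is the step I am stuck on), and use iconv-lite to decode the bytes into a string with the matching encoding, GB2312 or UTF-8.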
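The cause and fix described in this comment can be illustrated with a minimal Python sketch (Python is used here only because the reproduction script is in Python; the actual change would be made in TypeScript with axios and iconv-lite). The sample text and the `decode_body` helper are hypothetical and are not part of firecrawl.

```python
# Illustrative only: simulate the raw bytes of a GB2312-encoded page body.
original = "人民网财经频道"
raw = original.encode("gb2312")

# Decoding GB2312 bytes as UTF-8 produces replacement characters (U+FFFD),
# which is the kind of garbling shown in the bug report.
garbled = raw.decode("utf-8", errors="replace")

# Decoding the same bytes with the page's real encoding recovers the text.
recovered = raw.decode("gb2312")


def decode_body(raw_bytes: bytes, encoding: str) -> str:
    """Hypothetical helper: decode raw bytes with a known encoding.

    Falls back to UTF-8 if Python does not recognise the encoding name.
    """
    try:
        return raw_bytes.decode(encoding, errors="replace")
    except LookupError:
        return raw_bytes.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(garbled)
    print(recovered)
```

The key point is that the fetch layer has to keep the bytes (as `responseType: "arraybuffer"` does for axios) and defer decoding until the encoding is known.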

zheng-hongchen commented 4 days ago

I have the same problem. The original page contains: <meta http-equiv="Content-Type" content="text/html; charset=gb2312" />

zheng-hongchen commented 3 days ago

Replying to ArthasWhite's comment above: I did some simple string processing to find the encoding.
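The commenter's code is not shown in the thread. As a rough idea of what such string processing might look like, here is a sketch in Python that searches the first few kilobytes of the raw response for a `<meta ... charset=...>` declaration like the one quoted earlier. The names `sniff_charset` and `decode_html`, the 4096-byte window, and the UTF-8 default are all assumptions made for illustration.

```python
import codecs
import re

# Matches <meta charset="..."> as well as
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
_META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?\s*([\w.:-]+)""", re.IGNORECASE
)


def sniff_charset(raw: bytes, default: str = "utf-8") -> str:
    """Hypothetical helper: guess a page's encoding from its leading bytes."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    match = _META_CHARSET.search(raw[:4096])
    if not match:
        return default
    name = match.group(1).decode("ascii", errors="ignore")
    try:
        # Normalise the label and reject names Python does not know.
        return codecs.lookup(name).name
    except LookupError:
        return default


def decode_html(raw: bytes) -> str:
    """Decode raw HTML bytes using the sniffed encoding."""
    return raw.decode(sniff_charset(raw), errors="replace")


if __name__ == "__main__":
    page = (
        '<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=gb2312" /></head>'
        "<body>人民网</body></html>"
    ).encode("gb2312")
    print(sniff_charset(page))
    print(decode_html(page))
```

A fuller version would probably also check the charset in the HTTP `Content-Type` header before looking at the markup. Browsers treat the `gb2312` label as GBK, so decoding with a superset such as `gb18030` may be more forgiving for pages that declare `gb2312` but contain characters outside it.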