taoqf / node-html-parser

A very fast HTML parser, generating a simplified DOM, with basic element query support.
MIT License

Non-ASCII characters are parsed incorrectly. #247

Closed kotoshii closed 1 year ago

kotoshii commented 1 year ago

When I try to access non-ASCII text (Cyrillic in my case) via innerText, innerHTML, or getAttribute(), I get something like "������ ������� (����)" every time. Am I doing something wrong, or does the library not support non-ASCII characters?

kotoshii commented 1 year ago

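The problem was with the HTML response encoding: the page is served with charset=windows-1251, so the bytes were being decoded with the wrong charset before they ever reached the parser. For anyone coming from Google, I used the following solution: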

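import axios from "axios";
import iconv from "iconv-lite";
import { parse } from "node-html-parser";

// Request the raw bytes so axios does not decode the body as UTF-8.
const res = await axios.get(pageUrl, {
  responseType: "arraybuffer",
});
// Decode the bytes as windows-1251; iconv.decode() already returns a string.
const parsedPage = parse(iconv.decode(Buffer.from(res.data), "windows-1251"));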
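
If the charset is not known in advance, one option is to read it from the Content-Type header instead of hard-coding it. The following is only a rough sketch: `charsetFromContentType` and `decodeBody` are hypothetical helper names, not part of axios, iconv-lite, or node-html-parser. It uses the built-in `TextDecoder` in place of iconv-lite, on the assumption that the runtime recognises the charset label (official Node.js builds ship with full ICU, which covers windows-1251). It also reproduces the symptom from the original report by decoding the same bytes as UTF-8.

```javascript
// Hypothetical helper: extract the charset from a Content-Type header value,
// falling back to UTF-8 when none is declared.
function charsetFromContentType(contentType, fallback = "utf-8") {
  const match = /charset\s*=\s*"?([^";\s]+)"?/i.exec(contentType || "");
  return match ? match[1].toLowerCase() : fallback;
}

// Hypothetical helper: decode raw response bytes using the declared charset.
// Assumes TextDecoder supports the label; it throws a RangeError otherwise.
function decodeBody(bytes, contentType) {
  const charset = charsetFromContentType(contentType);
  return new TextDecoder(charset).decode(bytes);
}

// "Привет" encoded as windows-1251, one byte per letter.
const sample = Uint8Array.from([0xcf, 0xf0, 0xe8, 0xe2, 0xe5, 0xf2]);

// Decoding with the declared charset recovers the text.
const good = decodeBody(sample, "text/html; charset=windows-1251");

// Decoding the same bytes as UTF-8 produces U+FFFD replacement characters,
// which is the symptom described at the top of this issue.
const bad = decodeBody(sample, "text/html");

console.log(good);
console.log(bad);
```

With the axios call above, this would presumably be used as `parse(decodeBody(res.data, res.headers["content-type"]))`. Note that some pages declare their charset only in a `<meta>` tag rather than in the header, which this sketch does not handle.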