jokester closed this issue 4 months ago
The code used chardet(buffer) as a fallback, but given the result I think it didn't guess well.
// lib/fallback.ts
ogObject.charset = chardet.detect(Buffer.from(body)) || '';
I've never seen the http-equiv meta tag before. I can add a fallback for this later in the week.
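For readers following along, a fallback of this kind could check the markup for a declared charset before handing the bytes to chardet. The sketch below is only an illustration, not the change that later shipped in the library; `sniffMetaCharset` is a hypothetical helper name, and the regular expression is a simplification of real HTML encoding sniffing.

```typescript
// Hypothetical helper, not part of open-graph-scraper's API.
// Looks for a charset declaration in either <meta charset="..."> or
// <meta http-equiv="Content-Type" content="...; charset=...">.
function sniffMetaCharset(head: string): string | null {
  const metaTags = head.match(/<meta\b[^>]*>/gi) ?? [];
  for (const tag of metaTags) {
    // Capture the label after "charset=", with or without quotes.
    const match = /charset\s*=\s*["']?\s*([\w.:-]+)/i.exec(tag);
    if (match) return match[1].toLowerCase();
  }
  return null; // nothing declared: the caller can fall back to chardet
}

// The tag reported in this issue:
const declared = sniffMetaCharset(
  '<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">',
);
console.log(declared); // "shift_jis"
```

Because the tag itself is plain ASCII, it can be located even before the page's real encoding is known, for example by scanning only the first kilobyte or so of the response.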
Another detail I found during debugging: I guess there is a corner case where we cannot just read the text with body = await res.text() and feed it to chardet.detect(Buffer.from(body)).
At least for this specific webpage, Buffer.from(await res.arrayBuffer()) and Buffer.from(await res.text()) gave different bytes. Maybe res.text() lacked the correct encoding or was not meant for this purpose.
This is a gist showing the difference in bytes and in chardet.analyze() output: https://gist.github.com/jokester/937c43eb8918e141ef43dc320f38b8d8
In my use case I managed to detect the encoding, convert the bytes, and use openGraphScraper({html}) to get what I needed.
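The comment above does not say which decoder was used for the conversion step. As a minimal sketch, assuming the runtime's built-in TextDecoder supports the Shift_JIS label (recent Node.js builds with full ICU do), the step could look like this; `decodeHtml` is a hypothetical name.

```typescript
// Hypothetical helper: decode raw response bytes using a declared charset,
// falling back to UTF-8 when the label is missing or unsupported.
function decodeHtml(bytes: Uint8Array, charset: string | null): string {
  try {
    return new TextDecoder(charset ?? 'utf-8').decode(bytes);
  } catch {
    // The TextDecoder constructor throws a RangeError for unknown labels.
    return new TextDecoder('utf-8').decode(bytes);
  }
}

// "日本語" encoded as Shift_JIS, standing in for a real response body.
const shiftJisBytes = new Uint8Array([0x93, 0xfa, 0x96, 0x7b, 0x8c, 0xea]);
const html = decodeHtml(shiftJisBytes, 'shift_jis');
console.log(html);
```

With a real page, the bytes would come from new Uint8Array(await res.arrayBuffer()) rather than res.text(), and the decoded string would then be passed as openGraphScraper({ html }).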
Considering how tricky encoding problems are, I guess it's hard to do a perfect fix. The API was flexible enough to allow my workaround 👍🏽 .
I've updated the charset fallback in open-graph-scraper@6.3.2. I'm also getting weird/different results between Buffer.from(await res.arrayBuffer()) and Buffer.from(await res.text()) for this page, but other ShiftJIS pages seem to work just fine. Are you seeing this issue with other sites?
Sorry I don't have other similar cases at hand. Thanks for the fix, it should make this library more complete 👍🏽
I had another look at the "corrupted" ShiftJIS text in the gist. In the suspicious res.text() bytes, a lot of Japanese characters are replaced by U+FFFD "Replacement Character".
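This matches how text() is specified: the Fetch Standard has it decode the body as UTF-8, and a UTF-8 decoder replaces every invalid sequence with U+FFFD. The sketch below reproduces the effect without any network access, using TextDecoder and TextEncoder as stand-ins for res.text() and Buffer.from(text); the sample bytes are "日本語" in Shift_JIS and are not taken from the page in this issue.

```typescript
// "日本語" encoded as Shift_JIS. These bytes are not valid UTF-8.
const rawBytes = new Uint8Array([0x93, 0xfa, 0x96, 0x7b, 0x8c, 0xea]);

// Stand-in for res.text(): decode as UTF-8, replacing invalid input with U+FFFD.
const asText = new TextDecoder('utf-8').decode(rawBytes);

// Stand-in for Buffer.from(text): encode the string back to UTF-8 bytes.
const roundTripped = new TextEncoder().encode(asText);

const toHex = (bytes: Uint8Array): string =>
  Array.from(bytes, (b) => ('0' + b.toString(16)).slice(-2)).join(' ');

console.log(toHex(rawBytes)); // the original Shift_JIS bytes
console.log(toHex(roundTripped)); // contains "ef bf bd", the UTF-8 form of U+FFFD
console.log(asText.includes('\uFFFD')); // true
```

Once the replacement has happened, the original bytes cannot be recovered, so any charset detection has to run on the arrayBuffer() bytes and not on the decoded text.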
@jshemas @jokester
https://github.com/jshemas/openGraphScraper/pull/206
I'm working on this issue.
I hope that users of openGraphScraper won't have to worry about character sets. Therefore, I will suggest implementing a feature to check the character set and decode it to UTF-8 when fetching a website.
@jokester @cm-dyoshikawa fix is live in open-graph-scraper@6.4.0!
With this change, users of openGraphScraper should no longer need to be aware of character encodings. This will be very useful since I am in a Japanese-speaking country and still have Shift_JIS sites. Thank you.
Describe the bug
openGraphScraper v6.3.0 couldn't detect the charset of a webpage I came across.
The page had
<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">
and maybe no other clue.
To Reproduce
Expected behavior
Actual behavior