jshemas / openGraphScraper

Node.js scraper service for Open Graph Info and More!
MIT License

Fail to detect charset from certain ShiftJIS page #199

Closed (jokester closed this issue 4 months ago)

jokester commented 8 months ago

Describe the bug

open-graph-scraper v6.3.0 couldn't detect the charset of a webpage I came across.

The page has <meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS"> and possibly no other hint about its encoding.

To Reproduce

const openGraphScraper = require('open-graph-scraper')

openGraphScraper({ url: 'http://abehiroshi.la.coocan.jp/' }).then(result => console.debug(result))

Expected behavior

      result: {
        ogTitle: '阿部 寛のホームページ',  // Not sure about this one. Would openGraphScraper convert the text once the correct encoding is detected?
        charset: 'ShiftJIS',
        requestUrl: 'http://abehiroshi.la.coocan.jp/',
        success: true
      },

Actual behavior

      result: {
        ogTitle: '�������̃z�[���y�[�W',
        charset: 'UTF-8',
        requestUrl: 'http://abehiroshi.la.coocan.jp/',
        success: true
      },


jokester commented 8 months ago

The code uses chardet.detect(buffer) as a fallback, but judging from the result it didn't guess correctly.

// lib/fallback.ts
    ogObject.charset = chardet.detect(Buffer.from(body)) || '';

jshemas commented 8 months ago

I've never seen the http-equiv meta tag before. I can add a fallback for it later in the week.
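
For illustration, a fallback of this kind could look roughly like the sketch below. It is not the library's actual implementation: `sniffMetaCharset` is a hypothetical helper, and the regular expression is a deliberately simple approximation that covers both `<meta charset="...">` and the `http-equiv` form quoted in this report.

```javascript
// Hypothetical helper, NOT part of open-graph-scraper's API.
// Looks for a charset declaration inside a <meta> tag and returns the label, or null.
function sniffMetaCharset(markup) {
  // Covers <meta charset="x"> and
  // <meta http-equiv="Content-Type" content="text/html; charset=x">.
  const match = /<meta[^>]*?charset\s*=\s*["']?\s*([^\s"'/>;]+)/i.exec(markup);
  return match ? match[1] : null;
}

const sample =
  '<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">';
console.log(sniffMetaCharset(sample));
```

When sniffing raw bytes, one option is to decode only the first kilobyte or so as latin1 (for example `buffer.subarray(0, 1024).toString('latin1')`), since the tag itself is plain ASCII.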

jokester commented 8 months ago

Another detail I found while debugging: there seem to be corner cases where we cannot simply read the text with body = await res.text() and feed it to chardet.detect(Buffer.from(body)).

At least for this specific webpage, Buffer.from(await res.arrayBuffer()) and Buffer.from(await res.text()) gave different bytes. Maybe res.text() decoded with the wrong encoding, or it is simply not meant for this purpose.

This is a gist to show the difference of bytes and chardet.analyze(): https://gist.github.com/jokester/937c43eb8918e141ef43dc320f38b8d8
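
The difference can be reproduced offline. The sketch below assumes a Node.js build with full ICU, so that `TextDecoder` accepts the `shift_jis` label. It relies on the fact that the Fetch `text()` method always decodes the body as UTF-8, and uses the Shift_JIS bytes for "ホームページ", the tail of the page title above.

```javascript
// Shift_JIS bytes for "ホームページ" (the end of the page title in this report).
const sjisBytes = Buffer.from([
  0x83, 0x7a, 0x81, 0x5b, 0x83, 0x80, 0x83, 0x79, 0x81, 0x5b, 0x83, 0x57,
]);

// Roughly what res.text() does: decode as UTF-8 regardless of the real encoding.
// Bytes that are not valid UTF-8 become U+FFFD.
const asUtf8Text = new TextDecoder('utf-8').decode(sjisBytes);

// Re-encoding the damaged string cannot restore the original bytes.
const roundTripped = Buffer.from(asUtf8Text, 'utf8');

// Decoding the raw bytes with the correct label keeps the text intact.
const asShiftJis = new TextDecoder('shift_jis').decode(sjisBytes);

console.log(asUtf8Text.includes('\uFFFD'));
console.log(roundTripped.equals(sjisBytes));
console.log(asShiftJis);
```

This suggests that charset detection has to run on the bytes from res.arrayBuffer(), not on a string that has already been through res.text().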

jokester commented 8 months ago

In my use case I managed to detect the encoding, convert the bytes, and use openGraphScraper({html}) to get what I needed.

Given how tricky encoding problems are, I guess a perfect fix is hard. The API was flexible enough to allow my workaround 👍🏽.
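
A minimal sketch of the decoding step in such a workaround, under stated assumptions: `decodeHtmlBytes` is a hypothetical helper (not part of open-graph-scraper), the charset label is assumed to have been detected already, and `TextDecoder` is assumed to support that label (Node.js with full ICU). This is not necessarily the code used in the workaround described above.

```javascript
// Hypothetical helper, NOT part of open-graph-scraper.
// Decodes raw response bytes with the given charset label, falling back to UTF-8.
function decodeHtmlBytes(bytes, charsetLabel) {
  let decoder;
  try {
    decoder = new TextDecoder(charsetLabel || 'utf-8');
  } catch (err) {
    // TextDecoder throws on labels it does not recognise; fall back to UTF-8.
    decoder = new TextDecoder('utf-8');
  }
  return decoder.decode(bytes);
}

// Stand-in for Buffer.from(await res.arrayBuffer()): a tiny Shift_JIS document.
const rawBody = Buffer.concat([
  Buffer.from('<title>', 'latin1'),
  Buffer.from([0x83, 0x7a, 0x81, 0x5b, 0x83, 0x80, 0x83, 0x79, 0x81, 0x5b, 0x83, 0x57]),
  Buffer.from('</title>', 'latin1'),
]);

const html = decodeHtmlBytes(rawBody, 'Shift_JIS');
console.log(html);
```

The resulting string can then be passed as openGraphScraper({ html }), as mentioned in the comment above.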

jshemas commented 8 months ago

I've updated the charset fallback in open-graph-scraper@6.3.2. I'm also getting weird/different results between Buffer.from(await res.arrayBuffer()) and Buffer.from(await res.text()) for this page, but other ShiftJIS pages seem to work just fine. Are you seeing this issue with other sites?

jokester commented 8 months ago

Sorry, I don't have other similar cases at hand. Thanks for the fix, it should make this library more complete 👍🏽

jokester commented 8 months ago

I had another look at the "corrupted" ShiftJIS text in the gist. In the suspicious res.text() bytes, many Japanese characters have been replaced by U+FFFD (REPLACEMENT CHARACTER).

cm-dyoshikawa commented 5 months ago

@jshemas @jokester

https://github.com/jshemas/openGraphScraper/pull/206

I'm working on this issue.

I hope that users of openGraphScraper won't have to worry about character sets. I therefore propose a feature that detects the character set and decodes the content to UTF-8 when fetching a website.

jshemas commented 4 months ago

@jokester @cm-dyoshikawa The fix is live in open-graph-scraper@6.4.0!

cm-dyoshikawa commented 4 months ago

With this change, users of openGraphScraper should no longer need to be aware of character encodings. This will be very useful to me, since I live in a Japanese-speaking country where Shift_JIS sites still exist. Thank you.