jacktuck / unfurl

Metadata scraper with support for oEmbed, Twitter Cards and Open Graph Protocol for Node.js :zap:
MIT License
481 stars 52 forks source link

Non ASCII characters garbled in unfurl response #29

Closed vivekatro closed 6 years ago

vivekatro commented 6 years ago

https://unfurl.now.sh/?url=https://www.yahoo.co.jp AL the Japanese chars are messed up.

jacktuck commented 6 years ago

@vivekatro Good find! I had noticed this before on a similar domain. Would you like to make a pull request :) ?

Jerczu commented 6 years ago

Pretty simple way to fix just use encodeURIComponent() when collecting strings and decodeURIComponent() when displaying them back this should preserve japanese characters.

jacktuck commented 6 years ago

@vivekatro I (think) the issue is over at micro-unfurl see my PR https://github.com/beeman/micro-unfurl/pull/3

beeman commented 6 years ago

Issue fixed by @jacktuck and deployed :)

jacktuck commented 6 years ago

<3 thanks @beeman and @vivekatro

vivekatro commented 6 years ago

Thanks @jacktuck for fixing it quickly!

jacktuck commented 6 years ago

@vivekatro Turns out this was also a bug for unfurl.js - we didn't handle servers responding with multibyte encodings. I noticed this when http://qq.com returned charset as GB2312.

Other encodings are:

Japanese: Shift_JIS, Windows-31j, Windows932, EUC-JP
Chinese: GB2312, GBK, GB18030, Windows936, EUC-CN
Korean: KS_C_5601, Windows949, EUC-KR
Taiwan/Hong Kong: Big5, Big5-HKSCS, Windows950

This PR has been merged and should now be fixed in unfurl.js https://github.com/jacktuck/unfurl/pull/31

And there's an open PR https://github.com/beeman/micro-unfurl/pull/4 which bumps micro-unfurl to use this release.

There is also a tagged release on npm 1.1.7 which you can install with npm install unfurl.js@1.1.7 or npm install unfurl.js@beta

Leaving this open for now until I do a release under latest tag.

jacktuck commented 6 years ago

I need to fix the benchmarks at some point too and see how this fix impacts performance.

beeman commented 6 years ago

image Should be fixed thanks to @jacktuck !

vivekatro commented 6 years ago

It seems like on the new version, the structure of the response has changed, with 1.1.6 for https://www.yahoo.co.jp I get following response,

{
  "other" : {
    "description" : "日本最大級のポータルサイト。検索、オークション、ニュース、天気、スポーツ、メール、ショッピングなど多数のサービスを展開。あなたの生活をより豊かにする「課題解決エンジン」を目指していきます。",
    "robots" : "noodp",
    "googleSiteVerification" : "fsLMOiigp5fIpCDMEVodQnQC7jIY1K3UXW5QkQcBmVs",
    "alternate" : "https://m.yahoo.co.jp/",
    "canonical" : "https://www.yahoo.co.jp/",
    "fbAppId" : "472870002762883",
    "title" : "Yahoo! JAPAN\n",
    "stylesheet" : "//s.yimg.jp/images/top/sp2/clr/180312/1.css"
  },
  "ogp" : {
    "ogTitle" : "Yahoo! JAPAN",
    "ogType" : "article",
    "ogUrl" : "https://www.yahoo.co.jp/",
    "ogImage" : [ {
      "url" : "https://s.yimg.jp/images/top/ogp/fb_y_1500px.png"
    } ],
    "ogDescription" : "日本最大級のポータルサイト。検索、オークション、ニュース、天気、スポーツ、メール、ショッピングなど多数のサービスを展開。あなたの生活をより豊かにする「課題解決エンジン」を目指していきます。",
    "ogSiteName" : "Yahoo! JAPAN"
  },
  "twitter" : {
    "twitterCard" : "summary_large_image",
    "twitterSite" : "@Yahoo_JAPAN_PR",
    "twitterTitle" : "Yahoo! JAPAN",
    "twitterDescription" : "日本最大級のポータルサイト。検索、オークション、ニュース、天気、スポーツ、メール、ショッピングなど多数のサービスを展開。あなたの生活をより豊かにする「課題解決エンジン」を目指していきます。",
    "twitterImage" : [ {
      "url" : "https://s.yimg.jp/images/top/ogp/tw_y_1400px.png"
    } ]
  }
}

But with the latest version, the twitter and ogp sections are not coming.

jacktuck commented 6 years ago

@vivekatro Thanks for this i'll take a look this evening

vivekatro commented 6 years ago

https://unfurl.now.sh/?url=https://www.rakuten.co.jp breaks, I get a timeout error. but it used to work with 1.1.6

jacktuck commented 6 years ago

@vivekatro Default timeout is 2000ms and this site takes 8000ms for me, you can pass your own timeout, though. For example, you can set a timeout of 20 seconds like so: unfurl(url, { timeout: 20 * 1000 } ).

jacktuck commented 6 years ago

Mm actually 1.1.6 used request rather than node-fetch so it had a higher timeout. I'll make a change to match the old timeout i think too or just increase the default to something sensible. Good find.

From request: timeout - integer containing the number of milliseconds to wait for a server to send response headers (and start the response body) before aborting the request. Note that if the underlying TCP connection cannot be established, the OS-wide TCP connection timeout will overrule the timeout option (the default in Linux can be anywhere from 20-120 seconds).

jacktuck commented 6 years ago

Timeout should default to OS limit again now in version 1.1.8-beta.2

vivekatro commented 6 years ago

Were you able to check on yahoo.co.jp unfurling missing ogp and Twitter in the new version?

jacktuck commented 6 years ago

@vivekatro Not yet. I need to make a test demonstrating whats missing etc to make sure it doesn't regress again.

jacktuck commented 6 years ago

@vivekatro Good spot on yahoo.co.jp. Just found the issue which was User-Agent was not being set properly in the new release. It just so happened yahoo will not send you meta tags unless you are a bot, which is why setting User-Agent to facebook's (the default user-agent) works.

I've fixed it on the newest prerelease of unfurl.js@2.x.x (see README first), let me know if you'd like me to patch 1.6.x too, it's trivial to do so. :)