Closed pcboy closed 9 years ago
I have some issues crawling Japanese websites with SHIFT-JIS encoding. Nokogiri is not doing any automatic charset conversion to UTF-8.
I fixed it by rewriting the Page#doc method and using kconv.
    require 'kconv'

    def doc
      return @doc if @doc
      noko_en_id = {
        Kconv::UTF8  => 'UTF-8',
        Kconv::EUC   => 'EUC-JP',
        Kconv::SJIS  => 'SHIFT-JIS',
        Kconv::ASCII => 'ASCII',
        Kconv::JIS   => 'ISO-2022-JP'
      }[Kconv.guess(@body || '')]
      @doc = Nokogiri::HTML(@body, nil, noko_en_id) if @body && html? rescue nil
    end
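For anyone who wants to check the detection step on its own, here is a minimal sketch that isolates the lookup from the patch above so it can be run without Anemone or Nokogiri. The names `ENCODING_LABELS` and `encoding_label_for` are hypothetical and are not part of Anemone. The sample string is made up for illustration.

```ruby
require 'kconv'

# Hypothetical lookup table: the Encoding objects that Kconv.guess can return,
# mapped to the encoding labels used in the patch above.
ENCODING_LABELS = {
  Kconv::UTF8  => 'UTF-8',
  Kconv::EUC   => 'EUC-JP',
  Kconv::SJIS  => 'SHIFT-JIS',
  Kconv::ASCII => 'ASCII',
  Kconv::JIS   => 'ISO-2022-JP'
}.freeze

# Hypothetical helper: returns the label for the guessed encoding, or nil
# when the body is nil or the guess is not in the table.
def encoding_label_for(body)
  ENCODING_LABELS[Kconv.guess(body || '')]
end

# Made-up sample: a short HTML fragment re-encoded as Shift_JIS.
sjis_body = '<p>こんにちは、世界</p>'.encode('Shift_JIS')
puts encoding_label_for(sjis_body).inspect
```

When the helper returns nil, passing that nil as the third argument to `Nokogiri::HTML` is the same as omitting the argument, so the call should behave like the unpatched one. Note that `Kconv.guess` is a heuristic, so very short or mostly-ASCII bodies may be guessed incorrectly.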
@pcboy Thanks for reporting this. Could you please open a pull request so that your patch can be evaluated and tested?
Thanks!