taganaka / polipus

Polipus: distributed and scalable web-crawler framework
MIT License
92 stars 32 forks

Support for other charsets than UTF-8 #58

pcboy closed 9 years ago

pcboy commented 9 years ago

I have some issues crawling Japanese websites that use the Shift_JIS encoding. Nokogiri does not automatically convert the charset to UTF-8.

I fixed it by rewriting the Page#doc method and using kconv.

    require 'kconv'

    def doc
      return @doc if @doc

      noko_en_id = {
        Kconv::UTF8 => 'UTF-8',
        Kconv::EUC => 'EUC-JP',
        Kconv::SJIS => 'SHIFT-JIS',
        Kconv::ASCII => 'ASCII',
        Kconv::JIS => 'ISO-2022-JP'
      }[Kconv.guess(@body || '')]

      @doc = Nokogiri::HTML(@body, nil, noko_en_id) if @body && html? rescue nil
    end
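
For anyone who wants to see the transcoding step without Nokogiri or kconv, here is a minimal sketch using only Ruby's built-in `String` encoding methods. It swaps `Kconv.guess` for a plain byte-validity check, which is a cruder technique: it cannot reliably tell Shift_JIS from EUC-JP, since the same bytes are often structurally valid in both. The `to_utf8` helper and the `CANDIDATES` list are hypothetical names for illustration and are not part of Polipus.

```ruby
# Hypothetical helper, not part of Polipus: transcode a raw response body to UTF-8.
# CANDIDATES is an assumed priority list; the strictest encoding (UTF-8) goes first.
CANDIDATES = [Encoding::UTF_8, Encoding::Shift_JIS].freeze

def to_utf8(body, candidates = CANDIDATES)
  candidates.each do |enc|
    # Re-tag the bytes (no conversion yet) and check they are well-formed in enc.
    tagged = body.dup.force_encoding(enc)
    next unless tagged.valid_encoding?

    return tagged.encode(Encoding::UTF_8)
  end

  # No candidate matched: replace undecodable bytes instead of raising.
  body.dup.force_encoding(Encoding::BINARY)
      .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
end

# Usage: bytes as they would arrive from a Shift_JIS page.
raw = "\u65E5\u672C\u8A9E".encode(Encoding::Shift_JIS).b
puts to_utf8(raw)
```

The resulting UTF-8 string could then be handed to the parser with an explicit `'UTF-8'` encoding argument, instead of passing the guessed encoding name through as the patch above does.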

taganaka commented 9 years ago

@pcboy Thanks for reporting this. Could you please open a pull request so that your patch can be evaluated and tested?

Thanks!