taganaka / polipus

Polipus: distributed and scalable web-crawler framework
MIT License

Gzip decoded body not used anywhere #20

tmaier closed 10 years ago

tmaier commented 10 years ago

In HTTP#fetch_pages you try to decode the gzipped content of a page.

https://github.com/taganaka/polipus/blob/master/lib/polipus/http.rb#L34-L39

          body = response.body.dup
          if response.to_hash.fetch('content-encoding', [])[0] == 'gzip'
            gzip = Zlib::GzipReader.new(StringIO.new(body))    
            body = gzip.read
          end
          pages << Page.new(location, :body          => response.body.dup,

But the decoded body is not used anywhere: Page.new receives response.body.dup again. :body should get the value of body instead.
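
A minimal sketch of that fix, assuming the only change needed is to hand the decoded string to Page.new. The helper name decode_body and the sample data are hypothetical and not part of Polipus; the decoding itself uses the same Zlib::GzipReader call as the snippet above.

```ruby
require 'zlib'
require 'stringio'

# Hypothetical helper (not part of Polipus): returns the gunzipped body
# when the Content-Encoding header says gzip, otherwise a copy of the raw body.
def decode_body(raw_body, content_encoding)
  return raw_body.dup unless content_encoding == 'gzip'

  gzip = Zlib::GzipReader.new(StringIO.new(raw_body))
  begin
    gzip.read
  ensure
    gzip.close
  end
end

# Build a gzip-compressed sample that stands in for response.body.
def gzip_string(text)
  io = StringIO.new(''.b)
  writer = Zlib::GzipWriter.new(io)
  writer.write(text)
  writer.close
  io.string
end

compressed = gzip_string('<html>hello</html>')
body = decode_body(compressed, 'gzip')
# In fetch_pages the result would then be passed on, e.g. :body => body
```

With this shape, the value passed as :body is the decoded string rather than a second copy of the compressed response.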

In general, I'm not sure if this is necessary at all, as http://www.ruby-doc.org/stdlib-2.1.1/libdoc/net/http/rdoc/Net/HTTP.html#class-Net::HTTP-label-Compression states that this is done by Net::HTTP automatically.

taganaka commented 10 years ago

@tmaier You are right. The current implementation is broken.

Gzip content handling differs slightly between Ruby 1.9.3 and 2.1.1, so decoding the content on the fly is still needed.

Some reference here: http://stackoverflow.com/questions/13397119/ruby-nethttp-not-decoding-gzip
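
One way to cope with that difference is to decode only when the body still looks compressed. The sketch below is an assumption for illustration, not what Polipus does: the helper names gzipped? and ensure_decoded are hypothetical, and the check relies on the two-byte gzip magic number (0x1f 0x8b) rather than on the Content-Encoding header, since the body may or may not have been decoded by Net::HTTP already, depending on the Ruby version and how the request was made.

```ruby
require 'zlib'
require 'stringio'

# The first two bytes of every gzip stream.
GZIP_MAGIC = "\x1f\x8b".b

# Hypothetical helper: true when the body starts with the gzip magic number.
def gzipped?(body)
  body.byteslice(0, 2).to_s.b == GZIP_MAGIC
end

# Hypothetical helper: gunzip the body if it is still compressed,
# otherwise return it untouched. Falls back to the original body
# when the data only looks like gzip but cannot be decoded.
def ensure_decoded(body)
  return body unless gzipped?(body)

  Zlib::GzipReader.new(StringIO.new(body)).read
rescue Zlib::Error
  body
end
```

A body that was already decoded passes through unchanged, so the same call is safe whether or not the decoding happened earlier.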

Let me know what you think about https://github.com/taganaka/polipus/issues/21