postmodern / spidr

A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
MIT License

Is there a way to set Accept-Encoding headers? #43

Closed robfuller closed 2 years ago

robfuller commented 8 years ago

I have a site to spider, https://www.logility.com, but it's failing with: ruby/2.2.0/net/http/response.rb:377:in `inflate': incorrect header check (Zlib::DataError)

If I set Accept-Encoding: plain it should work apparently (it then works via open-uri anyway).

robfuller commented 8 years ago

What I did for now is modify agent.prepare_request to look at the host_headers passed in: if a value is a Hash, its entries are set as request headers; if it's not, the current behavior (setting the Host header) is kept.

unless @host_headers.empty?
    @host_headers.each do |name,header|
      if host.match(name)
        if header.is_a?(Hash)
          header.each do |header_name, header_value|
            headers[header_name] = header_value
          end
        else
          headers['Host'] = header
        end
      end
    end
end
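
For readers who want to try the idea outside of Spidr, the same logic can be sketched as a standalone method. This is only an illustration: `apply_host_headers` and its arguments are hypothetical names, not part of Spidr's API, and the hostnames in the usage lines are placeholders.

```ruby
# Hypothetical standalone version of the patched logic above.
# host_headers maps a host pattern (String or Regexp) to either a Host
# header value (the existing behavior) or a Hash of extra request headers.
def apply_host_headers(host, host_headers, headers = {})
  host_headers.each do |pattern, value|
    next unless host.match(pattern)

    if value.is_a?(Hash)
      # New behavior: copy every header from the Hash into the request headers.
      headers.merge!(value)
    else
      # Existing behavior: treat the value as a Host header override.
      headers['Host'] = value
    end
  end

  headers
end

# A Hash value sets arbitrary headers for matching hosts.
custom = apply_host_headers('www.logility.com', { /logility/ => { 'Accept-Encoding' => 'plain' } })

# A plain String value keeps the original Host-override behavior.
legacy = apply_host_headers('www.example.com', { /example/ => 'virtual.example.com' })
```

Keeping the String case untouched means existing host_headers configurations would continue to work as before.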

I think the polite thing for me to do is to propose this as a pull request? I'm not sure how to do that (it took me a few minutes to figure out how to create/test a local gem to make sure this worked), so let me know if you would like me to figure out how to submit this as a change.

postmodern commented 8 years ago

Going to need a little more info to isolate the root cause. I'm not sure whether it's the site or Spidr::Agent which is not following HTTP/1.1.

robfuller commented 8 years ago

I've run my code on ~85 sites - only had the issue on https://www.logility.com/

Patching Agent.prepare_request with the above code, and then passing in "Accept-Encoding: plain" did work

So my guess is that the server is doing something wrong with its deflate/gzip encoding, so telling it not to encode works. Since I can't change the server, I needed to address it on the scanning side.
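
One well-known way to get this exact exception, offered here only as a possible explanation and not confirmed for this site, is a server that labels a raw deflate stream (no zlib or gzip wrapper) as a compressed body. The sketch below reproduces the error locally with Ruby's zlib bindings. The helper names are hypothetical, and it assumes the client inflates with automatic zlib/gzip header detection (window bits of 32 + Zlib::MAX_WBITS).

```ruby
require 'zlib'

# Compress without a zlib or gzip wrapper (negative window bits = raw deflate).
def raw_deflate(data)
  deflater = Zlib::Deflate.new(Zlib::DEFAULT_COMPRESSION, -Zlib::MAX_WBITS)
  deflater.deflate(data, Zlib::FINISH)
ensure
  deflater&.close
end

# Inflate while auto-detecting a zlib or gzip header (32 + MAX_WBITS).
# Raw deflate data has neither header, so this is expected to raise.
def inflate_with_header_detection(data)
  Zlib::Inflate.new(32 + Zlib::MAX_WBITS).inflate(data)
end

# Inflate a raw deflate stream directly, skipping header detection.
def inflate_raw(data)
  Zlib::Inflate.new(-Zlib::MAX_WBITS).inflate(data)
end

compressed = raw_deflate('hello')

begin
  inflate_with_header_detection(compressed)
rescue Zlib::DataError => e
  puts "header-detecting inflate failed: #{e.message}"
end

puts "raw inflate recovered: #{inflate_raw(compressed)}"
```

If the server behaves like this, asking it not to compress at all sidesteps the problem, which matches what was observed above.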

robfuller commented 8 years ago

I tried a number of them and they all triggered it (I can't say for sure it's every URL, but I didn't come across any that didn't).

On Wed, Nov 18, 2015 at 8:18 PM, Postmodern notifications@github.com wrote:

Is it every URL on logility.com or a specific URL that triggers it?

postmodern commented 8 years ago

Probably is related to ruby's default headers:

Accept-Encoding: gzip;q=1.0,deflate;q=0.6,identity;q=0.3
Accept: */*
User-Agent: Ruby

robfuller commented 8 years ago

Right, by default Ruby accepts gzip and deflate. It's the decompression that is failing.

Setting the accepted encoding to only plain means no compression.

The problem was that there is no way in Spidr to change that header (no way to override the Ruby default). The code I shared exposes the ability to set that request header manually.
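
The default can be inspected without touching the network. The sketch below builds Net::HTTP request objects only and never sends them; `encoding_settings` is a hypothetical helper name. It relies on Net::HTTP's behavior of adding its own Accept-Encoding header, and decoding the response body itself, only when the caller has not supplied one. The value 'plain' is the one used in this thread; the token defined by the HTTP specification for "no encoding" is identity.

```ruby
require 'net/http'
require 'uri'

# Build (but do not send) a GET request and report the Accept-Encoding
# header it would carry and whether Net::HTTP would decode the body itself.
def encoding_settings(url, headers = nil)
  request = Net::HTTP::Get.new(URI(url), headers)
  [request['Accept-Encoding'], request.decode_content]
end

# No explicit header: Net::HTTP adds its gzip/deflate default and
# takes responsibility for inflating the response.
default_encoding, default_decoding = encoding_settings('https://www.logility.com/')
puts "default:  #{default_encoding.inspect} (auto-decode: #{default_decoding})"

# Explicit header: Net::HTTP leaves it alone and does not inflate anything.
plain_encoding, plain_decoding = encoding_settings('https://www.logility.com/', 'Accept-Encoding' => 'plain')
puts "override: #{plain_encoding.inspect} (auto-decode: #{plain_decoding})"
```

This is why being able to pass a request header through the spider is enough to avoid the failing inflate step.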

postmodern commented 8 years ago

I wonder if this is a bug in Ruby. I may be open to adding another callback to allow setting custom headers, although I don't want to change too much just to work around a bug that may be in Ruby or www.logility.com.

robfuller commented 8 years ago

In searching for that error, I found it has occurred with other libraries as well, so it's sort of a bug in Ruby, I guess. That said, exposing the request headers makes sense, as there could be a lot of reasons to force a specific set.

postmodern commented 2 years ago

Spidr 0.6.0 added Agent#default_headers, which is just a Hash of headers that gets added to every request. Either of the following would fix this. Setting it on an existing agent:

agent.default_headers['Accept-Encoding'] = 'plain'

or passing it in as an option:

default_headers: {'Accept-Encoding' => 'plain'}