postmodern / spidr

A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
MIT License

Spidering pages with no content-type header #32

Closed · bcobb closed this issue 12 years ago

bcobb commented 12 years ago

We ran into a scenario where we tried to spider a customer's site for certain keywords -- keywords that were present when we viewed the site in a browser -- but could not locate any of them using some variant of page.search('//body').text.include?(keyword). Ultimately, page.search('//body') returned an empty array because this customer's web server does not return a Content-Type header, so the response is never parsed into HTML or XML.

What are your thoughts on attempting to parse pages that have no Content-Type header as HTML? This matches the behavior of current web browsers and, at first glance, makes the spider more intuitive to use. I'm happy to work on it, but I may be missing a compelling reason to simply ignore such pages.
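In the meantime, here's roughly the fallback we've been using. It's only a sketch -- the keyword and URL are placeholders, and it assumes page.content_type comes back as an empty string when the header is absent:

# Workaround sketch: when a response carries no Content-Type header,
# parse the raw body as HTML ourselves instead of relying on page.doc.
require 'spidr'
require 'nokogiri'

keyword = 'sofa'  # placeholder keyword

Spidr.site('http://example.com') do |spider|
  spider.every_page do |page|
    # Fall back to parsing the raw body as HTML, the way a browser would,
    # if the server sent no Content-Type header.
    doc = if page.content_type.empty?
            Nokogiri::HTML(page.body)
          else
            page.doc
          end

    if doc && doc.search('//body').text.include?(keyword)
      puts "found #{keyword.inspect} on #{page.url}"
    end
  end
end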

postmodern commented 12 years ago

This is interesting. Do you have a URI I can test, and did you check that page.body contained the HTML? The browser could be sniffing for a DOCTYPE when Content-Type is missing. The other possibility is that the web server is returning empty responses because Spidr doesn't send a User-Agent by default.
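Both theories are easy to check from an every_page callback. Here's a sketch -- example.com is a placeholder, and again it assumes page.content_type returns an empty string when the header is absent:

require 'spidr'

Spidr.site('http://example.com', :user_agent => 'Mozilla/5.0 (compatible)') do |spider|
  spider.every_page do |page|
    # Does the response even have a Content-Type, and does the raw
    # body lead with a DOCTYPE the browser could be sniffing?
    no_type = page.content_type.empty?
    doctype = !!(page.body =~ /\A\s*<!DOCTYPE/i)

    puts "#{page.url}: no Content-Type=#{no_type}, DOCTYPE=#{doctype}, #{page.body.size} bytes"
  end
end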

bcobb commented 12 years ago

The URI that exposed the issue was http://offfurn.com. curl -I http://offfurn.com shows:

HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Set-Cookie: fwww=a142366863a0d28237078f5fc6c3894262f4834bfc675afc7a85356742ff9705; Path=/
Date: Mon, 14 May 2012 16:14:23 GMT
Xonnection: close

I have no idea why they send Xonnection instead of Connection, but that's a discussion for someone else's issue tracker :smiley:

We do specify a User-Agent when we spider sites, so I don't think the problem is a missing User-Agent:

# UA is our application's User-Agent string
>> size = 0
=> 0
>> Spidr.site('http://offfurn.com', :hosts => [/.*offfurn.com.*/], :user_agent => UA) do |spidr| 
     spidr.every_page { |page| size = page.body.size }
   end
=> #<Spidr::Agent:...>
>> size
=> 31885

page.body above matches the expected HTML, too. It's not byte-for-byte identical to the output of curl, but the difference is only 15 bytes (31900 vs. 31885):

% curl http://offfurn.com/ | wc
329    1866   31900

That should give a better idea of what we're seeing.

bcobb commented 12 years ago

And, for what it's worth, I see those same headers in the Chrome inspector.

postmodern commented 12 years ago

If the server is returning non-compliant headers, I think the server is broken. :(
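That said, anyone who needs to cope with such servers today could force HTML parsing on their side. A rough monkey-patch sketch -- it assumes Page#content_type returns an empty string when the header is absent and that Page#doc consults Page#html?, so verify both against the Spidr version in use:

require 'spidr'

module Spidr
  class Page
    # Keep the strict check around, then treat a missing Content-Type
    # as HTML, the way browsers effectively do.
    alias_method :strict_html?, :html?

    def html?
      content_type.empty? || strict_html?
    end
  end
end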