Closed: bcobb closed this issue 12 years ago
This is interesting. Do you have a URI I can test, and did you check that `page.body` contained the HTML? The browser could be falling back to testing for a DOCTYPE when `Content-Type` is missing. The other possibility is that the web server is returning empty responses because Spidr does not set a `User-Agent` by default.
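In case it helps, this is the kind of one-off check I'd run (just a sketch; substitute the real URL and your own User-Agent string):

```ruby
require 'spidr'

# Placeholder User-Agent; substitute your application's real string.
ua = 'MyCrawler/1.0'

Spidr.site('http://example.com', :user_agent => ua) do |spidr|
  spidr.every_page do |page|
    puts page.url
    puts "  Content-Type: #{page.content_type.inspect}"  # may be empty if the header is missing
    puts "  Body size:    #{page.body.size}"
  end
end
```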
The URI that exposed the issue was http://offfurn.com. `curl -I http://offfurn.com` shows:
```
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Set-Cookie: fwww=a142366863a0d28237078f5fc6c3894262f4834bfc675afc7a85356742ff9705; Path=/
Date: Mon, 14 May 2012 16:14:23 GMT
Xonnection: close
```
I have no idea why they send `Xonnection` instead of `Connection`, but that's a discussion for someone else's issue tracker :smiley:
We do specify a `User-Agent` when we spider sites, so I don't think it's an issue of a missing user agent:
```
# UA is our application's User-Agent string
>> size = 0
=> 0
>> Spidr.site('http://offfurn.com', :hosts => [/.*offfurn.com.*/], :user_agent => UA) do |spidr|
     spidr.every_page { |page| size = page.body.size }
   end
=> #<Spidr::Agent:...>
>> size
=> 31885
```
`page.body` above matches the expected HTML, too. It's not quite equal to the output of `curl`, but the difference is 5 characters:
```
% curl http://offfurn.com/ | wc
    329    1866   31900
```
That should give a better idea of what we're seeing.
And, for what it's worth, I see those same headers in the Chrome inspector.
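To rule out everything but the parsing step, here is roughly what I'd expect to see if the missing `Content-Type` is the culprit. This is a sketch, not a captured session, and assumes `Page#content_type`, `Page#html?`, `Page#doc`, and `Page#search` behave the way the gem documents them; `UA` is a placeholder as above:

```ruby
require 'spidr'

UA = 'MyCrawler/1.0'  # placeholder for our real User-Agent string

Spidr.site('http://offfurn.com', :user_agent => UA) do |spidr|
  spidr.every_page do |page|
    # The body comes through, but with no Content-Type header the page is
    # never recognized as HTML, so no Nokogiri document gets built for it.
    puts page.content_type.inspect   # expected: empty
    puts page.html?                  # expected: false
    puts page.doc.nil?               # expected: true
    puts page.search('//body').size  # expected: 0, hence the empty search results
  end
end
```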
If the server is returning non-compliant headers, I think the server is broken. :(
We ran into a scenario where we tried to spider a customer's site for certain keywords -- keywords that were present when we viewed the site in a browser -- but could not locate any of them using some flavor of `page.search('//body').text.include?(keyword)`. Ultimately, `page.search('//body')` returned an empty array because this customer's web server does not return a `content-type` header, so the page is never parsed into HTML or XML.

What are your thoughts on attempting to parse pages which have no `content-type` header as HTML? This matches the behavior of current web browsers, and at first glance it makes this spider more intuitive to use. I'm happy to work on it, but I may be missing a compelling reason to simply ignore such pages.
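In the meantime we're working around it on our end by parsing the body ourselves whenever Spidr declines to. This is only a sketch of that idea: it assumes Nokogiri is available and that `page.doc` is nil exactly when the `content-type` is missing or unrecognized, and `UA` and `keyword` are placeholders for our real values:

```ruby
require 'spidr'
require 'nokogiri'

UA      = 'MyCrawler/1.0'  # placeholder User-Agent string
keyword = 'example'        # placeholder keyword we're searching for

Spidr.site('http://offfurn.com', :user_agent => UA) do |spidr|
  spidr.every_page do |page|
    # Fall back to treating the raw body as HTML when Spidr didn't parse it.
    doc = page.doc || Nokogiri::HTML(page.body)

    if doc.search('//body').text.include?(keyword)
      puts "found #{keyword.inspect} on #{page.url}"
    end
  end
end
```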