postmodern / spidr

A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
MIT License

URL Normalization. #1

Closed justfalter closed 15 years ago

justfalter commented 15 years ago

Hey postmodern,

It seems like the usage of File.expand_path in Spidr::Page#to_absolute can goof up URLs in a very minor way. Observe:

    irb(main):001:0> a = '/somedir/'
    => "/somedir/"
    irb(main):002:0> File.expand_path(a)
    => "/somedir"

Imagine that you go to a site 'http://www.foo.com/somedir' ... '/somedir' is a directory, and the server responds with:

    HTTP/1.1 301 Moved Permanently
    Date: Mon, 21 Sep 2009 23:08:19 GMT
    Server: Apache/2.0.63 (CentOS)
    Location: http://www.foo.com/somedir/
    .....

Requesting 'http://www.foo.com/somedir/' yields:

    HTTP/1.1 200 OK
    Date: Mon, 21 Sep 2009 23:10:32 GMT
    Server: Apache/2.0.63 (CentOS)
    .....

When to_absolute normalizes 'http://www.foo.com/somedir/', it ends up coming out of the method as 'http://www.foo.com/somedir', which it has already visited.

In the real world 'http://www.foo.com/somedir' != 'http://www.foo.com/somedir/' ... File.expand_path doesn't know the difference between the two, but to an HTTP server they are two different things.
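To make the difference concrete, here is a minimal sketch using only Ruby's standard library. The variable names are illustrative and not taken from Spidr, and the File.expand_path result assumes a Unix-like system:

```ruby
require 'uri'

with_slash    = URI.parse('http://www.foo.com/somedir/')
without_slash = URI.parse('http://www.foo.com/somedir')

# URI compares components, so the differing paths make these two distinct.
puts with_slash == without_slash

# File.expand_path treats the trailing slash as insignificant and drops it,
# which collapses the two distinct URLs into the same path.
collapsed = File.expand_path(with_slash.path, '/')
puts collapsed
```

Passing '/' as the second argument only keeps the result independent of the current working directory; the trailing slash is lost either way.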

~Mike

justfalter commented 15 years ago

Whoa, you might want to reevaluate the usage of File.expand_path, period:

    irb(main):001:0> RUBY_VERSION
    => "1.9.1"
    irb(main):002:0> File.expand_path('../foo/bar')
    => "/Users/foo/bar"

(I'm on osx, so it dropped from my home dir to /Users). File.expand_path can take a path as second argument, to which the first will be made relative. It defaults to the current working directory.

So, you could instead do:

    irb(main):005:0> File.expand_path('../foo/bar', '/')
    => "/foo/bar"

While the purist in me says that a request to http://www.foo.com/../foo/bar is a bad request, it looks like Firefox and Safari will normalize this to http://www.foo.com/foo/bar ..

As for the trailing slash, you could just check for the slash at the end of the path, set a flag, and tack it back on afterwards. Or, you could go with http://addressable.rubyforge.org , which is a much more powerful replacement for Ruby's built-in URI code.
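The flag approach could look roughly like the sketch below. The method name `normalize_path` here is a hypothetical stand-alone helper for illustration, not Spidr's actual implementation, and it assumes a Unix-like system where File.expand_path returns plain '/'-rooted paths. Note that File.expand_path also expands a leading '~', so this is only reasonable for paths that start with '/' or a relative segment.

```ruby
# Hypothetical helper (not Spidr's code): resolve '.' and '..' segments
# while preserving a trailing slash.
def normalize_path(path)
  # Remember whether the original path ended in a slash.
  trailing_slash = path.end_with?('/')

  # Resolve against '/' rather than the current working directory.
  normalized = File.expand_path(path, '/')

  # Tack the slash back on, unless the result is already the root path.
  normalized += '/' if trailing_slash && normalized != '/'
  normalized
end

puts normalize_path('/somedir/')
puts normalize_path('../foo/bar')
```

With this, '/somedir' and '/somedir/' stay distinct after normalization, so a spider would not mistake the redirect target for a page it has already visited.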

postmodern commented 15 years ago

Thank you for spotting this. After doing some basic tests, it appears the Addressable library handles all the URL normalization issues that Spidr has run into in the past. I'm going to continue playing with Addressable::URI and see if I can integrate it into Page#to_absolute.

postmodern commented 15 years ago

Page#to_absolute has been replaced by Page#normalize_link and Page#normalize_path. Specs were also added for #normalize_path, to ensure that trailing '/' characters are preserved.