propublica / upton

A batteries-included framework for easy web-scraping. Just add CSS! (Or do more.)
MIT License
1.62k stars 113 forks source link

relative url edge cases #16

Closed jeremybmerrill closed 11 years ago

jeremybmerrill commented 11 years ago

As @dannguyen notes with respect to fix for #14 / #8 :

should handle the following non-absolute href possibilities: //anothersite.com (keeps scheme, too!) /root/dir relative/dir ?query=2 #bang

Not a priority, and may require some refactoring of resolve_url (but I don't think this will break anything)...

dannguyen commented 11 years ago

Doesn't it already? The tests pass on my machine: https://github.com/propublica/upton/blob/master/spec/unit/utils_spec.rb

jeremybmerrill commented 11 years ago

Heh. I just assumed because of the TODO that it was still undone. Yeah, tests pass for me. blushes

dannguyen commented 11 years ago

Also, for Upton's purposes, it shoudn't matter if it handles hashbangs correctly (hard to think of a case where someone is actively trying to scrape hrefs that may contain a hashbang), but the URI.parse and URI.join methods take care of it by default

jeremybmerrill commented 11 years ago

Fixed the tests for the little changes I made, e.g. moving resolve_url into upton.rb (since I think it's reasonably central to Upton's functionality, whereas the Utils stuff is ancillary).

Yeah, I'm not sure what behavior would be if a link has a hashbang in it, but... maybe there's a case. Glad those methods deal with it for us.