nrabinowitz / pjscrape

A web-scraping framework written in Javascript, using PhantomJS and jQuery
http://nrabinowitz.github.io/pjscrape/
MIT License
996 stars, 159 forks

getAnchorUrls does not work on href="//foo.com" links #46

Closed Uelb closed 10 years ago

Uelb commented 10 years ago

Sometimes an a tag contains an href like "//www.foo.com/bar". This is a protocol-relative URL: it reuses the protocol of the current window, but it is otherwise an absolute URL, not a relative one.

For example, on the YouTube home page:

<a href="//www.youtube.com/upload">...</a>

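The pjscrape getAnchorUrls function treats such an href as a relative path and appends it to the current site's base URL. For "//www.foo.com/bar" it returns http://www.foo.com//www.foo.com/bar, which leads to a 404 error when the page is visited.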
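To make the expected behaviour concrete, here is a minimal sketch of how an href could be resolved against the page URL. The `resolveHref` helper and its parameters are hypothetical names chosen for illustration, not pjscrape's actual code, and the sketch deliberately ignores `../` segments, query-only and fragment-only hrefs.

```javascript
// Hypothetical helper for illustration only; this is not pjscrape's implementation.
// Resolves an href found on a page against that page's absolute URL.
function resolveHref(href, pageUrl) {
    // Split the page URL into protocol ("http:"), host, and path.
    var match = /^([a-z][a-z0-9+.\-]*:)\/\/([^\/?#]+)([^?#]*)/i.exec(pageUrl);
    if (!match) {
        throw new Error('pageUrl must be an absolute URL: ' + pageUrl);
    }
    var protocol = match[1];
    var origin = protocol + '//' + match[2];
    var path = match[3] || '/';

    // Already absolute, e.g. "https://foo.com/bar": return unchanged.
    if (/^[a-z][a-z0-9+.\-]*:/i.test(href)) {
        return href;
    }
    // Protocol-relative, e.g. "//www.foo.com/bar": only the protocol is inherited.
    if (href.indexOf('//') === 0) {
        return protocol + href;
    }
    // Root-relative, e.g. "/bar": keep protocol and host, replace the path.
    if (href.charAt(0) === '/') {
        return origin + href;
    }
    // Path-relative, e.g. "bar": resolve against the page's directory.
    var dir = path.substring(0, path.lastIndexOf('/') + 1);
    return origin + dir + href;
}
```

The important point is that the protocol-relative check has to come before the root-relative check, since both kinds of href start with a slash. With this ordering, `resolveHref('//www.youtube.com/upload', 'http://www.youtube.com/')` yields `http://www.youtube.com/upload` instead of a doubled host.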