nrabinowitz / pjscrape

A web-scraping framework written in JavaScript, using PhantomJS and jQuery
http://nrabinowitz.github.io/pjscrape/
MIT License
996 stars, 159 forks

_pjs.toFullUrl can't properly handle //domain.tld #62

Closed: Marooned-MB closed this issue 9 years ago

Marooned-MB commented 9 years ago

Protocol-relative links such as //domain.tld/ are a common way to support both http and https. They should be expanded to http://domain.tld/ on an http site and to https://domain.tld/ on an https site.

_pjs.toFullUrl() instead expands such links to base//domain.tld/, which is wrong.

A simple real-world example: on http://en.wikipedia.org/wiki/A the footer links //wikimediafoundation.org/ and //www.mediawiki.org/ are expanded to http://en.wikipedia.org//wikimediafoundation.org/ and http://en.wikipedia.org//www.mediawiki.org/.
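For illustration, the expected behaviour can be sketched roughly as follows. This is not pjscrape's actual `_pjs.toFullUrl` code. The function name `resolveUrl` and its two-argument signature are hypothetical, and the sketch does not normalize `../` or `./` path segments. The point it shows is that the `//` prefix has to be checked before the single-`/` (root-relative) case, otherwise a protocol-relative link is treated as a path on the current host.

```javascript
// Hypothetical helper for illustration only; not part of pjscrape.
function resolveUrl(href, pageUrl) {
    // Split the page URL into its protocol ("http:") and host ("en.wikipedia.org").
    var match = /^([a-z][a-z0-9+.-]*:)\/\/([^\/?#]+)/i.exec(pageUrl);
    if (!match) {
        throw new Error('pageUrl must be absolute: ' + pageUrl);
    }
    var protocol = match[1];
    var origin = protocol + '//' + match[2];

    // Already absolute (starts with a scheme): return unchanged.
    if (/^[a-z][a-z0-9+.-]*:/i.test(href)) {
        return href;
    }
    // Protocol-relative: reuse the page's protocol.
    // This check must come before the root-relative check below.
    if (href.indexOf('//') === 0) {
        return protocol + href;
    }
    // Root-relative: append to the page's origin.
    if (href.charAt(0) === '/') {
        return origin + href;
    }
    // Path-relative: resolve against the directory of the current page,
    // ignoring any query string or fragment on the page URL.
    var pagePath = pageUrl.slice(origin.length).split(/[?#]/)[0];
    var dir = pagePath.slice(0, pagePath.lastIndexOf('/') + 1) || '/';
    return origin + dir + href;
}

console.log(resolveUrl('//wikimediafoundation.org/', 'http://en.wikipedia.org/wiki/A'));
// prints "http://wikimediafoundation.org/"
console.log(resolveUrl('//www.mediawiki.org/', 'https://en.wikipedia.org/wiki/A'));
// prints "https://www.mediawiki.org/"
```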

Marooned-MB commented 9 years ago

Hmm, I just found out that this is a duplicate of #46, but that issue is closed even though the problem still exists.

Marooned-MB commented 9 years ago

OK, it's all clear now. I got v0.1.4 from http://nrabinowitz.github.io/pjscrape/, which is outdated. Sorry for the confusion :)