rivermont / spidy

The simple, easy to use command line web crawler.
GNU General Public License v3.0
340 stars, 69 forks

Fail crawling relative url and protocol #73

Closed. iboutillier closed this issue 3 years ago.

iboutillier commented 5 years ago

The crawler appends a child's relative URI to the full parent URL instead of resolving it against the parent. On https://mysite/folder/page it found /js/main.js and tried to crawl https://mysite/folder/page/js/main.js.

The same thing happens when a link has no protocol declared. On https://mysite/folder/page it found //subdomain.mysite/images/myimage.png and tried to crawl https://mysite/folder/page//subdomain.mysite/images/myimage.png.

Install:

apt-get install python3 python3-lxml python3-requests
apt-get install python3-pip python-pip
pip3 install spidy-web-crawler

Starting spidy Web Crawler version 1.6.5

Am I the only one with this problem?

Thanks for your help.

michaelnoguera commented 5 years ago

I'm having the same problem. When my crawler reaches a relative link, instead of going from http://example.com/index.html to http://example.com/about.html, it attempts to go to http://example.com/index.html/about.html, resulting in errors.
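
For reference, the three cases reported in this thread (a root-relative path, a protocol-relative link, and a plain relative file name) are the kind of thing Python's standard `urllib.parse.urljoin` resolves. The snippet below is only a minimal sketch of the expected behaviour: `resolve_link` is a hypothetical helper name, not a function from spidy, and nothing here is meant to describe how spidy handles links internally.

```python
from urllib.parse import urljoin


def resolve_link(page_url: str, href: str) -> str:
    """Resolve a link found on page_url into an absolute URL.

    Hypothetical helper for illustration only; not part of spidy.
    """
    # urljoin applies standard relative-reference resolution:
    # "/path" replaces the parent's path, "//host/path" keeps only the
    # parent's scheme, and "file" replaces the last path segment.
    return urljoin(page_url, href)


# The URLs below are the examples quoted in the comments above.
print(resolve_link("https://mysite/folder/page", "/js/main.js"))
print(resolve_link("https://mysite/folder/page", "//subdomain.mysite/images/myimage.png"))
print(resolve_link("http://example.com/index.html", "about.html"))
```

With this approach the first link resolves to https://mysite/js/main.js, the second to https://subdomain.mysite/images/myimage.png, and the third to http://example.com/about.html, rather than the concatenated URLs described in the reports.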