Crawler fails on root-relative and protocol-relative URLs

rivermont / spidy

The simple, easy to use command line web crawler.

GNU General Public License v3.0


The crawler concatenates the child's URI onto the full parent URL instead of resolving it against the parent. On the page https://mysite/folder/page it found the link /js/main.js and built https://mysite/folder/page/js/main.js.

The same thing happens when a link has no protocol declared. On the page https://mysite/folder/page it found the link //subdomain.mysite/images/myimage.png and built https://mysite/folder/page//subdomain.mysite/images/myimage.png.
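For comparison, here is a minimal sketch of the resolution a browser would perform, using `urljoin` from Python's standard library. The helper names `resolve_link` and `naive_concat` are hypothetical and are not part of spidy; `naive_concat` only reproduces the URLs reported above and is not a claim about how spidy builds them internally.

```python
from urllib.parse import urljoin


def resolve_link(parent_url: str, link: str) -> str:
    """Resolve a link found on parent_url the way RFC 3986 describes.

    Hypothetical helper for illustration only, not spidy's API.
    """
    return urljoin(parent_url, link)


def naive_concat(parent_url: str, link: str) -> str:
    """Plain string concatenation, which matches the URLs seen in this report."""
    return parent_url + link


parent = "https://mysite/folder/page"

# Root-relative link: the path replaces the parent's path entirely.
print(resolve_link(parent, "/js/main.js"))
# Protocol-relative link: only the scheme is inherited from the parent.
print(resolve_link(parent, "//subdomain.mysite/images/myimage.png"))
# What the report shows instead:
print(naive_concat(parent, "/js/main.js"))
```

With standard resolution the two links come out as https://mysite/js/main.js and https://subdomain.mysite/images/myimage.png.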

Install:

apt-get install python3 python3-lxml python3-requests
apt-get install python3-pip python-pip
pip3 install spidy-web-crawler

Starting spidy Web Crawler version 1.6.5

Am I the only one with this problem?

Thanks for your help.
