rivermont / spidy

The simple, easy-to-use command-line web crawler.
GNU General Public License v3.0

Respect robots.txt #30

Closed: rivermont closed this issue 7 years ago

rivermont commented 7 years ago

There should be an option (which can be disabled) to skip links that are forbidden by a site's robots.txt. Another library, or a sizeable regex, might be needed to parse out the domain a page is on.