rivermont / spidy

The simple, easy to use command line web crawler.
GNU General Public License v3.0

Save robots.txt results #62

Closed rivermont closed 6 years ago

rivermont commented 6 years ago

Currently, a request is sent for a site's robots.txt every time a link is crawled. It would be much faster if the results of a robots.txt query were saved in some database, so only one request per site would need to be sent.

syre commented 6 years ago

Perhaps the result could be stored in a dictionary with URLs as keys, if it only needs to be stored once per run?
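
A minimal sketch of that dictionary-cache idea, assuming the key is the URL's scheme + host (since one robots.txt applies per site) and using Python's `urllib.robotparser`; the names `ROBOTS_CACHE` and `can_fetch_cached` are hypothetical, not part of spidy's code:

```python
from urllib import robotparser
from urllib.parse import urlparse

# Hypothetical per-run cache: maps 'scheme://netloc' -> RobotFileParser
ROBOTS_CACHE = {}

def can_fetch_cached(url, user_agent='*'):
    """Return True if `url` may be crawled, fetching robots.txt at most once per site."""
    parts = urlparse(url)
    site = '{0}://{1}'.format(parts.scheme, parts.netloc)
    parser = ROBOTS_CACHE.get(site)
    if parser is None:
        parser = robotparser.RobotFileParser(site + '/robots.txt')
        parser.read()  # the single network request for this site
        ROBOTS_CACHE[site] = parser
    return parser.can_fetch(user_agent, url)
```

A real implementation would also need to handle fetch failures (e.g. a timeout or 404 on robots.txt) and decide whether to treat them as allow-all or skip the site.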