let4be / crusty

Broad Web Crawler
GNU General Public License v3.0

Improve robots.txt support #33

Closed: let4be closed this issue 3 years ago

let4be commented 3 years ago

We should probably always try to download robots.txt before we access the index page, and if it resolves with a 4xx or 5xx code we should act accordingly, following Google's best practices: https://developers.google.com/search/docs/advanced/robots/robots_txt
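
A minimal sketch of how the fetch outcome could map to a crawl decision, roughly following the Google guidance linked above (2xx: obey the rules, 4xx: no restrictions, 5xx: back off). `RobotsPolicy` and `policy_from_response` are hypothetical names, not crusty's actual types:

```rust
/// What the crawler should assume about a site after trying to fetch robots.txt.
#[derive(Debug, PartialEq)]
enum RobotsPolicy {
    /// 2xx: parse the body and obey the rules it contains.
    UseRules(String),
    /// 4xx: robots.txt does not exist, crawling is unrestricted.
    AllowAll,
    /// 5xx (or anything unexpected): treat the site as temporarily off limits and retry later.
    TemporarilyDisallow,
}

fn policy_from_response(status: u16, body: Option<String>) -> RobotsPolicy {
    match status {
        200..=299 => RobotsPolicy::UseRules(body.unwrap_or_default()),
        // Per the guidance, a missing robots.txt (4xx) means no crawl restrictions.
        400..=499 => RobotsPolicy::AllowAll,
        // 5xx and everything else: be conservative and back off.
        _ => RobotsPolicy::TemporarilyDisallow,
    }
}

fn main() {
    assert_eq!(policy_from_response(404, None), RobotsPolicy::AllowAll);
    assert_eq!(policy_from_response(503, None), RobotsPolicy::TemporarilyDisallow);
    println!("{:?}", policy_from_response(200, Some("User-agent: *\nDisallow: /tmp".into())));
}
```

(Redirects on robots.txt would need their own handling; the guidance says to follow a limited number of hops before giving up.)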

Right now we download / and /robots.txt in parallel, so external links from / will most likely be added to the queue before robots.txt has been processed (internal links won't, though).
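
A minimal sketch of the ordering this issue asks for, with stubbed downloads (`fetch_robots`, `fetch_index`, and `allowed` are placeholders, not crusty's API): robots.txt is resolved first, and links found on `/` only reach the queue once the rules are known.

```rust
use std::collections::VecDeque;

// Stub: pretend we fetched robots.txt and it disallows /private.
fn fetch_robots(_site: &str) -> (u16, String) {
    (200, "User-agent: *\nDisallow: /private".to_string())
}

// Stub: pretend `/` linked to these internal URLs.
fn fetch_index(_site: &str) -> Vec<String> {
    vec![
        "https://example.com/private/a".to_string(), // disallowed by the rules
        "https://example.com/blog/".to_string(),     // allowed
    ]
}

// Toy rule check; real code would use a proper robots.txt parser and apply
// the rules only to same-host URLs.
fn allowed(rules: &str, url: &str) -> bool {
    !rules
        .lines()
        .filter_map(|l| l.strip_prefix("Disallow: "))
        .any(|prefix| url.contains(prefix.trim()))
}

fn main() {
    let site = "https://example.com";
    let mut queue: VecDeque<String> = VecDeque::new();

    // 1. robots.txt first; only after it resolves do we look at `/` at all.
    let (status, rules) = fetch_robots(site);
    if !(200..300).contains(&status) {
        // 4xx / 5xx handling would go here (see the policy sketch above).
        return;
    }

    // 2. Now fetch `/` and enqueue only what the rules allow. External links
    //    would also be queued at this point, never before.
    for link in fetch_index(site) {
        if allowed(&rules, &link) {
            queue.push_back(link);
        }
    }

    println!("queued: {:?}", queue);
}
```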

let4be commented 3 years ago

We should also probably make sure that we never hit the same task from different IPs. Right now the default concurrency per task is two, and we select the local IP to bind to (if multiple were provided) at random.
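
A minimal sketch of one way to fix that, assuming the crawler holds a list of local addresses it may bind to: derive the bind IP from a hash of the task's domain instead of picking it at random, so every concurrent fetch of the same task uses the same local IP. `bind_ip_for_task` is a hypothetical helper, not crusty's actual code.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::net::IpAddr;

// Panics on an empty list; real code would validate the config up front.
fn bind_ip_for_task(domain: &str, local_ips: &[IpAddr]) -> IpAddr {
    let mut hasher = DefaultHasher::new();
    domain.hash(&mut hasher);
    // Same domain -> same hash -> same index -> same local IP, regardless of
    // which of the concurrent workers handles the request.
    local_ips[(hasher.finish() as usize) % local_ips.len()]
}

fn main() {
    let ips: Vec<IpAddr> = vec!["10.0.0.1".parse().unwrap(), "10.0.0.2".parse().unwrap()];
    let a = bind_ip_for_task("example.com", &ips);
    let b = bind_ip_for_task("example.com", &ips);
    assert_eq!(a, b); // both concurrent fetches of example.com bind to the same IP
    println!("example.com -> {}", a);
}
```

The trade-off is that the per-domain load is no longer spread randomly across the local IPs, but for a broad crawl with many domains the hash should still distribute tasks evenly.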