Closed · let4be closed this 3 years ago
We should probably always try to download robots.txt before we access the index page, and if it resolves with a 4xx or 5xx code we should act accordingly and follow Google's best practices: https://developers.google.com/search/docs/advanced/robots/robots_txt
Right now we download `/` and `/robots.txt` in parallel, and external links from `/` will most likely be added to the queue (not internal ones, though).
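So the fix would be to sequence the two fetches. A sketch of that ordering (all names hypothetical, rule parsing and link extraction elided) where `/` is only requested once the robots.txt policy is known, so its external links never reach the queue on a disallowed site:

```rust
/// Stand-in for the crawler's HTTP fetch; returns (status, body).
fn fetch(_url: &str) -> (u16, String) {
    (200, String::new()) // stubbed out for illustration
}

fn crawl_root(host: &str, queue: &mut Vec<String>) {
    // Step 1: robots.txt is resolved first, never in parallel with `/`.
    let (status, _robots_body) = fetch(&format!("https://{host}/robots.txt"));

    // Step 2: on 5xx, assume full disallow and bail before touching `/` at all.
    if (500..600).contains(&status) {
        return;
    }

    // Step 3: only now fetch `/`. Real code would extract links from the body
    // and filter them against the parsed rules before queuing anything.
    let (_status, _index_html) = fetch(&format!("https://{host}/"));
    queue.push(format!("https://{host}/some-extracted-link")); // placeholder
}

fn main() {
    let mut queue = Vec::new();
    crawl_root("example.com", &mut queue);
    println!("{queue:?}");
}
```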
We should also probably make sure that we never hit the same task from different IPs. Right now the default concurrency per task is two, and we select the local IP to bind to (if multiple were provided) randomly.
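One simple way to enforce that (sketch only; names are made up and this assumes the set of local IPs is static): derive the bind IP from a hash of the task's host, so both concurrent fetches for a task always pick the same address.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::net::IpAddr;

/// Same host -> same hash -> same local IP, regardless of which worker runs it.
fn local_ip_for_task(host: &str, local_ips: &[IpAddr]) -> IpAddr {
    let mut h = DefaultHasher::new();
    host.hash(&mut h);
    local_ips[(h.finish() as usize) % local_ips.len()]
}

fn main() {
    let ips: Vec<IpAddr> = vec!["10.0.0.1".parse().unwrap(), "10.0.0.2".parse().unwrap()];
    let a = local_ip_for_task("example.com", &ips);
    let b = local_ip_for_task("example.com", &ips);
    assert_eq!(a, b); // both concurrent fetches bind to the same local IP
}
```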