nrjones8 / robots-dot-txt-archive-bot

A project to collect, archive, and publish robots.txt files from across the internet - with a focus on government websites
https://robots-dot-txt-db.com/

investigate HTML responses, maybe add retries #10

Open nrjones8 opened 4 years ago

nrjones8 commented 4 years ago

Sometimes hostnames that previously had a valid robots.txt now get "Got an HTML response" written out instead. Maybe we just need to retry? Or maybe the logic for deciding whether a response is a valid robots.txt file isn't good enough.

e.g. https://github.com/nrjones8/robots-dot-txt-archive-bot/blob/8434ddbe9900fc94c914a52b6b55003f604eebea/data/cleaned/dotgov_domains/kittyhawknc.gov
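For discussion, here is a rough sketch of both ideas: a body-based HTML check and a retry wrapper. It is not the bot's current logic. The names `looks_like_html` and `fetch_with_retries`, the marker list, and the backoff values are all made up for illustration, and `fetch` stands in for whatever function actually performs the HTTP request.

```python
import time
from typing import Callable, Tuple

# Hypothetical marker list; tune against real responses in data/cleaned.
HTML_MARKERS = ("<!doctype html", "<html", "<head", "<body")


def looks_like_html(body: str) -> bool:
    """Guess whether a response body is an HTML page rather than robots.txt."""
    # Only inspect the start of the body, after any BOM or leading whitespace,
    # so a tag mentioned deep inside a robots.txt comment does not count.
    head = body.lstrip("\ufeff \t\r\n")[:512].lower()
    return any(marker in head for marker in HTML_MARKERS)


def fetch_with_retries(
    fetch: Callable[[], Tuple[int, str]],
    attempts: int = 3,
    backoff_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call fetch() until it returns a 200 non-HTML body or attempts run out.

    fetch is assumed to return a (status_code, body) pair. The last result is
    returned either way, so the caller can still log what went wrong.
    """
    status, body = fetch()
    for attempt in range(1, attempts):
        if status == 200 and not looks_like_html(body):
            break
        # Exponential backoff: 1s, 2s, 4s, ... with the default settings.
        sleep(backoff_seconds * 2 ** (attempt - 1))
        status, body = fetch()
    return status, body
```

Retrying only helps if the failure is transient. A persistent 403 would return the same error page every time, so it is worth recording the status code alongside the "Got an HTML response" message.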

nrjones8 commented 4 years ago

That particular one got a 403, probably because the request was sent with the default python-requests user agent. Maybe set a custom user agent that points back at this repo?

see https://webmasters.stackexchange.com/questions/6205/what-user-agent-should-i-set
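A minimal sketch of what that could look like. The bot name and version string in `USER_AGENT` are placeholders, not something the project defines, and `build_request` is a hypothetical helper. It uses the standard library's `urllib.request` rather than requests so the example has no dependencies.

```python
import urllib.request

REPO_URL = "https://github.com/nrjones8/robots-dot-txt-archive-bot"

# Hypothetical name/version; the "(+URL)" suffix is a common convention that
# tells site operators where to find out more about the crawler.
USER_AGENT = f"robots-dot-txt-archive-bot/0.1 (+{REPO_URL})"


def build_request(hostname: str) -> urllib.request.Request:
    """Build a robots.txt request that identifies the bot via User-Agent."""
    url = f"https://{hostname}/robots.txt"
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
```

The request can then be sent with `urllib.request.urlopen(build_request("kittyhawknc.gov"), timeout=10)`. With requests, the same header can be passed as `requests.get(url, headers={"User-Agent": USER_AGENT})`. Whether this actually clears the 403 for that host is untested.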