nrjones8 opened this issue 4 years ago (status: Open)
That particular one got a 403, probably because the request came from the default python/requests
user agent. Maybe make a custom user agent that points back at this repo?
see https://webmasters.stackexchange.com/questions/6205/what-user-agent-should-i-set
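A rough sketch of what the custom user agent could look like, in case it helps. The bot name, the `0.1` version, and the `build_headers` / `fetch_robots_txt` helpers are made up for illustration and are not existing code in this repo. The `+URL` part is a common convention for pointing site admins at the crawler's home page.

```python
# Hypothetical bot name and version; the URL is this repo, so a site admin
# who sees the header in their logs can find out who is crawling and why.
REPO_URL = "https://github.com/nrjones8/robots-dot-txt-archive-bot"
USER_AGENT = f"robots-dot-txt-archive-bot/0.1 (+{REPO_URL})"


def build_headers():
    """Return headers that identify the bot instead of the requests default."""
    return {"User-Agent": USER_AGENT}


def fetch_robots_txt(hostname, timeout=10):
    """Fetch https://<hostname>/robots.txt using the custom User-Agent."""
    # Third-party dependency, imported here so build_headers() can be used
    # (and tested) without requests installed.
    import requests

    url = f"https://{hostname}/robots.txt"
    return requests.get(url, headers=build_headers(), timeout=timeout)
```

Whether this actually avoids the 403 depends on how the site filters requests; some block anything that doesn't look like a browser.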
Sometimes hostnames that previously had valid robots.txt files are now writing out "Got an HTML response." Maybe we just need to retry? Or maybe the logic for deciding whether a response is a valid robots.txt file isn't so great.
e.g. https://github.com/nrjones8/robots-dot-txt-archive-bot/blob/8434ddbe9900fc94c914a52b6b55003f604eebea/data/cleaned/dotgov_domains/kittyhawknc.gov
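One possible way to cover both ideas (retrying, and a stricter validity check) is sketched below. This is not the repo's current logic. The helper names, the directive list, and the retry counts are all assumptions, and the heuristic would wrongly reject an empty or comments-only robots.txt, which can be perfectly valid.

```python
import re
import time

# Directives commonly seen in robots.txt files (not an exhaustive list).
DIRECTIVE_RE = re.compile(
    r"^\s*(user-agent|disallow|allow|sitemap|crawl-delay|host)\s*:",
    re.IGNORECASE | re.MULTILINE,
)
# A body that opens with a doctype or <html> tag is treated as an HTML page.
HTML_RE = re.compile(r"^\s*(<!doctype\s+html|<html\b)", re.IGNORECASE)


def looks_like_html(body):
    """True if the body starts like an HTML document (ignoring a leading BOM)."""
    return bool(HTML_RE.match(body.lstrip("\ufeff")))


def looks_like_robots_txt(body):
    """True if the body is not HTML and has at least one known directive line."""
    return not looks_like_html(body) and bool(DIRECTIVE_RE.search(body))


def fetch_with_retry(fetch, attempts=3, delay_seconds=5.0):
    """Call fetch() until the body looks like robots.txt or attempts run out.

    `fetch` is any zero-argument callable returning the response text.
    Returns (last_body, is_valid).
    """
    body = ""
    for attempt in range(attempts):
        body = fetch()
        if looks_like_robots_txt(body):
            return body, True
        if attempt < attempts - 1:
            time.sleep(delay_seconds)  # brief pause before trying again
    return body, False
```

Passing the fetch step in as a callable keeps the retry logic separate from the HTTP client, so it can be exercised with canned responses. If a hostname still returns HTML after every attempt, the last body is returned with `False` so the caller can decide whether to keep the previous snapshot.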