temoto / robotstxt

A robots.txt exclusion protocol implementation for the Go language
MIT License
269 stars · 55 forks

Parser not getting sitemap from robots.txt #33

Closed TheUltimateCookie closed 2 years ago

TheUltimateCookie commented 2 years ago

This URL - https://www.zendesk.com/robots.txt - contains a Sitemap directive, but it is not being collected.

Here's my code

package main

import (
    "log"
    "net/http"
    "github.com/temoto/robotstxt"
)

func main() {
    resp, err := http.Get("https://www.zendesk.com/robots.txt")
    if err != nil {
        log.Fatal(err)
    }
    robots, err := robotstxt.FromResponse(resp)
    resp.Body.Close()
    if err != nil {
        log.Fatal(err)
    }
    log.Printf("Collected Info %+v", robots)
}

Collected Info &{map[] true false []}

Please advise.

temoto commented 2 years ago

@TheUltimateCookie please try logging the response status code and body. Most likely you are hitting a CDN (Cloudflare) captcha. Isn't it ironic that Zendesk thinks you must be human to read robots.txt? :-)

TheUltimateCookie commented 2 years ago

@temoto Oh yes, just noticed that I'm getting a 403 for both their robots.txt and their sitemap. They're only accessible if I visit as a human. Thank you for your time.