Closed by Pinjasaur 3 years ago
Great catch. Thanks for opening the issue.
In fact, nothing under /alternate/ or /feeds/ should appear in search results.
It appears that robots.txt isn’t the way to go; what we want is noindex. I’ll look into this tomorrow.
Excellent point. I would think you would want to use the X-Robots-Tag header to prevent indexing, e.g., set an X-Robots-Tag: noindex, nofollow HTTP header on anything served under /alternate/ and /feeds/.
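To make the header suggestion concrete, here is a minimal sketch of deciding which responses get the X-Robots-Tag header. The names NOINDEX_PREFIXES and robotsHeadersFor, and the example path, are hypothetical and not taken from the project's code; how the returned headers get attached depends on whatever server framework is in use.

```typescript
// Hypothetical list of path prefixes that should stay out of search results.
const NOINDEX_PREFIXES: readonly string[] = ["/alternate/", "/feeds/"];

// Returns the extra response headers for a given request path.
// Paths under a listed prefix get "noindex, nofollow"; everything else gets none.
function robotsHeadersFor(pathname: string): Record<string, string> {
  const hidden = NOINDEX_PREFIXES.some((prefix) => pathname.startsWith(prefix));
  return hidden ? { "X-Robots-Tag": "noindex, nofollow" } : {};
}

// Example path is made up for illustration.
console.log(robotsHeadersFor("/feeds/example.xml"));
console.log(robotsHeadersFor("/"));
```

Keeping the decision in a pure function makes it easy to test separately from the server wiring.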
A bit of searching suggests that this is an acceptable approach for "you're welcome to crawl it but it shouldn't show up in search results." I think you could also do both: have a robots.txt that disallows /feeds/ and /alternate/ and set the aforementioned header on the same routes.
Non-rhetorical question: what's the reasoning for letting /feeds/ and /alternate/ be crawled if everything is going to be responded to with noindex?
Non-rhetorical question: what's the reasoning for letting /feeds/ and /alternate/ be crawled if everything is going to be responded to with noindex?
From the Google documentation on noindex:
Important: For the noindex directive to be effective, the page must not be blocked by a robots.txt file. If the page is blocked by a robots.txt file, the crawler will never see the noindex directive, and the page can still appear in search results, for example if other pages link to it.
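The rule quoted above can be written as a tiny decision function. This is a deliberately simplified model for illustration only (the type and function names are made up, and real crawlers weigh more signals), but it shows why combining a robots.txt block with noindex defeats the purpose.

```typescript
type Outcome = "not indexed" | "may be indexed";

// Simplified model: a crawler only honors noindex if it is allowed to fetch the page.
function indexingOutcome(blockedByRobotsTxt: boolean, servesNoindex: boolean): Outcome {
  // A page blocked by robots.txt is never fetched, so its noindex goes unseen.
  const crawlerSeesNoindex = !blockedByRobotsTxt && servesNoindex;
  return crawlerSeesNoindex ? "not indexed" : "may be indexed";
}

// Crawlable page that serves noindex: kept out of results.
console.log(indexingOutcome(false, true));
// Blocked page that serves noindex: the directive is never read.
console.log(indexingOutcome(true, true));
```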
Howdy @leafac, I stumbled across your project and it looks pretty slick; I'm hoping I can take it for a spin when I get some downtime.
On my first pass I noticed that the /alternate/ route is showing up as indexed by search engines, e.g., https://duckduckgo.com/?q=site%3Akill-the-newsletter.com
My first instinct would be that you would want a minimal /robots.txt like so:
You may also want to disallow /feeds/, but I'm not 100% sure on that one. Thoughts?
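For reference, a minimal robots.txt of the kind described here could be generated as follows. This is an assumption about what such a file might contain, not the snippet from the original post; buildRobotsTxt is a hypothetical helper. Note that, per the Google documentation quoted in this thread, disallowing a path this way prevents crawlers from ever seeing a noindex directive served on it.

```typescript
// Builds a robots.txt body that asks all crawlers to skip the given path prefixes.
function buildRobotsTxt(disallowed: readonly string[]): string {
  const lines = ["User-agent: *", ...disallowed.map((path) => `Disallow: ${path}`)];
  return lines.join("\n") + "\n";
}

// Disallow only /alternate/; add "/feeds/" to the array to cover both routes.
console.log(buildRobotsTxt(["/alternate/"]));
```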