leafac / kill-the-newsletter

Convert email newsletters into Atom feeds
https://kill-the-newsletter.com
MIT License

Don't index `/alternate/` & `/feeds/` #33

Closed: Pinjasaur closed 3 years ago

Pinjasaur commented 3 years ago

Howdy @leafac, I stumbled across your project and it looks pretty slick—hoping I can take it for a spin when I get some downtime.

On my first pass I noticed that the /alternate/ route is showing up as indexed by search engines, e.g., https://duckduckgo.com/?q=site%3Akill-the-newsletter.com

My first instinct is that you'd want a minimal /robots.txt, like so:

```
User-agent: *
Disallow: /alternate/
```

You may want to disallow /feeds/ as well, but I'm not 100% sure on that one. Thoughts?

leafac commented 3 years ago

Great catch. Thanks for opening the issue.

In fact, nothing under /alternate/ or /feeds/ should appear in search results.

It appears that robots.txt isn’t the way to go; what we want is noindex. I’ll look into this tomorrow.

Pinjasaur commented 3 years ago

Excellent point. I would think you'd want to use the `X-Robots-Tag` header to prevent indexing.

E.g., set an `X-Robots-Tag: noindex, nofollow` HTTP header on anything served under /alternate/ and /feeds/.
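Something like this, maybe (a rough sketch, assuming the server is Express; the mount paths are my guess at how the routes are wired up):

```typescript
import express from "express";

const app = express();

// Tell search engines not to index (or follow links from) anything served
// under /alternate/ or /feeds/, while still allowing them to crawl it.
// This middleware should be mounted before the route handlers.
app.use(["/alternate", "/feeds"], (req, res, next) => {
  res.setHeader("X-Robots-Tag", "noindex, nofollow");
  next();
});
```

Mounting it once with app.use keeps the header logic in one place instead of sprinkling it across individual handlers.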

A bit of searching suggests this is an accepted approach for "you're welcome to crawl it, but it shouldn't show up in search results." I think you could also do both: have a robots.txt that disallows /feeds/ and /alternate/ and set the aforementioned header on the same routes.

Non-rhetorical question: what's the reasoning for letting /feeds/ and /alternate/ be crawled if everything is going to respond with noindex?

leafac commented 3 years ago

> Non-rhetorical question: what's the reasoning for letting /feeds/ and /alternate/ be crawled if everything is going to respond with noindex?

From the Google documentation on noindex:

> Important: For the noindex directive to be effective, the page must not be blocked by a robots.txt file. If the page is blocked by a robots.txt file, the crawler will never see the noindex directive, and the page can still appear in search results, for example if other pages link to it.
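So the plan is to leave these routes crawlable (no robots.txt Disallow) and serve the noindex header on them. Once that's deployed, a quick check along these lines should confirm the header is coming through (the feed path below is illustrative, not a real entry):

```typescript
// Sanity check: fetch a page under /alternate/ and confirm the
// X-Robots-Tag header is present. The path is illustrative only.
const response = await fetch(
  "https://kill-the-newsletter.com/alternate/example.html",
);
console.log(response.headers.get("x-robots-tag")); // expected: "noindex, nofollow"
```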