Exclude HN site from search-engine indexing

sveltejs / sites

Monorepo for the sites in the Svelte ecosystem

https://svelte.dev

MIT License


Exclude HN site from search-engine indexing #533

Closed CaptainCodeman closed 10 months ago

CaptainCodeman commented 10 months ago

As mentioned on Discord, having so many non-Svelte results show up when searching for Svelte-related content makes it harder for people to find what they are looking for.

vercel[bot] commented 10 months ago

The latest updates on your projects.

Name	Status	Preview	Comments	Updated (UTC)
hn	✅ Ready (Inspect)	Visit Preview		Aug 18, 2023 3:20pm
repl	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	Aug 18, 2023 3:20pm

benmccann commented 10 months ago

I probably would have implemented this by adding a robots.txt in the static/ directory. It would save a few bytes of traffic for all our real users and also increase our Lighthouse score, which I believe checks whether you have defined a robots.txt. Since this is an example site, it's probably worth being a little extra pedantic in following the practices we'd like to promote.

I'm curious, what searches is hn.svelte.dev showing up for where we'd like it to be excluded?

CaptainCodeman commented 10 months ago

The problem with using robots.txt is that it blocks the search engines from being told to remove entries. For a new site it's the correct approach, but here the spiders need access to the pages in order to learn that the existing entries should be removed (robots.txt could be added once they are gone).

https://developers.google.com/search/docs/crawling-indexing/block-indexing#:~:text=If%20the%20page%20is%20blocked,other%20pages%20link%20to%20it.
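The ordering described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper names `withNoIndex` and `robotsTxt` are hypothetical, and it assumes the noindex signal is sent as an `X-Robots-Tag` response header (a `<meta name="robots">` tag would be an alternative).

```typescript
// Hypothetical helpers for illustration only; not taken from the PR.

/** Header value asking crawlers to drop a page from their index. */
const NOINDEX = 'noindex';

/**
 * Returns a copy of the given response headers with X-Robots-Tag added.
 * Crawlers only see this header if robots.txt still lets them fetch the page.
 */
function withNoIndex(headers: Record<string, string>): Record<string, string> {
  return { ...headers, 'X-Robots-Tag': NOINDEX };
}

/**
 * Builds a robots.txt body. Crawling stays allowed until the already-indexed
 * entries have been removed; only then is everything disallowed.
 */
function robotsTxt(entriesRemoved: boolean): string {
  return entriesRemoved
    ? 'User-agent: *\nDisallow: /\n' // block all crawling
    : 'User-agent: *\nDisallow:\n'; // empty Disallow permits all crawling
}

// Phase 1: keep pages crawlable and mark every response as noindex.
console.log(withNoIndex({ 'content-type': 'text/html' }));
console.log(robotsTxt(false));
```

In a SvelteKit app, something like `withNoIndex` would presumably be applied to each response from a server `handle` hook; that wiring is an assumption here and is left out so the sketch stays self-contained.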

The example given was "svelte fuzzing", which for me also brings up a page from the hn site that is completely unrelated to Svelte.

Looks like there are about 8,250 pages indexed from it.

benmccann commented 10 months ago

> The problem with using robots.txt is that it blocks the search engines from being told to remove entries.

wtf! Well, that explains why I haven't gotten svelte.dev/tutorial out of the search results yet (in favor of learn.svelte.dev). I just requested its removal in the Google Search Console, but I suppose I'd need to do the same for Bing, etc., so this is probably still the better solution in that case, since it handles the other search engines too.