superseriousbusiness / gotosocial

Fast, fun, small ActivityPub server.
https://docs.gotosocial.org
GNU Affero General Public License v3.0

[feature] Make `robots.txt` and `noindex` customizable #776

Open tsmethurst opened 2 years ago

tsmethurst commented 2 years ago

Right now, all GtS instances serve a simple hardcoded robots.txt that disallows all crawling:

User-agent: *
Disallow: /

The code for this is here: https://github.com/superseriousbusiness/gotosocial/blob/main/internal/api/security/robots.go

There are a couple of problems with this, though.

Firstly, this isn't actually enough to prevent sites from appearing in Google results; it just means that Google shows no information for that site. For example:

[Image: Screenshot from 2022-08-29 12-44-32]

Here, Google still has the site indexed; it just hasn't crawled the page to gather information, which leads to this 'stub' search result entry that is not particularly useful.

Secondly, some users and instances might actually want their profile or instance to be indexed by search engines, and by hardcoding this blanket rejection robots.txt, we're not allowing them that option.

Instead of serving this hardcoded robots.txt, we should allow instance admins and users to choose whether their stuff is indexable (and retain 'no indexing' as the default).

To do this, we should use targeted noindex meta tags instead: https://developers.google.com/search/docs/advanced/crawling/block-indexing. For users, we can use the 'discoverable' field of their account to decide whether or not to inject this tag in web views of their pages and statuses.

For instance pages, we'll have to think of something else.

tsmethurst commented 2 years ago

Partially resolved by https://github.com/superseriousbusiness/gotosocial/pull/842, but we still need a way for instance admins to set discoverable on the instance as a whole. Perhaps using the Discoverable field of the instance account?