matrix-org / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://matrix-org.github.io/synapse
Apache License 2.0

Link preview should respect robots.txt #3242

Open 26000 opened 6 years ago

26000 commented 6 years ago

Description

When we send a link in Matrix, the client can use Synapse's integrated prefetcher to fetch the link preview. However, the prefetcher doesn't check whether the website allows bots at that URL and crawls the page regardless.

We need to have a User-agent for Synapse and to parse the robots.txt at the root of the domain the user wants to preview. If access for Synapse is denied, do not visit the URL.
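A minimal sketch of what that check might look like, using the standard library's `urllib.robotparser`. This is not Synapse's actual code: the `USER_AGENT` token and the helper names `robots_url_for` and `may_preview` are made up for illustration, and fetching robots.txt itself (with caching, timeouts and the existing URL blacklist) is left out.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical user-agent token; the real value would be whatever Synapse
# decides to send in its User-Agent header.
USER_AGENT = "Synapse"


def robots_url_for(url: str) -> str:
    """Return the robots.txt location at the root of the URL's domain."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


def may_preview(url: str, robots_txt: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if the given robots.txt body lets user_agent fetch url.

    robots_txt is the already-downloaded body of the file found at
    robots_url_for(url); downloading it is out of scope for this sketch.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    example_rules = "User-agent: Synapse\nDisallow: /private/\n"
    print(robots_url_for("https://example.com/private/page"))
    print(may_preview("https://example.com/private/page", example_rules))
    print(may_preview("https://example.com/blog/post", example_rules))
```

The preview code would call something like `may_preview` before requesting the page and skip the preview when it returns `False`. What to do when robots.txt is missing or unreachable is a separate policy decision that this sketch does not settle.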

richvdh commented 6 years ago

I really thought we had this issue already, but I can't find it...

ptman commented 6 years ago

Probably just this TODO: https://github.com/matrix-org/synapse/blob/aafb0f6b0d7db313ac54a8e5e933970feae4bff3/synapse/rest/media/v1/preview_url_resource.py#L281