matrix-org / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://matrix-org.github.io/synapse
Apache License 2.0

redacted #4565

Closed ghost closed 2 years ago

ghost commented 5 years ago

redacted

ara4n commented 5 years ago

As you can see from the timeline here, this issue is very much on our radar, although we haven't reported back on which of the points raised here we've since fixed (oops).

> All joins and leaves in every room are stored, these entries consist of user_id, access_token, device_id, ip, user_agent, last_seen, timestamp.

We no longer store access_token, device_id, user_agent, last_seen for users (unless that user's session is still active), as of https://github.com/matrix-org/synapse/pull/6098.

It's inevitable that we track the user_id and timestamp for when users join/leave rooms in order for the room history to actually function. However, MSC1228 will help us obfuscate the user_id, and it is coming shortly.

> Logs contain vast amount of information.

We have always gone to great lengths to avoid logging any sensitive data (e.g. message contents, secrets, key data etc) in logs. However, log lines do include user IDs and room IDs required to trace problems. Synapse doesn't run in a log minimisation configuration by default because it's still not stable enough to run unattended by itself, flying blind. We need the logs to help people out when things break. As soon as we hit a sufficient level of stability we'll change the default log level for sure (and we are headed in that direction).

> Remove anything that isn't absolutely necessary from them and either implement a user-friendly mechanism (or documentation) to manage them, purge them automatically after a short period of time (e.g. 7 days) or don't store them at all.

Synapse doesn't dictate how you store your logs or what retention scheme you apply. Each packaging of Synapse does it differently (systemd, Python logging, docker logs, etc.), and it's up to the sysadmin to specify the log rotation and retention policy. They can also switch the log level to WARN if they want, which hides all PII.
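To illustrate the kind of policy a sysadmin could apply, here is a minimal sketch using only Python's standard `logging` module. It is not Synapse's shipped log configuration: the function name `build_log_config`, the file path and the 7-day figure are assumptions chosen for the example. It raises the root level to WARNING and keeps one log file per day for a fixed number of days.

```python
import logging
import logging.config


def build_log_config(log_path, keep_days=7):
    """Return a dictConfig-style dict (hypothetical helper, not part of Synapse).

    Only WARNING and above are recorded, the file is rotated at midnight,
    and at most `keep_days` rotated files are kept before old ones are deleted.
    """
    return {
        "version": 1,
        "disable_existing_loggers": False,
        "formatters": {
            "brief": {"format": "%(asctime)s %(levelname)s %(name)s: %(message)s"},
        },
        "handlers": {
            "file": {
                # Standard-library handler that rotates on a time schedule.
                "class": "logging.handlers.TimedRotatingFileHandler",
                "filename": log_path,
                "when": "midnight",
                "backupCount": keep_days,
                "encoding": "utf8",
                "formatter": "brief",
            },
        },
        # INFO and DEBUG lines are dropped at the root logger.
        "root": {"level": "WARNING", "handlers": ["file"]},
    }
```

Calling `logging.config.dictConfig(build_log_config("homeserver.log"))` would apply it; deployments that log through systemd or Docker would set retention in those tools instead.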

> Other things like redacted and deleted events, accounts, sent files.

Redacted/deleted events now get pruned after N days as of https://github.com/matrix-org/synapse/pull/5934. Deleting files referenced by redacted events is harder, but we're working on it.
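The general idea of time-based pruning can be sketched as follows. This is an illustration only and does not reproduce the code in the linked pull request: `StoredEvent`, `prune_redacted` and the in-memory list are hypothetical stand-ins for whatever the database layer actually does.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class StoredEvent:
    """Hypothetical stored event; not Synapse's real schema."""
    event_id: str
    content: Optional[dict]
    redacted_at: Optional[datetime] = None  # None means never redacted


def prune_redacted(events: List[StoredEvent], retention: timedelta, now: datetime) -> int:
    """Drop the content of events redacted more than `retention` ago.

    Returns the number of events whose content was removed on this pass.
    """
    cutoff = now - retention
    pruned = 0
    for event in events:
        if (
            event.redacted_at is not None
            and event.redacted_at <= cutoff
            and event.content is not None
        ):
            event.content = None  # keep the event row, discard the payload
            pruned += 1
    return pruned
```

Keeping the event record while discarding its content reflects the point made earlier in the thread that join/leave history needs the user_id and timestamp to function; only the payload is treated as disposable here.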

3nprob commented 2 years ago

@ghost Why closed? Logs still seem relevant.

ghost commented 1 year ago

Hello. Could somebody reopen the issue, maybe @ara4n? Sorry for the ping in advance, but I still believe this issue is relevant today. On the other hand, is this a meta issue tracking the problem across every component (the Matrix spec, Synapse, Element...) or just the Synapse part of it? I'm asking because I don't think we have a tracking list for this, and since this is a complex issue, maybe we should. I can make a list and post it here if you want.

3nprob commented 1 year ago

@NebulaOnion I think you can feel free to reopen this as a new issue (rather than yak-shaving it in a thread here).

FWIW, if you want to reuse it, here is the penultimate version of this issue:

Currently synapse (and AFAIK the whole Matrix ecosystem) doesn't attempt to minimize metadata gathering in any way. This is one of its biggest issues in terms of security and privacy. It means Matrix is not a sensible option for people who care about these values: they have to choose between privacy/security and a decentralized, modern FOSS protocol, and I think the latter is significantly less important. In the next few weeks Matrix should get to the state where there's bandwidth available to get these basic things right, and only then work on less important things like new features, app rewrites and dendrite. I think it's a good strategy to first make the base robust and only then move further.

Incomplete list of unnecessary data gathered by synapse:

- Database stores unnecessary information. All joins and leaves in every room are stored; these entries consist of user_id, access_token, device_id, ip, user_agent, last_seen, timestamp. There's most likely more. These should be truncated to contain only information that is truly necessary, and they shouldn't be stored longer than necessary.
- Logs contain a vast amount of information. Remove anything that isn't absolutely necessary from them and either implement a user-friendly mechanism (or documentation) to manage them, purge them automatically after a short period of time (e.g. 7 days) or don't store them at all. Logs in production releases of synapse shouldn't contain debugging information, but only information required for security reasons, e.g. for an audit after a breach, with guidance in the documentation on how to secure this data while minimizing metadata retention.
- Other things like redacted and deleted events, accounts, sent files.

I didn't investigate this thoroughly and there's likely more; if you know of anything else, don't forget to share it in the comments.

Since synapse requires other services for operation, like a reverse proxy, coturn and postgres (I'm not sure if python or anything else logs anything), these should also be dealt with, either by removing these dependencies or by writing good documentation, together with tools, that would let even a person without an infosec or sysadmin background set everything up easily, properly and quickly using only that documentation. This is particularly important as Matrix aims to have a well balanced ecosystem of smaller servers, avoiding the common problem of federation.

Users should be sufficiently and visibly informed in the documentation of anything that is stored and of the available options to modify this behavior, e.g. log removal and how it should be done.

Like Arathorn mentioned, parts of that are no longer relevant.