matrix-org / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://matrix-org.github.io/synapse
Apache License 2.0

Remove the need for `worker_name` to simplify scaling #8084

Open ulope opened 4 years ago

ulope commented 4 years ago

Description:

The recent-ish changes to workers now recommend setting a unique `worker_name` per worker process. This makes scaling the number of workers quite a bit more complex than before, since every instance now needs a tailor-made config file.

This is especially annoying when using things like Docker Compose, Swarm, or Kubernetes, where spinning up multiple (identical) instances of a service is a built-in feature.

I have no concrete proposal on how to solve the problem(s) the worker name solves though. (AFAICS they're only really necessary for reverse mapping for federation senders and stream writers?)
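
To make the pain concrete, the workaround this currently forces is a templating step that stamps out one config per replica. A minimal sketch (the paths, replica count, and naming scheme are arbitrary; only `worker_app` and `worker_name` are real Synapse worker options):

```python
# Sketch: generate per-replica worker configs that differ only in worker_name.
# Paths, replica count, and naming scheme are arbitrary for illustration.
NUM_WORKERS = 4

TEMPLATE = """\
worker_app: synapse.app.generic_worker
worker_name: generic_worker{index}
"""

for i in range(1, NUM_WORKERS + 1):
    with open(f"worker{i}.yaml", "w") as f:
        f.write(TEMPLATE.format(index=i))
```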

ananace commented 4 years ago

Coming from the Kubernetes angle, there's a common method used there to solve this exact issue: the internal cluster DNS generates multi-value records that let you look up all the active pods behind a given service through A and PTR lookups.
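
For illustration, that discovery mechanism is just a multi-value DNS answer, so any worker could in principle enumerate its peers with a plain lookup against a headless Service (a sketch; the service and namespace names here are hypothetical):

```python
import socket

# Sketch: enumerate the pods behind a Kubernetes headless Service by resolving
# its cluster DNS name. Service/namespace names are hypothetical.
SERVICE_DNS = "synapse-workers.synapse.svc.cluster.local"

addrs = {
    info[4][0]
    for info in socket.getaddrinfo(SERVICE_DNS, None, proto=socket.IPPROTO_TCP)
}
print(sorted(addrs))  # one A record per active pod
```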

Though considering Redis can now be used for replication, perhaps that could be a better solution, albeit one that would make Redis much more of a hard dependency for Synapse when running with workers.

ghost commented 3 years ago

A possible solution for Kubernetes would be to use StatefulSets + Environment Variables

Then we need to change this line in the source code
https://github.com/matrix-org/synapse/blob/develop/synapse/config/workers.py#L124

~~`self.worker_name = config.get("worker_name", self.worker_app)`~~

`self.worker_name = environ.get("HOSTNAME", "i_have_no_env")` (with `from os import environ` added to the module's imports)

P.S. This is only meant as a temporary workaround for the current situation.
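
For reference, a slightly fuller sketch of that fallback (with the missing import, and keeping `worker_app` as the last resort; purely an illustration of the workaround, not a proposed patch):

```python
from os import environ

# Sketch of the fallback inside the worker config's read_config: prefer an
# explicit worker_name, then the hostname injected by the environment (a
# StatefulSet pod's HOSTNAME is stable), then worker_app as a last resort.
self.worker_name = config.get("worker_name") or environ.get(
    "HOSTNAME", self.worker_app
)
```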

ananace commented 3 years ago

The issue with using a StatefulSet to get stable names is that StatefulSets are designed for applications whose runtime state is extremely important - things like databases.
Using them for stateless applications like the workers means the cluster can't schedule them as easily, and in fact makes them much more fragile, as the cluster will refuse to evict or replace them when necessary - since that would cause them to lose their "state".

I already see far too many stateless things that run as StatefulSets simply to get stable names.

ulope commented 3 years ago

This also won't help with other solutions like Compose, Swarm, etc.

ghost commented 3 years ago

> AFAICS they're only really necessary for reverse mapping for federation senders and stream writers?

Could someone add more detail to the documentation to clarify this point?

ananace commented 3 years ago

I still feel - even more strongly now - that using Redis for this is the right way to go; it offers a good channel for querying all running workers.
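
To sketch the kind of registration/discovery pattern that enables (purely illustrative - this is not how Synapse's Redis replication works today, and the key names and TTL are invented): each worker periodically refreshes a short-lived key, and anything needing the full worker list just scans for those keys.

```python
import os
import time

import redis  # assumes the redis-py client

r = redis.Redis(host="localhost", port=6379)

# Each worker refreshes a short-lived registration key (naming and TTL invented).
worker_name = os.environ.get("HOSTNAME", "worker-unknown")
r.set(f"synapse:workers:{worker_name}", int(time.time()), ex=30)

# Anything that needs the current worker list scans for the live keys.
active = [key.decode().rsplit(":", 1)[-1] for key in r.scan_iter("synapse:workers:*")]
print(active)
```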