supabase / realtime

Broadcast, Presence, and Postgres Changes via WebSockets
https://supabase.com/realtime
Apache License 2.0

`env.sh.eex` is preventing the ability to run multiple self-hosted realtime instances in a cluster #1075

Open Towerful opened 2 weeks ago

Towerful commented 2 weeks ago

Bug report

Describe the bug

Currently, https://github.com/supabase/realtime/blob/cd04f2f744834296b5a4b3e360e95c3fab5f9165/rel/env.sh.eex prevents running the Postgres (or any other) cluster strategy.

None of the script's specific cases can be met/configured on a self-hosted instance. It is difficult (or impossible, or at best fragile) to get the `ip` variable to actually be set, so it falls back to 127.0.0.1.
This produces cluster attempt logs such as `SYN[realtime@127.0.0.1]`, and the cluster strategy breaks.

To Reproduce

I'm moving a lot of this over to a local k8s cluster, so these reproduction steps may not be as clear as they should be.

I think the supabase docker compose file could be tweaked with `CLUSTER_STRATEGIES=POSTGRES` to try to get the cluster strategy to work.
The realtime config will have to be duplicated to run 2 instances; because the realtime containers with the broken cluster strategy will fight over the same replication slot, `SLOT_NAME_SUFFIX` will need to be unique to each container (see the sketch below).
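
For reference, a rough sketch of what that reproduction could look like using plain `docker run` instead of the compose file. The container names, the `realtime.env` file (standing in for whatever realtime configuration the compose file already provides), the image tag and the `SLOT_NAME_SUFFIX` values are all placeholders, not the exact supabase setup:

```bash
# Two self-hosted realtime containers pointed at the same database, each with
# its own replication slot suffix so they do not fight over one slot.
# realtime.env is a placeholder for the existing realtime environment config.

docker run -d --name realtime-1 \
  --env-file realtime.env \
  -e CLUSTER_STRATEGIES=POSTGRES \
  -e SLOT_NAME_SUFFIX=node1 \
  supabase/realtime:latest

docker run -d --name realtime-2 \
  --env-file realtime.env \
  -e CLUSTER_STRATEGIES=POSTGRES \
  -e SLOT_NAME_SUFFIX=node2 \
  supabase/realtime:latest
```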

Both containers will connect to their respective replication slots, and will handle postgres realtime updates fine.
However, broadcast between the instances will not work (broadcast will only work within an instance).
I have no idea how to direct traffic between the 2 instances here (previously I have used an external HAProxy; k8s handles that automatically as a service).

This is because the first step of env.sh.eex is to try to extract the instance's IP address from /etc/hosts. If the file doesn't exactly match the fly.io setup, the lookup falls through to an empty string.
Later on, since `ip` is an empty string and no other conditions are met, it defaults to 127.0.0.1.
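
A stripped-down sketch of that fallback pattern (not the actual contents of env.sh.eex; the grep pattern here is only a stand-in for the fly.io-specific lookup):

```bash
#!/bin/sh
# Illustrative sketch of the fallback being described; the real env.sh.eex
# differs in its details.
ip=$(grep "fly-specific-hostname" /etc/hosts | awk '{ print $1 }' | head -n 1)

if [ -z "$ip" ]; then
  # On a self-hosted box nothing matches, so every instance ends up named
  # realtime@127.0.0.1 and the nodes can never see each other.
  ip="127.0.0.1"
fi

export RELEASE_DISTRIBUTION=name
export RELEASE_NODE="realtime@${ip}"
```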

Expected behavior

A way to set `ip`, `RELEASE_DISTRIBUTION` and `RELEASE_NODE` manually, allowing more advanced self-hosters to configure the clustering themselves. Perhaps some additional logging around this would also be helpful.
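
For illustration, a sketch of what that could look like, assuming the script simply respects values that are already present in the environment; this is a suggestion, not the project's current behaviour:

```bash
# Suggested behaviour (sketch): only derive the node name when the operator
# has not already set it, and log the fallback so it is visible when
# self-hosting. The grep pattern is again a stand-in for the fly.io lookup.
if [ -z "${RELEASE_NODE:-}" ]; then
  ip=$(grep "fly-specific-hostname" /etc/hosts | awk '{ print $1 }' | head -n 1)
  if [ -z "$ip" ]; then
    echo "env.sh: could not determine node IP, falling back to 127.0.0.1" >&2
    ip="127.0.0.1"
  fi
  export RELEASE_NODE="realtime@${ip}"
fi
export RELEASE_DISTRIBUTION="${RELEASE_DISTRIBUTION:-name}"
```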

Additional Context

I commented on issue https://github.com/supabase/realtime/issues/760 (specifically https://github.com/supabase/realtime/issues/760#issuecomment-2105893864) regarding this, with a fix that is working for me.

That comment includes logs of the cluster strategy working between multiple instances, with lots of lines like `Node testing@10.244.1.101 has joined the cluster, sending discover message`, which are completely absent when the script fails to configure an IP and falls back to 127.0.0.1.

I have rebuilt the image "internally" with this new env.sh.eex and have been using and testing it. There is now only 1 replication slot in use, and broadcast between instances appears to be working correctly.
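
Roughly, that rebuild amounts to something like the following; the fork URL and tag are placeholders, and it assumes the repository's own Dockerfile is used to build the release:

```bash
# Hypothetical sketch of building a patched image from a fork that carries the
# modified rel/env.sh.eex; names and tags below are placeholders.
git clone https://github.com/YOUR_FORK/realtime.git
cd realtime
# ...apply the env.sh.eex changes described in the comment on issue 760...
docker build -t realtime-patched:local .
```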

I'm not great with bash and I can't test this within your environment. I also suck at GH pull requests, etc., so I'll let you form the final fix for this :)

Again, sorry for the poor bug report, but hopefully it is enough

filipecabaco commented 1 week ago

Hi @Towerful, thank you for reporting.

Good catch! I will try to clean this up to improve our self-hosting.

Do you want to open the PR so we can work together to fix the issue?

Towerful commented 1 week ago

I presume I would have to fork, update my fork, then open a PR from that?
Like I said, I'm not great with git/GitHub, but I can look into it. I'm more than happy for you to do it if it's easier; I'm not bothered about attribution, etc.

filipecabaco commented 1 week ago

Ok then, I can tackle it 👍 I will ping you so you can check the PR as well.