grafana / alloy

OpenTelemetry Collector distribution with programmable pipelines
https://grafana.com/oss/alloy
Apache License 2.0

Alloy 1.3.0 cluster mode fails to start with new cluster on non-Kubernetes platform #1441

Closed tonyswu closed 2 months ago

tonyswu commented 3 months ago

What's wrong?

We are running Alloy on AWS ECS, and since 1.3.0 Alloy has failed to start with --cluster.enabled=true when it's a brand new cluster and service discovery record doesn't yet exist:

fatal error: failed to get peers to join at startup - this is likely a configuration error" service=cluster err="static peer discovery: failed to find any valid join addresses

This is the result of this change (https://github.com/grafana/alloy/commit/e4aabe31b4862453bb7e8f2df590b58d0be912be). The only workaround, setting publishNotReadyAddresses: true as introduced in https://github.com/grafana/alloy/pull/1423, applies to Kubernetes only. This effectively renders cluster mode unusable on any other container platform.

Steps to reproduce

Deploy Alloy cluster on any container platform other than Kubernetes with service discovery and --cluster.enabled=true.

System information

AWS ECS, Amazon Linux 2023

Software version

Grafana Alloy 1.3.0

Configuration

[
        "run",
        "--disable-reporting=true",
        "--cluster.enabled=true",
        "--cluster.join-addresses=<SERVICE_DISCOVERY_RECORD>",
        "--cluster.max-join-peers=0",
        "--cluster.name=<CLUSTER_ID>",
        "--server.http.listen-addr=0.0.0.0:<PORT>",
        "--storage.path=/data/alloy",
        "/etc/config.alloy"
]

Logs

ts=2024-08-07T20:02:27.745041327Z level=error msg="fatal error: failed to get peers to join at startup - this is likely a configuration error" service=cluster err="static peer discovery: failed to find any valid join addresses: failed to extract host and port: address alloy-cluster.services.internal: missing port in address\nfailed to resolve SRV records: lookup alloy-cluster.services.internal on 10.104.96.2:53: no such host"
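The error above shows static peer discovery trying two forms of the join address: a literal host:port pair (which fails because the name carries no port) and an SRV lookup (which fails because the record does not exist yet for a brand new cluster). As a minimal sketch for checking what the discovery name actually resolves to from inside the task - the record name is copied from the log above, and dig is assumed to be available in the image:

# Does the name publish SRV records (host and port from service discovery)?
$ dig +short SRV alloy-cluster.services.internal

# Or only A records, in which case a port would presumably need to be given
# explicitly in --cluster.join-addresses (host:port form, as in the seed-node
# example further down this thread)?
$ dig +short A alloy-cluster.services.internal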
thampiotr commented 3 months ago

Hi @tonyswu, thanks for raising this. I'm sorry that you ran into issues.

When you provide --cluster.join-addresses, we expect that there is an existing cluster to join. Previously, when an instance failed to join that cluster, it would create its own new cluster. Now, with the recent changes, it instead fails hard with the error you see. We made this change because some users ran into an issue where cluster peer discovery fails during a rollout and their cluster of N instances becomes N clusters of 1 instance.

You can bootstrap a new cluster following the recommendation from this doc page:

The first node that is used to bootstrap a new cluster (also known as the “seed node”) can either omit the flags that specify peers to join or can try to connect to itself.

For example, I can start a seed node locally and instruct it to join itself with:

$ alloy run config.alloy --server.http.listen-addr 127.0.0.1:8001 --cluster.enabled --cluster.node-name instance1 --cluster.advertise-address 127.0.0.1:8001 --cluster.join-addresses 127.0.0.1:8001

And then I can start more peers and tell them to join the cluster we have bootstrapped above, using any discovery mechanism as long as the seed node can be correctly discovered.
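For example (this second command is a sketch that follows the same pattern as the seed-node command above; the port 8002 and the node name instance2 are only illustrative), a second local peer that joins the seed node could be started with:

$ alloy run config.alloy --server.http.listen-addr 127.0.0.1:8002 --cluster.enabled --cluster.node-name instance2 --cluster.advertise-address 127.0.0.1:8002 --cluster.join-addresses 127.0.0.1:8001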

Having said that, I can see how this recent change in behaviour creates difficulties when starting a new cluster, and I'm open to suggestions. Ideally we would like to avoid the situation where a misconfiguration leads to a split-brain scenario, as that has caused incidents for other users in the past.

tonyswu commented 3 months ago

@thampiotr I did try the seed node approach, but from a deployment standpoint it's rather finicky, since you want to make sure the seed node is terminated afterwards. We ended up doing that step manually, which is not ideal.

I see the merit of the recent change. As a workaround, I would suggest adding some sort of timeout (configurable, with a default of 0s if needed), so there is an option to tell the container to wait for service discovery to become ready. We run on ECS, and a timeout of, say, 10 or 20 seconds would solve this for us.
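Until something like that exists in Alloy itself, one external sketch of the same idea is an entrypoint wrapper that polls the discovery record before handing off to Alloy. None of this is an Alloy feature: the record name, retry budget, and trailing flags are placeholders, and whether the record becomes resolvable before the first task is marked healthy depends on how the ECS service discovery health check is configured.

#!/bin/sh
# Wait up to ~20 seconds for the service discovery record to resolve.
for i in $(seq 1 20); do
  getent hosts alloy-cluster.services.internal >/dev/null 2>&1 && break
  sleep 1
done
# Hand off to Alloy with whatever flags the task definition passes in.
exec /bin/alloy run --cluster.enabled=true "$@"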

db-wally007 commented 2 months ago

I think this change essentially breaks automatic cluster deployment (in our case using Ansible on bare metal) on anything but Kubernetes, short of some ugly workarounds.

I don't know (or care) which host is the seed node, as Ansible can run on any host in any order.

Please revert this change, as we will have to stay on 1.2.0 for the foreseeable future or move away from the cluster functionality as the complexity grows - and that's just not worth the effort of maintaining for a monitoring agent.

Also, why was this breaking change not in the release notes? I spent hours on this terrible change to fundamental functionality :((

thampiotr commented 2 months ago

@db-wally007

Also, why was this breaking change not in the release notes?

It is in the release notes here.

I spent hours on this terrible change to fundamental functionality :((

Sorry that it took you a long time and that you found the change I made terrible. You can see the reasons for this change explained above and on the PR. We made it as a consequence of real outages users have experienced, where their cluster ended up in a split brain due to cluster peer discovery issues. We also spent hours investigating split-brain issues caused by the absence of this fail-fast mechanism.

We are considering further changes to alleviate the problems mentioned on this issue, including your problem with Ansible. Please be patient and take into consideration the trade-offs here, given the risk of creating a split cluster and outages.

db-wally007 commented 2 months ago

1.3.1 was supposed to bring back the pre-1.3.0 behavior - but I'm finding this is not the case.

ts=2024-08-27T10:52:51.22077143Z level=warn msg="failed to connect to peers; bootstrapping a new cluster" service=cluster err="failed to join memberlist: 1 error occurred:\n\t* Failed to join 192.168.123.240:8045: Post \"http://192.168.123.240:8045/api/v1/ckit/transport/stream\": dial tcp 192.168.123.240:8045: connect: connection refused\n\n"

[root@grafana-alloy /]# alloy -v
alloy, version v1.3.1 (branch: HEAD, revision: e4979b2a2)
  build user:       root@87f64efb4e22
  build date:       2024-08-23T16:01:14Z
  go version:       go1.22.5
  platform:         linux/amd64
  tags:             netgo,builtinassets,promtail_journal_enabled
[root@grafana-alloy /]#

/bin/alloy run --disable-reporting=true --server.http.listen-addr=0.0.0.0:8045 --storage.path=/var/lib/alloy --cluster.enabled=true --cluster.rejoin-interval=10s --cluster.advertise-address=\"192.168.123.240:8045\" --cluster.name=podman_group --cluster.join-addresses=\"192.168.123.240, 192.168.123.186\" /etc/grafana-alloy/