grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/
GNU Affero General Public License v3.0

Proposal: deprecate consul and etcd backend storage for the hash ring #2949

Open pracucci opened 1 year ago

pracucci commented 1 year ago

Mimir runs the hash ring on memberlist by default. Using memberlist is the easiest way and, at Grafana Labs, we're not aware of any open issue (bug) using it at any scale.

In addition to memberlist, Mimir also supports consul and etcd as backend storage for the hash ring. Consul and etcd offer a centralised way to store the hash ring, at the cost of requiring an external dependency and having some known scalability issues.

What's the sentiment if we deprecate consul and etcd backend storage for the hash ring? Please report if you're using it and can't migrate to memberlist for any reason.

This proposal does not affect the backends used for the HA tracker.

bboreham commented 1 year ago

I think it would be cleaner to get memberlist working for HA first, then deprecate all external KVs.

(note I did not say "easier")

pracucci commented 1 year ago

> I think it would be cleaner to get memberlist working for HA first, then deprecate all external KVs.

Ideally yes. The reason why I would like to officially deprecate it is to signal to the community to move to memberlist (or pick memberlist as new adopters). For this specific deprecation, we could also offer a longer deprecation period (not just two releases, which would be quite tight).

tristian-dodd commented 1 year ago

I'm using Nomad instead of K8s for my Mimir deployment. I was encountering issues with memberlist communicating through the consul service mesh. I switched to using consul and haven't encountered any issues with it. I would prefer for consul to still be supported.

wilfriedroset commented 1 year ago

I'm using consul as it is a central piece of my infrastructure and of my automation before spawning Mimir. I've had no issues with it.

lemon-li commented 1 year ago

If K8s is not used and the network mode is host, how should the memberlist join parameter be configured? Does each component need to be written in the join list? If I use consul or etcd, I don't need to configure the endpoints of these components.

pracucci commented 1 year ago

Very good feedback, community. Please keep commenting with your use cases that are not well covered by memberlist. We won't deprecate consul/etcd unless all your use cases are supported by memberlist.

> If K8s is not used and the network mode is host, how should the memberlist join parameter be configured? Does each component need to be written in the join list?

It should be configured with a subset of the nodes in your cluster (IP and port). That subset will be your seed nodes. Whether this set can be assumed to be static (e.g. static IPs) depends on your actual infrastructure. An alternative would be having a DNS entry resolving to all your Mimir replicas and then configuring that DNS entry as the seed nodes, but whether this is feasible depends on the actual infrastructure.
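
For illustration, a minimal sketch of such a configuration (IPs and the hostname are hypothetical; the `dns+` prefix follows Mimir's DNS-based service discovery notation):

```yaml
# Every Mimir replica points at the same small set of seed nodes.
memberlist:
  join_members:
    - 10.0.0.10:7946   # seed node 1 (hypothetical static IP)
    - 10.0.0.11:7946   # seed node 2 (hypothetical static IP)
    # Alternative: a single DNS entry resolving to all replicas:
    # - dns+mimir-gossip.example.com:7946
```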

wilfriedroset commented 1 year ago

> An alternative would be having a DNS entry resolving to all your Mimir replicas and then configuring that DNS entry as the seed nodes, but whether this is feasible depends on the actual infrastructure.

This is what I'm doing with consul.

```json
{
    "service": {
        "address": "[redacted]",
        "checks": [
            {
                "http": "https://[redacted]:9009/ready",
                "interval": "5s",
                "tls_skip_verify": true
            }
        ],
        "enable_tag_override": false,
        "id": "mimir",
        "name": "mimir",
        "port": 9009,
        "tags": [
            "ingester"
        ]
    }
}
```

All hosts running Mimir are registered in consul service discovery with one or more tags. This results in a DNS entry like mimir.service.consul or ingester.mimir.service.consul.
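
A sketch of how such an entry could feed memberlist, assuming the Mimir hosts can resolve Consul DNS and that Mimir's `dnssrv+` service-discovery notation accepts the Consul service name:

```yaml
memberlist:
  join_members:
    # An SRV lookup against Consul DNS returns host AND port per replica,
    # so no static endpoint list needs to be maintained.
    - dnssrv+mimir.service.consul
```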

lemon-li commented 1 year ago

> The subset will be your seed nodes.

If running on the host network, do I need to resolve the port conflict on 7946 if there are multiple services deployed on the same host?

pracucci commented 1 year ago

> If running on the host network, do I need to resolve the port conflict on 7946 if there are multiple services deployed on the same host?

Yes, you would have to use a different port for each replica, but the same applies to the gRPC and HTTP ports.
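
A sketch of the per-replica override, using Mimir's memberlist `bind_port`/`advertise_port` settings (the port value is illustrative):

```yaml
# Second Mimir replica on the same host: move the gossip port off the
# default 7946, mirroring what you already do for the gRPC and HTTP ports.
memberlist:
  bind_port: 7947
  advertise_port: 7947
```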

lemon-li commented 1 year ago

> If running on the host network, do I need to resolve the port conflict on 7946 if there are multiple services deployed on the same host?
>
> Yes, you would have to use a different port for each replica, but the same applies to the gRPC and HTTP ports.

My cluster supports dynamic ports, but my memberlist configuration would be complicated, right? For example:

host_1 runs compactor-1 and querier-1
host_2 runs query-frontend-1 and querier-2

```yaml
memberlist:
  join_members:
    - compactor-1:7946
    - querier-1:7947
    - query-frontend-1:7946
    - querier-2:7947
    # ...
```

If I scale out replicas, I think I need to manually update this memberlist join_members list. Is there a better solution without consul?

pracucci commented 1 year ago

> If I scale out replicas, I think I need to manually update this memberlist join_members list. Is there a better solution without consul?

The solution could be configuring join_members with a DNS address which resolves to all instances (if you use an SRV record, then you should also be able to specify the port, which can be different for each instance; see the DNS-based service discovery doc). Then, when you scale up/down, you should update the DNS entry (whether this can be easily automated depends on your actual infrastructure).

_Note: I haven't personally tried join_members with an SRV record, but since the same resolver is used in other places where we use SRV records, it should work (if it doesn't, it's something we could fix)._
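
A sketch of what that could look like, using the `dnssrv+` notation from the DNS-based service discovery doc and a hypothetical SRV record:

```yaml
# The SRV record _gossip._tcp.mimir.example.com (hypothetical) returns a
# host:port pair per replica, so scaling only requires a DNS update.
memberlist:
  join_members:
    - dnssrv+_gossip._tcp.mimir.example.com
```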

lemon-li commented 1 year ago

> The solution could be configuring join_members with a DNS address which resolves to all instances (if you use an SRV record, then you should also be able to specify the port, which can be different for each instance; see the DNS-based service discovery doc).

Thanks. I may need to add an SRV record in AWS Route 53, and DNS records may not be automatically set up if I change to memberlist. Consul is already a service in our infrastructure, so I would prefer for consul to still be supported.

pracucci commented 1 year ago

> Consul is already a service in our infrastructure, so I would prefer for consul to still be supported.

Makes sense. I see that using memberlist may be more complicated than etcd/consul if you're not running on Kubernetes. We'll keep etcd/consul support until we have a better solution for discovering memberlist seed nodes.

ryan-dyer-sp commented 9 months ago

My two cents: https://grafana.com/docs/mimir/latest/configure/configure-high-availability-deduplication/#configure-the-ha-tracker-kv-store. If etcd or consul is still required for this functionality, then it seems off to me to force deprecation of a different component. Ideally, I'd love to not have either of these dependencies, but for the ha_tracker it's still a requirement. And for us we're stuck with etcd, as the consul integration doesn't support TLS, which for us is a requirement, even though I find managing a consul cluster in K8s much easier than etcd.
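
For reference, a sketch of the HA tracker pointing at etcd with TLS, based on Mimir's etcd client options (the endpoint and certificate paths are hypothetical):

```yaml
distributor:
  ha_tracker:
    enable_ha_tracker: true
    kvstore:
      store: etcd
      etcd:
        endpoints:
          - https://etcd.example.com:2379   # hypothetical endpoint
        tls_enabled: true
        tls_cert_path: /etc/mimir/tls/client.crt
        tls_key_path: /etc/mimir/tls/client.key
        tls_ca_path: /etc/mimir/tls/ca.crt
```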