openark / orchestrator

MySQL replication topology management and HA
Apache License 2.0

orchestrator-raft: are IP addresses really required? #253

Open · bbeaudreault opened this issue 7 years ago

bbeaudreault commented 7 years ago

The docs mention that RaftBind and RaftNodes should be IP addresses. Is this really true, and if so is there a reason?

For those of us running in Kubernetes, this is hard. Ideally we could use the stable hostnames of the pods in the replica set. It's possible to set up 3-5 distinct deployments, each with its own service with a stable IP address, and use those instead, but that's more work and more cumbersome in the long term than just using the built-in primitives.
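
For context, the Kubernetes primitive referred to here is the stable per-pod DNS name a StatefulSet gets when paired with a headless Service. A minimal sketch (names and image are hypothetical); note that these stable names still resolve to pod IPs that change on rescheduling, which is exactly why the resolve-once behaviour hurts:

# Headless Service + StatefulSet give pods stable DNS names such as
# orchestrator-0.orchestrator-raft.<namespace>.svc.cluster.local
apiVersion: v1
kind: Service
metadata:
  name: orchestrator-raft
spec:
  clusterIP: None            # headless: DNS resolves to the individual pod IPs
  selector:
    app: orchestrator
  ports:
    - name: raft
      port: 10008
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: orchestrator
spec:
  serviceName: orchestrator-raft   # ties the pod hostnames to the headless Service
  replicas: 3
  selector:
    matchLabels:
      app: orchestrator
  template:
    metadata:
      labels:
        app: orchestrator
    spec:
      containers:
        - name: orchestrator
          image: orchestrator:latest   # hypothetical image reference
          ports:
            - name: raft
              containerPort: 10008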

shlomi-noach commented 7 years ago

@bbeaudreault the hashicorp/raft library, which I use, explicitly resolves the IP address of whatever you provide and then compares everything against that. I've made two attempts at working around this (I don't like it either), but so far without success.

bbeaudreault commented 7 years ago

Ok thanks for the info. If you want, we could close this issue and I can follow the next steps issue instead. Or we can keep this open. Up to you. Will follow the other one either way. Thanks!

shlomi-noach commented 7 years ago

Let's keep this one open.

shlomi-noach commented 7 years ago

@bbeaudreault if instead of arguing with the hashicorp/raft library, I let you specify FQDN and resolve that to an IP upon startup (and from there on only use those IP addresses) -- does that solve your problem?

bbeaudreault commented 7 years ago

I appreciate the effort, but unfortunately that does not solve the problem. In Kubernetes the pods can move around, which causes the IPs to change. That said, I say let's keep the change for now even though it doesn't resolve this issue -- it does make it a little easier to use an FQDN at startup rather than hard-coding the IPs.

shlomi-noach commented 7 years ago

That's unfortunate. It would require some rewrite of the hashicorp/raft module. Would you like to put a question on their repo? https://github.com/hashicorp/raft/issues

shlomi-noach commented 6 years ago

If I understand correctly, support for this has recently been added in 1.0.0:

I have a similar interest in making this work. Will look into the recent hashicorp/raft changes.

hmcgonig commented 6 years ago

The above would be fantastic for Kubernetes deployments. Currently, the workaround for this problem is assigning each node a service, then using the statically assigned cluster IPs, which don't change.
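
A rough sketch of that workaround, with hypothetical names and addresses: one single-replica Deployment per orchestrator node (not shown) plus one Service per node, each pinned to a fixed clusterIP that can then be listed in RaftNodes:

# One per-node Service (repeat for orchestrator-1 and orchestrator-2); the fixed
# clusterIP is what goes into RaftNodes, since it never changes when pods move
apiVersion: v1
kind: Service
metadata:
  name: orchestrator-0
spec:
  clusterIP: 10.96.100.10      # statically chosen from the cluster's service CIDR (hypothetical)
  selector:
    app: orchestrator
    node: orchestrator-0       # matches only this node's single-replica Deployment
  ports:
    - name: raft
      port: 10008
    - name: http
      port: 3000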

shlomi-noach commented 6 years ago

> Currently, the workaround for this problem is assigning each node a service, then using the statically assigned cluster IPs, which don't change.

This is actually the approach I'm looking at right now, and it seems to satisfy my requirements: a single orchestrator node per DC (as we do today), which for us means a single orchestrator node per Kubernetes cluster. Putting it behind a Service of type LoadBalancer makes a lot of sense.

This is still WIP.

Consul 1.0.0 also introduces major changes to the initial raft setup, which I'm unhappy with, so I will try to push that to a later stage.

hkotka commented 4 years ago

I ran into this issue with a Docker Swarm setup, and unfortunately there is no way to assign static IP addresses to Swarm services, so I have to rethink the architecture a bit. It would be great if orchestrator could periodically re-resolve the RaftNodes FQDNs, if possible :)

sbrattla commented 4 years ago

@bbeaudreault and @hmcgonig would you mind providing a few more details on how you went about running orchestrator on Kubernetes? My understanding is that you deploy, say, 3 services where each service represents a single instance (peer) of orchestrator, and each orchestrator instance is then reachable via its service hostname. Is this understanding correct?

I'm using Docker Swarm, but the concepts should be quite similar to Kubernetes. However, I cannot use the IP associated with a service in RaftBind, as orchestrator then logs an error saying it cannot bind to that address (which makes sense, since it's essentially a load balancer IP -- a service replica should not be able to bind to the service IP). And since the IP in RaftBind must also appear in RaftNodes, I hit a wall.

shlomi-noach commented 4 years ago

I tried giving this another go. With the time frame I had, I wasn't able to adapt to the new hashicorp/raft library; there have been a lot of architectural changes, and adapting orchestrator to them is non-trivial.

shlomi-noach commented 4 years ago

See https://github.com/openark/orchestrator/pull/1208 for a potential alternative approach.

sbrattla commented 4 years ago

Thanks for giving it a try @shlomi-noach! I may give #1208 a try, or alternatively just set up three small servers to run the software in a "classic" setup.

jianhaiqing commented 4 years ago

> The above would be fantastic for Kubernetes deployments. Currently, the workaround for this problem is assigning each node a service, then using the statically assigned cluster IPs, which don't change.

"RaftEnabled": true,
"RaftDataDir": "/var/lib/orchestrator",
 "RaftBind": "orch-node1-0:10008",
 "RaftAdvertise": "orch-node1.raft:10008",
 "DefaultRaftPort": 10008,
 "RaftNodes": [
      "orch-node1.raft:10008",
      "orch-node2.raft:10008",
      "orch-node3.raft:10008",
      "orch-node4.raft:10008",
      "orch-node5.raft:10008"
 ]
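
The pattern here appears to be that RaftBind is the pod's own hostname (an address the pod can actually bind to), while RaftAdvertise and RaftNodes use stable per-node Service names that peers can resolve. A hypothetical sketch of one such Service, assuming a namespace called raft (so that orch-node1.raft resolves in-cluster) and a per-node pod label -- both assumptions, not taken from the config above:

# Hypothetical Service backing the name orch-node1.raft
# (service "orch-node1" in namespace "raft"), selecting a single orchestrator pod
apiVersion: v1
kind: Service
metadata:
  name: orch-node1
  namespace: raft
spec:
  selector:
    app: orchestrator
    node: orch-node1
  ports:
    - name: raft
      port: 10008
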
redstonemercury commented 4 years ago

I'm having issues getting a 3-node raft orchestrator deployed on k8s, and it seems related to this: once a pod goes down in the StatefulSet, the remaining pods keep resolving its hostname to the old IP and only re-resolve when they themselves come up new (at which point they have a different IP and trigger the other pods to need a stop/start). It feels like the underlying raft protocol should re-resolve the hostnames, but I also understand why that change isn't easy.

If the workaround is to deploy 3 separate services with static IPs, how are y'all ensuring you don't have multiple pods go down at the same time? Are you just accepting the risk that all the pods could move around simultaneously, creating orchestrator downtime? Or am I missing something about how the 3 separate services are meant to be implemented within the same deployment?

derekperkins commented 4 years ago

Since you're in k8s, you would just use a pod disruption budget to keep multiple pods from going down simultaneously.
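
For reference, a minimal sketch of such a PodDisruptionBudget, using the labels and minAvailable value that appear later in this thread (names otherwise hypothetical):

# Keep at least 2 of the 3 orchestrator pods available during voluntary disruptions
apiVersion: policy/v1beta1       # policy/v1 on newer clusters
kind: PodDisruptionBudget
metadata:
  name: orchestrator
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app.kubernetes.io/app: orchestrator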

sudo-ankitmishra commented 3 years ago

I am facing issues while deploying a 3-node raft setup on k8s. When I checked the logs for one of the orchestrator nodes, it says "failed to open raft store, unknown port." I have 3 separate deployments for orchestrator and I am using the IP of the orchestrator service in my config file. Sample raft config:

"RaftEnabled": true,
"RaftDataDir": "/var/lib/orchestrator",
"DefaultRaftPort": 10008,
"RaftBind": "http://192.168.50.175:10008",
"RaftNodes": ["http://192.168.50.175:10008", "http://192.168.50.176:10008", "http://192.168.50.177:10008"]

Any help will be appreciated.

shlomi-noach commented 3 years ago

@sudo-ankitmishra remove the "http://" prefix. In the future, please make sure to open a new issue so we can better track questions. Thank you!

sudo-ankitmishra commented 3 years ago

Hey @shlomi-noach, thanks for the reply. But I am still not able to get orchestrator up and running; this is the error I am receiving right now:

2020-10-02 15:18:06 DEBUG Connected to orchestrator backend: orchestrator:?@tcp(10.32.0.88:3306)/orchestrator?timeout=2s
2020-10-02 15:18:06 DEBUG Orchestrator pool SetMaxOpenConns: 128
2020-10-02 15:18:06 DEBUG Initializing orchestrator
2020-10-02 15:18:06 INFO Connecting to backend 10.32.0.88:3306: maxConnections: 128, maxIdleConns: 32
2020-10-02 15:18:06 INFO Starting Discovery
2020-10-02 15:18:06 INFO Registering endpoints
2020-10-02 15:18:06 INFO continuous discovery: setting up
2020-10-02 15:18:06 DEBUG Setting up raft
2020-10-02 15:18:06 DEBUG raft: advertise=192.168.50.177:10008
2020-10-02 15:18:06 ERROR failed to open raft store: listen tcp 192.168.50.177:10008: bind: cannot assign requested address
2020-10-02 15:18:06 FATAL 2020-10-02 15:18:06 ERROR failed to open raft store: listen tcp 192.168.50.177:10008: bind: cannot assign requested address

shlomi-noach commented 3 years ago

In RaftNodes, please also remove the :port and keep just the IPs. And please open a new issue for any further discussion, thank you.
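
Applying both suggestions to the earlier sample would give roughly the following, with DefaultRaftPort supplying the port (a sketch of the advice above, not a verified configuration):

"RaftEnabled": true,
"RaftDataDir": "/var/lib/orchestrator",
"DefaultRaftPort": 10008,
"RaftBind": "192.168.50.175",
"RaftNodes": ["192.168.50.175", "192.168.50.176", "192.168.50.177"]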

redstonemercury commented 3 years ago

I just want to follow up here to add that a PodDisruptionBudget does not appear to work around this. (I can start a new issue if requested, but this problem is a direct result of the inability to use a StatefulSet, so I thought it made sense to continue the discussion here.)

From the documentation here https://kubernetes.io/docs/concepts/workloads/pods/disruptions/#how-disruption-budgets-work:

"Pods which are deleted or unavailable due to a rolling upgrade to an application do count against the disruption budget, but workload resources (such as Deployment and StatefulSet) are not limited by PDBs when doing rolling upgrades. Instead, the handling of failures during application updates is configured in the spec for the specific workload resource."

So in this case, with 3 separate Deployment resources each managing one pod, it appears you cannot use a PodDisruptionBudget to prevent all 3 pods from going down when you (for example) update a config setting in a ConfigMap used by those pods. The PodDisruptionBudget correctly registers that you're over the allowed disruptions, but it does not prevent each independent Deployment's pod from restarting simultaneously, since it assumes you're managing that within the Deployment resource.
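
If those simultaneous restarts are in fact rolling updates triggered through the pod template (for example via a ConfigMap-hash annotation), then the "configured in the spec" part of the quoted docs points at the per-Deployment update strategy. A hedged sketch, with hypothetical names, that makes each single-replica Deployment start its replacement before stopping the old pod:

# Surge-first update strategy for one of the three single-replica Deployments
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orchestrator-0
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0        # keep the existing pod until its replacement is Ready
      maxSurge: 1
  selector:
    matchLabels:
      app.kubernetes.io/app: orchestrator
      app.kubernetes.io/name: orchestrator-0
  template:
    metadata:
      labels:
        app.kubernetes.io/app: orchestrator
        app.kubernetes.io/name: orchestrator-0
    spec:
      containers:
        - name: orchestrator
          image: orchestrator:latest          # hypothetical image reference
          readinessProbe:
            httpGet:
              path: /api/lb-check             # orchestrator's health-check endpoint
              port: 3000

Whether briefly running two copies of the same raft node during the surge is acceptable is a separate question; this only addresses the simultaneous-restart aspect.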

This is my setup, and in repeated tests, updating a ConfigMap means that all 3 containers (each in their own Deployment) restart at the same time.

Pods example:

Name:         orchestrator-0-xxxxxxx-xxxx
Namespace:    kube-monitoring
Priority:     0
Node:         ip-x-x-x-x.us-west-2.compute.internal/x.x.x.x
Start Time:   Tue, 17 Nov 2020 03:40:41 -0500
Labels:       app.kubernetes.io/app=orchestrator
              app.kubernetes.io/instance=orchestrator
              app.kubernetes.io/name=orchestrator-0
              pod-template-hash=xxxxxxx
...
Name:         orchestrator-1-xxxxxxx-xxxx
Namespace:    kube-monitoring
Priority:     0
Node:         ip-x-x-x-x.us-west-2.compute.internal/x.x.x.x
Start Time:   Fri, 13 Nov 2020 11:15:41 -0500
Labels:       app.kubernetes.io/app=orchestrator
              app.kubernetes.io/instance=orchestrator
              app.kubernetes.io/name=orchestrator-1
              pod-template-hash=xxxxxxx
...
Name:         orchestrator-2-xxxxxxx-xxxx
Namespace:    kube-monitoring
Priority:     0
Node:         ip-x-x-x-x.us-west-2.compute.internal/x.x.x.x
Start Time:   Mon, 16 Nov 2020 08:59:21 -0500
Labels:       app.kubernetes.io/app=orchestrator
              app.kubernetes.io/instance=orchestrator
              app.kubernetes.io/name=orchestrator-1
              pod-template-hash=xxxxxxx

PodDisruptionBudget details:

Name:           orchestrator
Namespace:      kube-monitoring
Min available:  2
Selector:       app.kubernetes.io/app=orchestrator
Status:
    Allowed disruptions:  1
    Current:              3
    Desired:              2
    Total:                3
Events:                   <none>

I can see them restart simultaneously when I update a config, creating downtime:

orchestrator-0-xxxxxx-xxxx                  0/1     ContainerCreating   0          17s
orchestrator-1-xxxxxx-xxxx                  0/1     ContainerCreating   0          16s
orchestrator-2-xxxxxx-xxxx                  0/1     ContainerCreating   0          14s

The PodDisruptionBudget sees this happening, but does not seem to prevent it:

Name:           orchestrator
Namespace:      kube-monitoring
Min available:  2
Selector:       app.kubernetes.io/app=orchestrator
Status:
    Allowed disruptions:  0
    Current:              0
    Desired:              2
    Total:                3
Events:                   <none>

Am I doing something wrong in my setup above? Or am I correctly understanding from the Kubernetes documentation that a PodDisruptionBudget will not manage restart behavior across Deployments? If the latter, how do folks keep an application highly available in Kubernetes while using separate Deployment resources?