sclorg / mongodb-container

MongoDB container images based on Red Hat Software Collections and intended for OpenShift and general usage. Users can choose between Red Hat Enterprise Linux, Fedora, and CentOS based images.
https://softwarecollections.org
Apache License 2.0

Replica Set initialization without election #140

Closed rhcarvalho closed 7 years ago

rhcarvalho commented 8 years ago

In a conversation with @php-coder today he brought an excellent question and idea:

Why does the pod running the replica set initiation code start as a PRIMARY member, and then step down and trigger an election among the other members?

It would be much simpler if the post-deployment-hook pod simply picked one of the deployed pods and made it the primary, ran all the initialization steps, added the other pods to the replica set, and quit. The hook pod doesn't need to run mongod.

This idea could simplify matters greatly, reduce the startup time, and reduce the surface area for bugs.
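
For illustration, a minimal sketch of what such a hook pod might run, under the assumption that it can reach the members over the network (the comments below point out the authentication catch with that). The host names and replica set name here are made up:

```bash
#!/bin/bash
# Hypothetical post-deployment hook: pick the first member as primary and
# initiate the replica set with the full member list, so nobody has to
# step down afterwards. The hook itself never runs mongod.
set -eu

MEMBER_HOSTS=(mongodb-1:27017 mongodb-2:27017 mongodb-3:27017)  # illustrative

# Build the rs.initiate() member list.
members=""
for i in "${!MEMBER_HOSTS[@]}"; do
  members+="{_id: $i, host: \"${MEMBER_HOSTS[$i]}\"},"
done

# Run the initiation against the chosen first member and quit.
mongo --host "${MEMBER_HOSTS[0]%%:*}" admin \
  --eval "rs.initiate({_id: \"rs0\", members: [${members%,}]})"
```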

grdryn commented 8 years ago

Sounds good. There's an issue that I've sometimes seen where the initiator details are still in the members list even after it drops out. This should at least cover that! :+1:

php-coder commented 8 years ago

Food for thought: currently all pods require authentication, and you cannot connect to a pod right after its creation because it doesn't have any users yet. We create users right after adding a node to the replica set.

omron93 commented 8 years ago

@php-coder is right. Mongod with a keyFile (which implies auth) allows connections only from localhost (https://docs.mongodb.org/v2.6/core/authentication/#localhost-exception).

So the only way to make this proposal work is to be able to hand some master token to one of the pods, so that we can choose which pod should be the first master (the post-deploy hook could set some global environment variable or something similar to tell the pods which one should start the initialization. Is this somehow possible in OpenShift/Kubernetes?).
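
For illustration, this is roughly why a hook pod cannot do the user setup remotely: the first user must be created from inside the pod over the localhost exception. A sketch, assuming the image's MONGODB_ADMIN_PASSWORD variable:

```bash
# Run inside the pod: the localhost exception lets this first command through
# even though auth is enforced; the same command from another host would be
# rejected. Assumes MONGODB_ADMIN_PASSWORD is set, as in these images.
mongo admin --eval "db.createUser({
  user: 'admin',
  pwd: '$MONGODB_ADMIN_PASSWORD',
  roles: [{role: 'root', db: 'admin'}]
})"
```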

rhcarvalho commented 8 years ago

A single deployment config with multiple replicas will use the same env vars for all pods. One way to distinguish "who should be master" is to set the master name in the env. Then each pod would check whether it should act as master or not. But then, the name is also assigned by the platform, so we cannot put it in the template... leaving this as food for thought.
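
A sketch of that check, with a hypothetical MONGODB_MASTER_NAME variable standing in for whatever the template would set:

```bash
# Hypothetical startup logic: every pod compares its own hostname against a
# master name set in the template. The catch noted above: pod names are
# assigned by the platform, so the template cannot know them in advance.
if [ "$(hostname)" = "${MONGODB_MASTER_NAME:-}" ]; then
  initiate_replica_set     # placeholder for the initialization steps
else
  add_self_to_replica_set  # placeholder: wait for the master, then rs.add()
fi
```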

Deterministic choice of master can be done if we have multiple services, 1 DC per service, 1 pod per DC, and in that case we can include persistent storage as well.

See https://www.mongodb.com/blog/post/leaf-in-the-wild-leading-soccer-streaming-service-fubotv-scales-its-business-with-mongodb-docker-containers-and-kubernetes

omron93 commented 8 years ago

Deterministic choice of master can be done if we have multiple services, 1 DC per service, 1 pod per DC, and in that case we can include persistent storage as well.

See https://www.mongodb.com/blog/post/leaf-in-the-wild-leading-soccer-streaming-service-fubotv-scales-its-business-with-mongodb-docker-containers-and-kubernetes

If there were one service for each member (as mentioned in the blog post), would it be possible to change the number of replica set members at runtime?

rhcarvalho commented 8 years ago

If there were one service for each member (as mentioned in the blog post), would it be possible to change the number of replica set members at runtime?

Not automatically with oc scale.
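
To make the trade-off concrete: with a single deployment config, scaling is one command, while the one-service-per-member layout from the blog post has no single object for oc scale to act on (object names below are illustrative):

```bash
# One DC for all members: trivial to scale, but pod IPs are unstable.
oc scale dc/mongodb --replicas=5

# One service + one DC per member: adding a member means creating new
# objects by hand; oc scale cannot do this for you.
oc create -f mongodb-member-4-service.yaml
oc create -f mongodb-member-4-dc.yaml
```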

omron93 commented 8 years ago

Not automatically with oc scale.

Isn't that a requirement?

rhcarvalho commented 8 years ago

@omron93 for me, today, it's a debatable feature that looks good in a demo, but doesn't work well in production.

My hope for a better future rests on being able to address pods by a stable hostname (https://github.com/kubernetes/kubernetes/pull/24362) and having scalable persistent storage (https://github.com/kubernetes/kubernetes/pull/18016); those will put us in a much better situation.

AFAICT replica sets in MongoDB are not meant to be used the way we use them, adding/removing IPs from the configuration when a pod dies/restarts. The replica set configuration should remain stable when a pod goes down and comes up again, so that elections work the way they are meant to and we don't have the known problems with zombie IPs in the config, etc. The only way to make that work today is to use service IPs/hostnames instead of pod IPs in the config.

omron93 commented 8 years ago

Initializing the replica set without an election could be possible, even without doing the initialization in a post-deploy hook.

@bparees @rhcarvalho What about making MONGODB_INITIAL_REPLICA_COUNT mandatory? It is used in the default example, and without it there is a warning that it is better to set it:

Attention: MONGODB_INITIAL_REPLICA_COUNT is not set and it could lead to a improperly configured replica set. (https://github.com/sclorg/mongodb-container/blob/master/2.4/root/usr/share/container-scripts/mongodb/initiate_replica.sh#L17)

With this variable, pods could wait until all of them are included in endpoints(), and then the pod with the lowest IP address could initialize the replica set and the other pods could add themselves to it. So no post-deploy hook would be needed.
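
A sketch of that startup flow, assuming endpoints() prints one pod IP per line the way the image's helper scripts resolve the service, and with the registration step left as a placeholder:

```bash
# Every pod waits until the service lists the expected number of endpoints.
while [ "$(endpoints | wc -l)" -lt "$MONGODB_INITIAL_REPLICA_COUNT" ]; do
  sleep 2
done

# Deterministic, election-free choice: the pod with the lowest IP initiates.
lowest_ip=$(endpoints | sort -t . -k1,1n -k2,2n -k3,3n -k4,4n | head -n 1)

if [ "$(hostname -i)" = "$lowest_ip" ]; then
  mongo admin --eval 'rs.initiate()'  # this pod becomes the primary
else
  add_self_to_replica_set             # placeholder: rs.add() once primary is up
fi
```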

bparees commented 8 years ago

@rhcarvalho has been more involved in this than I have, but yeah, making it required when trying to run in replicated mode seems OK (obviously we don't want users to have to set it when running in standalone mode).

rhcarvalho commented 8 years ago

I think this idea requires experimentation.

In the long term I don't much like having MONGODB_INITIAL_REPLICA_COUNT manually provided by the end user, because it doesn't play well with the platform / it doesn't scale.

I would rather see a different mechanism that integrates with the number of replicas from the platform. It's too late at night now to remember whether this is something that could be done using the Kubernetes downward API to fetch the number of replicas and set an env var to that value...

Of course if you're not using OpenShift/Kube, you'd still need to be able to use the image, though the replica set scenario becomes less and less likely to be usable.

I have nothing against experimentation and PRs trying either approach.

omron93 commented 7 years ago

@bparees Can we expect that MongoDB in OpenShift will use something other than PetSet?

bparees commented 7 years ago

@omron93 at this point I think StatefulSet (what used to be called PetSet) is the only reasonable choice for MongoDB replication.

omron93 commented 7 years ago

OK, thanks. In that case, this issue is solved (creating MongoDB replication in Kubernetes using a PetSet template does not require an additional election).
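
For illustration, the stable DNS names a StatefulSet gives each member are what make the election-free initiation straightforward; a sketch assuming a StatefulSet and headless service both named mongodb:

```bash
# With a StatefulSet, pod mongodb-0 can initiate the replica set using the
# stable DNS names of its peers; the config survives restarts because the
# hostnames do. Names assume a StatefulSet/headless service called "mongodb".
if [ "$(hostname)" = "mongodb-0" ]; then
  mongo admin --eval 'rs.initiate({
    _id: "rs0",
    members: [
      {_id: 0, host: "mongodb-0.mongodb:27017"},
      {_id: 1, host: "mongodb-1.mongodb:27017"},
      {_id: 2, host: "mongodb-2.mongodb:27017"}
    ]
  })'
fi
```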

Feel free to reopen.