kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

Rolling-update by node #18450

Closed titilambert closed 7 years ago

titilambert commented 8 years ago

Hello, I would like to do a rolling update by node:

  1. Select one node
  2. Stop all pods of the RC on this node.
  3. Start new pods on this node
  4. Select another node
  5. Stop all pods of the RC on this node.
  6. ...

Questions:

Thanks

nikhiljindal commented 8 years ago

No, there is no way to do it right now.

Am curious why you want to do rolling update by node? Is the current rolling update mechanism not enough?

titilambert commented 8 years ago

I can't have instances with different versions on the same node because my processes use shared memory.

bgrant0607 commented 8 years ago

In general, we should do rolling updates by the failure domains that pods are spread across.

To clarify: these pods are communicating via shared memory? How? Why not put all the containers in the same pod? I don't see how this would work without hard affinity #18265.
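For reference, a same-pod setup for a shared-memory workload could look roughly like the sketch below: all processes run as containers of one pod and share a memory-backed emptyDir mounted at /dev/shm (the pod name, container names, and image are only placeholders for this example):

    apiVersion: v1
    kind: Pod
    metadata:
      name: shm-workers                  # placeholder name
    spec:
      volumes:
      - name: shm
        emptyDir:
          medium: Memory                 # tmpfs visible to every container in the pod
      containers:
      - name: worker-1
        image: example/worker:v1         # placeholder image
        volumeMounts:
        - {name: shm, mountPath: /dev/shm}
      - name: worker-2
        image: example/worker:v1         # placeholder image
        volumeMounts:
        - {name: shm, mountPath: /dev/shm}

Since a pod is replaced as a unit during an update, two versions would never overlap on the same /dev/shm in this layout.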

titilambert commented 8 years ago

@bgrant0607

titilambert commented 8 years ago

@nikhiljindal I just made an implementation of rolling update by node using labels on nodes. Do you think this could be integrated into kubectl?

bgrant0607 commented 8 years ago

@titilambert Have you seen kubectl drain? Would that do what you need?

cc @mml @janetkuo @kargakis @mqliang

titilambert commented 8 years ago

@bgrant0607 Hello! I made a first draft of this using node selectors (#22442). kubectl drain does not really fit because it stops scheduling on the current node (https://github.com/kubernetes/kubernetes/blob/master/docs/user-guide/kubectl/kubectl_drain.md). In this first draft, only pods of the targeted RC are impacted by the rolling update, and the node can still receive other pods from other RCs.

davidopp commented 8 years ago

Sorry, I just saw this issue. Would #9043 solve your problem?

titilambert commented 8 years ago

Hello! Not really, the objective of the rolling update by node is to be sure that you can never get 2 different versions of the same RC running on the same node. I don't think that issue covers this case.

bgrant0607 commented 8 years ago

@titilambert If you use a hostPort in your pods, only one can schedule per node.

We also have some anti-affinity features coming that may help: https://github.com/kubernetes/kubernetes/blob/master/docs/design/podaffinity.md
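As a rough sketch of how such an anti-affinity rule might eventually be expressed (written here with the `affinity` pod-spec field that later releases use; the `app` and `version` label names and the image are assumptions for this example), a v2 pod could refuse to schedule onto a node that still runs a v1 pod of the same app:

    apiVersion: v1
    kind: Pod
    metadata:
      name: shm-service-v2               # placeholder name
      labels:
        app: shm-service                 # assumed label
        version: "2"
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - {key: app, operator: In, values: ["shm-service"]}
              - {key: version, operator: NotIn, values: ["2"]}
            topologyKey: kubernetes.io/hostname
      containers:
      - name: shm-service
        image: example/shm-service:v2    # assumed image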

titilambert commented 8 years ago

@bgrant0607 Thanks for your reply! It sounds interesting, but I cannot see how I can be sure that the old RC will not deploy a new pod on the current node. One of the main requirements of this rolling update by node is to delete all pods of the old RC (on the current node) and then, when there are no more such pods on this node, start the creation of pods of the new RC. Maybe I missed something... could you give me more details of your thoughts?

BTW, I'm pretty sure the anti-affinity feature will help make this PR better (maybe we can get this behaviour without using node selectors?).

bgrant0607 commented 8 years ago

@titilambert I still don't understand the reason why you want to stop all pods on a given node at the same time. However, this sounds like a fairly niche use case.

Maybe there is something we could do to make this easier to implement outside of Kubernetes?

bgrant0607 commented 8 years ago

Additionally, as I mentioned in the PR, we're trying to reduce the amount of logic in kubectl (#12143).

djsly commented 8 years ago

Hi Brian, let me try to explain the use case here in more detail.

We have this single-threaded service that requires a lot of RAM (it loads a model into memory).

Since the process is single-threaded, we run multiple instances of the service on the same machine, and we share the RAM across the different instances using /dev/shm.

Now in k8s, we have managed to migrate the service into a single Docker container and we can scale the POD accordingly. The main problem is the rolling update: with service 1 and service 2 running and sharing /dev/shm with state X, we cannot start an update, have service 1 stop, restart, and try to update /dev/shm while service 2 is still using it. (Here we are assuming that service 1 will fill /dev/shm with new data that is incompatible with service 2's version.)

So the only way right now to fix this (at the infra level) is to stop all the PODs running on the node; this ensures that the mounted /dev/shm is released by the last POD being destroyed. Once the host isn't running any POD of that service (v1), we can move on to the upgrade of the service (v2). As the service (v2) boots up, the first POD on the host will reload /dev/shm and the subsequent PODs will simply use the shared /dev/shm.
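(For concreteness, each replica mounts the node's /dev/shm through something like a hostPath volume, so all replicas on that node see the same shared-memory segment; the names and image below are placeholders, not our exact manifests:)

    apiVersion: v1
    kind: Pod
    metadata:
      name: model-server                 # placeholder name
    spec:
      volumes:
      - name: host-shm
        hostPath:
          path: /dev/shm                 # the node's shared-memory mount
      containers:
      - name: model-server
        image: example/model-server:v1   # placeholder image
        volumeMounts:
        - {name: host-shm, mountPath: /dev/shm}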

Going one node at a time allows us to upgrade with no downtime of the service.

We do understand that this is not aligned with microservices best practices, but unfortunately limitations of the service prevent us from moving to a better paradigm that fits well in k8s.

Hopefully this niche case is now clearer for you.

Regards!

Sylvain

davidopp commented 8 years ago

In 1.3 we're adding "pod affinity", which lets you say "when you're trying to schedule a pod from [service, RC, RS, Job, whatever] X, only do so on a node that is already running at least one pod from [service, RC, RS, Job, whatever] Y".

There is a variant of this (that we're not implementing in 1.3, but might later) that says "in addition, if the pod from Y stops running, then kill X"

If you really only have two services, then this variant (that we're not implementing in 1.3) sounds like it would solve your problem. In particular:

  1. Create Y.
  2. Create X, and give it pod affinity so that it can only run on nodes with Y.
  3. Later, update the pod templates in the RCs for X and Y.
  4. Run a script that walks through the nodes in the cluster, killing Y. This will in turn also kill X, and both will reload with their updated pod template.

Of course there is a bit of a race here, where you need some way to make sure X dies before Y restarts.

I'm not saying this is the best way to address your problem, and of course it's hard to compare one nonexistent solution to other nonexistent solutions, but I thought I'd mention it, as this at least fits in with something we're building.
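For what it's worth, the 1.3 pod-affinity piece of this would look roughly like the fragment below in X's pod template, expressed here with the `affinity` field that later releases use (the `service: y` label on Y's pods is just an assumption for this sketch):

    # Fragment of X's pod template: only schedule onto a node that is already
    # running at least one pod labeled service=y (assumed label).
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                service: y
            topologyKey: kubernetes.io/hostname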

djsly commented 8 years ago

Hi David, thanks for the feedback. I might not have explained it well the first time. In my case, service1 and service2 are in fact the same RC; they are just different replicas. I would like to use the concept of an instance number appended to the service name... We can run up to 24 replicas of the same POD on one node. When we do a rolling upgrade, we need any new replica of the new RC to be started on a particular node only once the previous RC's replicas are all stopped on that particular node, to prevent corruption of the mounted /dev/shm partition.

Regards 

davidopp commented 8 years ago

Hi @djsly. Thanks for the clarification. Now I understand -- you want to "roll" one node at a time rather than one replica at a time, and you want to ensure that no updated replica starts on the node until all of the old replicas on the node have been killed.

There's no automated way to do what you're asking. But here's an approach that might be good enough.

Let's say rc1 is the ReplicationController that's managing the current replicas, rc1's PodSpec has a node selector "version=1", and all the nodes in the cluster start out labeled "version=1".

First, you create rc2, a ReplicationController that will manage the new version; it is identical to rc1 except it uses the image name you're upgrading to and it has label selector "version=2" instead of "version=1" (and its name is rc2 instead of rc1, of course). Then

for (each node N)
   set N's NodeSpec.Unschedulable = true
   delete all the pods on N; wait for them to actually be gone
   change N's label from version=1 to version=2
   set N's NodeSpec.Unschedulable = false

Once you're done upgrading the nodes, you can delete rc1.

I realize this isn't perfect, but I think it's the closest you can get without writing your own controller.
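For concreteness, rc2 under this scheme might look something like the sketch below: the pod template points at the new image and carries a node selector for the new node label (all names, labels, replica counts, and the image here are placeholders):

    apiVersion: v1
    kind: ReplicationController
    metadata:
      name: rc2                          # placeholder name
    spec:
      replicas: 24                       # however many replicas you need
      selector:
        app: shm-service                 # assumed labels
        version: "2"
      template:
        metadata:
          labels:
            app: shm-service
            version: "2"
        spec:
          nodeSelector:
            version: "2"                 # only schedule onto nodes already relabeled version=2
          containers:
          - name: shm-service
            image: example/shm-service:v2   # the image you're upgrading to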

djsly commented 8 years ago

Hi David, Thanks for the proposal!

This is indeed exactly what we coded in this PR https://github.com/kubernetes/kubernetes/issues/22442, with the only exception that we kept

NodeSpec.Unschedulable = false

so that other RCs can still deploy replicas that are independent of /dev/shm, keeping the node's resources available for other types of services.

What we would like to do in the end is provide upstream with the changes to support such a scenario, so that we could stop relying on our own fork of the project and eventually get back to using the official releases.

We understand that this should be coded server-side, which makes a lot of sense, and we would like to get guidance on what you would prefer, to ensure that we can work on getting a future PR accepted.

Thanks!

Sylvain

djsly commented 7 years ago

@davidopp, if we are interested in resuming this work by migrating the previous PR to the Deployment object, where would be the best place to start in terms of a proposal? Is #sig-apps the right venue for initial design discussion?

davidopp commented 7 years ago

Yes, sig-apps is probably the right place.

davidopp commented 7 years ago

Hi, sorry we did not get a chance to talk in-person at KubeCon.

Is it possible to do this using your own client? We now have a go client: https://github.com/kubernetes/client-go

If you want to see this built into Deployment, you should write a proposal and discuss it with sig-apps.

djsly commented 7 years ago

@titilambert I guess we can close this, since we have coded the logic on the client side for now. Eventually we will look at using either Operators from CoreOS or third_party_resources directly.

0xmichalis commented 7 years ago

@djsly mind sharing your implementation if it's open-sourced?