openanalytics / shinyproxy

ShinyProxy - Open Source Enterprise Deployment for Shiny and data science apps
https://www.shinyproxy.io
Apache License 2.0

Rolling update for shinyproxy deployment causes orphan pods #169

Closed: ramkumarg1 closed this issue 3 years ago

ramkumarg1 commented 5 years ago

Hi, when there is a change in application.yaml and a rolling update is performed (scaling replicas to 0 and then back to 1, mainly because the new ShinyProxy image needs to be downloaded from the artifactory), all the app pods that were spun up by the previous ShinyProxy instance are left behind as zombies.

To reproduce:

NAME                              READY   STATUS    RESTARTS   AGE
pod/shinyproxy-7f76d48c79-8x9hs   2/2     Running   0          41m

NAME                 TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
service/shinyproxy   NodePort   172.30.85.191   <none>        8080:32094/TCP   40m

NAME                         DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/shinyproxy   1         1         1            1           41m

NAME                                    DESIRED   CURRENT   READY   AGE
replicaset.apps/shinyproxy-7f76d48c79   1         1         1       41m

NAME                                  HOST/PORT                                PATH   SERVICES     PORT   TERMINATION   WILDCARD
route.route.openshift.io/shinyproxy   shinyproxy-aap.apps.cpaas.service.test          shinyproxy                        None

NAME                                              READY   STATUS    RESTARTS   AGE
pod/shinyproxy-7f76d48c79-8x9hs                   2/2     Running   0          43m
pod/sp-pod-e7603441-03ba-470b-925a-22cfba1716de   1/1     Running   0          12s

NAME                 TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
service/shinyproxy   NodePort   172.30.85.191   <none>        8080:32094/TCP   43m

NAME                         DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/shinyproxy   1         1         1            1           43m

NAME                                    DESIRED   CURRENT   READY   AGE
replicaset.apps/shinyproxy-7f76d48c79   1         1         1       43m

NAME                                  HOST/PORT                                PATH   SERVICES     PORT   TERMINATION   WILDCARD
route.route.openshift.io/shinyproxy   shinyproxy-aap.apps.cpaas.service.test          shinyproxy                        None

kubectl scale --replicas=0 deployment/shinyproxy
deployment.extensions/shinyproxy scaled

kubectl scale --replicas=1 deployment/shinyproxy
deployment.extensions/shinyproxy scaled

NAME                                              READY   STATUS              RESTARTS   AGE
pod/shinyproxy-7f76d48c79-l5fvw                   0/2     ContainerCreating   0          4s
pod/sp-pod-e7603441-03ba-470b-925a-22cfba1716de   1/1     Running             0          1m

NAME                 TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
service/shinyproxy   NodePort   172.30.85.191   <none>        8080:32094/TCP   44m

NAME                         DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/shinyproxy   1         1         1            0           45m

NAME                                    DESIRED   CURRENT   READY   AGE
replicaset.apps/shinyproxy-7f76d48c79   1         1         0       45m

NAME                                  HOST/PORT                                PATH   SERVICES     PORT   TERMINATION   WILDCARD
route.route.openshift.io/shinyproxy   shinyproxy-aap.apps.cpaas.service.test          shinyproxy                        None

dseynaev commented 5 years ago

Hi @ramkumarg1

When ShinyProxy receives a SIGTERM signal (when the deployment is scaled down), it should terminate gracefully by first stopping all application pods. You may have to increase the grace period (terminationGracePeriodSeconds in the pod spec; the default is 30s). If ShinyProxy is unable to terminate within this period, it receives a SIGKILL and is terminated immediately, leaving behind orphan pods. More info here: https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/

ramkumarg1 commented 5 years ago

Thanks @dseynaev. I changed the deployment spec to include terminationGracePeriodSeconds, but it didn't make a difference; the pod was killed immediately. Perhaps this issue is linked to https://github.com/kubernetes/kubernetes/issues/47576, where Spring Boot needs to handle SIGTERM gracefully?

spec:
  terminationGracePeriodSeconds: 180
  containers:
  - name: shinyproxy
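
For placement, here is a minimal sketch of where this setting sits in a full Deployment manifest; note that the grace period belongs to the pod template spec, not to an individual container. The metadata names, labels and image tag below are illustrative assumptions, not taken from the issue:

# Sketch only: names, labels and the image tag are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: shinyproxy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: shinyproxy
  template:
    metadata:
      labels:
        app: shinyproxy
    spec:
      terminationGracePeriodSeconds: 180   # pod-level time allowed between SIGTERM and SIGKILL
      containers:
      - name: shinyproxy
        image: openanalytics/shinyproxy:2.3.0
        ports:
        - containerPort: 8080
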
muscovitebob commented 4 years ago

We observe the same issue with zombie pods, and for us the termination grace period setting also does not resolve this.

fmannhardt commented 4 years ago

I have the same issue, and this is what ShinyProxy/ContainerProxy logs upon termination:

2020-01-30 10:56:56.785  INFO 1 --- [           main] e.o.c.ContainerProxyApplication          : Started ContainerProxyApplication in 39.115 seconds (JVM running for 43.619)
2020-01-30 10:57:01.374  INFO 1 --- [  XNIO-2 task-1] io.undertow.servlet                      : Initializing Spring FrameworkServlet 'dispatcherServlet'
2020-01-30 10:57:01.375  INFO 1 --- [  XNIO-2 task-1] o.s.web.servlet.DispatcherServlet        : FrameworkServlet 'dispatcherServlet': initialization started
2020-01-30 10:57:01.507  INFO 1 --- [  XNIO-2 task-1] o.s.web.servlet.DispatcherServlet        : FrameworkServlet 'dispatcherServlet': initialization completed in 131 ms
2020-01-30 10:57:26.275  INFO 1 --- [ XNIO-2 task-16] e.o.containerproxy.service.UserService   : User logged in [user: **]
2020-01-30 10:57:35.802  INFO 1 --- [  XNIO-2 task-3] e.o.containerproxy.service.ProxyService  : Proxy activated [user: ***] [spec: insight] [id: 9274ad33-665a-4d47-bab5-6c4b39a618b8]
2020-01-30 10:59:02.376  INFO 1 --- [       Thread-2] ConfigServletWebServerApplicationContext : Closing org.springframework.boot.web.servlet.context.AnnotationConfigServletWebServerApplicationContext@2b2948e2: startup date [Thu Jan 30 10:56:24 GMT 2020]; root of context hierarchy
2020-01-30 10:59:02.377 ERROR 1 --- [pool-4-thread-1] java.io.InputStreamReader                : Error while pumping stream.
java.io.EOFException: null
    at okio.RealBufferedSource.require(RealBufferedSource.java:61) ~[okio-1.15.0.jar!/:na]
    at okio.RealBufferedSource.readHexadecimalUnsignedLong(RealBufferedSource.java:303) ~[okio-1.15.0.jar!/:na]
    at okhttp3.internal.http1.Http1Codec$ChunkedSource.readChunkSize(Http1Codec.java:469) ~[okhttp-3.12.0.jar!/:na]
    at okhttp3.internal.http1.Http1Codec$ChunkedSource.read(Http1Codec.java:449) ~[okhttp-3.12.0.jar!/:na]
    at okio.RealBufferedSource$1.read(RealBufferedSource.java:439) ~[okio-1.15.0.jar!/:na]
    at java.io.InputStream.read(InputStream.java:101) ~[na:1.8.0_171]
    at io.fabric8.kubernetes.client.utils.BlockingInputStreamPumper.run(BlockingInputStreamPumper.java:49) ~[kubernetes-client-4.2.2.jar!/:na]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_171]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_171]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_171]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_171]
    at java.lang.Thread.run(Thread.java:748) [na:1.8.0_171]
2020-01-30 10:59:02.394  INFO 1 --- [       Thread-2] o.s.j.e.a.AnnotationMBeanExporter        : Unregistering JMX-exposed beans on shutdown
2020-01-30 10:59:02.403  INFO 1 --- [       Thread-2] o.s.j.e.a.AnnotationMBeanExporter        : Unregistering JMX-exposed beans
2020-01-30 10:59:02.514  WARN 1 --- [       Thread-2] .s.c.a.CommonAnnotationBeanPostProcessor : Invocation of destroy method failed on bean with name 'proxyService': eu.openanalytics.containerproxy.ContainerProxyException: Failed to stop container
2020-01-30 10:59:02.525  INFO 1 --- [       Thread-2] io.undertow.servlet                      : Destroying Spring FrameworkServlet 'dispatcherServlet'
fmannhardt commented 4 years ago

I found a solution for this issue. It is not actually a problem in ShinyProxy or ContainerProxy, as the Spring Boot app is shut down correctly and gracefully.

The problem is the kubectl proxy sidecar container. Kubernetes does not know that ContainerProxy relies on the sidecar container to communicate with the Kubernetes API. So, on a new deployment, Kubernetes sends SIGTERM to both the ShinyProxy container and the sidecar container in all the old pods. The sidecar terminates immediately, and ContainerProxy then fails to communicate with Kubernetes to clean up its app pods.

I read that Kubernetes is about to address these startup and shutdown ordering dependencies for sidecars in v1.18, as documented here: https://github.com/kubernetes/enhancements/issues/753 and https://banzaicloud.com/blog/k8s-sidecars/

Until then, there is a simple workaround: add the following lifecycle hook to the sidecar container:

          lifecycle:
            preStop:
              exec:
                command: ["sh", "-c", "sleep 5"] # wait 5 seconds to let shinyproxy remove the pods on graceful shutdown
muscovitebob commented 4 years ago

I can confirm @fmannhardt's fix resolves this. Thank you so much!

LEDfan commented 3 years ago

Hi all

With recent versions of ShinyProxy (I'm not sure which version exactly, but at least ShinyProxy 2.3.1) there is no need to use a kube-proxy sidecar. ShinyProxy automatically detects the location and authentication of the Kubernetes API, so this problem should be solved automatically. Nevertheless, thank you for your time and investigation!
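
As a rough sketch of that simplification (assuming a service account with permission to manage pods already exists; its name here is hypothetical), the sidecar container can simply be dropped from the pod template:

    spec:
      serviceAccountName: shinyproxy   # hypothetical service account with pod permissions
      containers:
      - name: shinyproxy
        image: openanalytics/shinyproxy:2.3.1
        # no kubectl-proxy sidecar needed: ShinyProxy detects the in-cluster
        # API endpoint and credentials automatically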