m88i / nexus-operator

Sonatype Nexus OSS Kubernetes Operator based on Operator SDK
http://operatorhub.io/operator/nexus-operator-m88i
Apache License 2.0

Pod fails to start after modifying the Nexus resource #191

Closed bszeti closed 3 years ago

bszeti commented 3 years ago

Describe the bug Modifying the Nexus resource triggers a new Deployment, but the new Pod can't start (CrashLoopBackOff) because the previous one is still holding the /nexus-data/lock. The problem is probably caused by the Deployment using spec.strategy=RollingUpdate. Using "Recreate" may help, so that the previous Nexus instance is shut down before the new one is created.
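
For reference, a minimal sketch of what that suggestion would look like on the operator-managed Deployment (names, image, and mounts are taken from the examples later in this thread; probes, env vars, and the rest of the spec are omitted, so this is not the exact manifest the operator generates):

```yaml
# Hypothetical excerpt assuming the suggestion above: with "Recreate" the old
# Pod is terminated (releasing /nexus-data/lock) before the new one is created.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nexus3
spec:
  replicas: 1
  strategy:
    type: Recreate   # instead of the default RollingUpdate
  selector:
    matchLabels:
      app: nexus3
  template:
    metadata:
      labels:
        app: nexus3
    spec:
      containers:
        - name: nexus-server
          image: docker.io/sonatype/nexus3:3.28.1
          volumeMounts:
            - name: nexus3-data
              mountPath: /nexus-data
      volumes:
        - name: nexus3-data
          persistentVolumeClaim:
            claimName: nexus3
```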

To Reproduce Steps to reproduce the behavior:

  1. Create a Nexus resource
  2. Modify the Nexus resource (e.g. mem requests)
  3. The new Pod won't start, but keeps crashing.

Expected behavior The Deployment is successfully rolled out.

Environment OpenShift 4.6.5 Client Version: 4.4.30 Server Version: 4.6.5 Kubernetes Version: v1.19.0+9f84db3

ricardozanini commented 3 years ago

Hi @bszeti! Thanks for filing this bug. I'll take a look at it today. This one could be tricky, since we have some use cases where RollingUpdate would be a better fit. @LCaparelli any issues regarding the update scenario?

LCaparelli commented 3 years ago

Hey @bszeti, thanks for raising the issue. As @ricardozanini mentioned, the RollingUpdate option is a better fit for the way automatic updates are handled, so that your current deployment does not become unavailable while the update is still being rolled out.
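
For context, this is roughly what RollingUpdate implies for this Deployment; the maxSurge/maxUnavailable values below are the Kubernetes defaults and are an assumption here, not necessarily what the operator sets:

```yaml
# Illustrative fragment only. With replicas: 1, maxUnavailable rounds down to 0,
# so the old Pod is kept until the new Pod is Ready; both Pods therefore mount
# the same /nexus-data volume for a short time, which is where the lock-file
# conflict described above comes from.
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
```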

@bszeti can you please tell us how you installed the operator and which version you're running? Along with that, please also share the output from:

$ oc describe pod

Be sure to run it in the project where the failing Pod is running.

I'll try to replicate the issue, but so far here's what I have, installing v0.4.0 via OLM:

Click to expand First, let's install OLM and the operator via OLM: ``` ╰─ kubectl version Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.2", GitCommit:"52c56ce7a8272c798dbc29846288d7cd9fbae032", GitTreeState:"clean", BuildDate:"2020-04-16T11:56:40Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"} Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.0", GitCommit:"e19964183377d0ec2052d1f1fa930c4d7575bd50", GitTreeState:"clean", BuildDate:"2020-08-26T14:23:04Z", GoVersion:"go1.15", Compiler:"gc", Platform:"linux/amd64"} ╰─ curl -sL https://github.com/operator-framework/operator-lifecycle-manager/releases/download/v0.17.0/install.sh | bash -s v0.17.0 customresourcedefinition.apiextensions.k8s.io/catalogsources.operators.coreos.com created customresourcedefinition.apiextensions.k8s.io/clusterserviceversions.operators.coreos.com created customresourcedefinition.apiextensions.k8s.io/installplans.operators.coreos.com created customresourcedefinition.apiextensions.k8s.io/operatorgroups.operators.coreos.com created customresourcedefinition.apiextensions.k8s.io/operators.operators.coreos.com created customresourcedefinition.apiextensions.k8s.io/subscriptions.operators.coreos.com created customresourcedefinition.apiextensions.k8s.io/catalogsources.operators.coreos.com condition met customresourcedefinition.apiextensions.k8s.io/clusterserviceversions.operators.coreos.com condition met customresourcedefinition.apiextensions.k8s.io/installplans.operators.coreos.com condition met customresourcedefinition.apiextensions.k8s.io/operatorgroups.operators.coreos.com condition met customresourcedefinition.apiextensions.k8s.io/operators.operators.coreos.com condition met customresourcedefinition.apiextensions.k8s.io/subscriptions.operators.coreos.com condition met namespace/olm created namespace/operators created serviceaccount/olm-operator-serviceaccount created clusterrole.rbac.authorization.k8s.io/system:controller:operator-lifecycle-manager created clusterrolebinding.rbac.authorization.k8s.io/olm-operator-binding-olm created deployment.apps/olm-operator created deployment.apps/catalog-operator created clusterrole.rbac.authorization.k8s.io/aggregate-olm-edit created clusterrole.rbac.authorization.k8s.io/aggregate-olm-view created operatorgroup.operators.coreos.com/global-operators created operatorgroup.operators.coreos.com/olm-operators created clusterserviceversion.operators.coreos.com/packageserver created catalogsource.operators.coreos.com/operatorhubio-catalog created Waiting for deployment "olm-operator" rollout to finish: 0 of 1 updated replicas are available... deployment "olm-operator" successfully rolled out Waiting for deployment "catalog-operator" rollout to finish: 0 of 1 updated replicas are available... 
deployment "catalog-operator" successfully rolled out Package server phase: Installing Package server phase: Succeeded deployment "packageserver" successfully rolled out ╰─ kubectl create -f https://operatorhub.io/install/nexus-operator-m88i.yaml subscription.operators.coreos.com/my-nexus-operator-m88i created ╰─ kubectl get csv -n operators -w NAME DISPLAY VERSION REPLACES PHASE nexus-operator.v0.4.0 Nexus Operator 0.4.0 nexus-operator.v0.3.0 nexus-operator.v0.4.0 Nexus Operator 0.4.0 nexus-operator.v0.3.0 nexus-operator.v0.4.0 Nexus Operator 0.4.0 nexus-operator.v0.3.0 nexus-operator.v0.4.0 Nexus Operator 0.4.0 nexus-operator.v0.3.0 Pending nexus-operator.v0.4.0 Nexus Operator 0.4.0 nexus-operator.v0.3.0 InstallReady nexus-operator.v0.4.0 Nexus Operator 0.4.0 nexus-operator.v0.3.0 InstallReady nexus-operator.v0.4.0 Nexus Operator 0.4.0 nexus-operator.v0.3.0 Installing nexus-operator.v0.4.0 Nexus Operator 0.4.0 nexus-operator.v0.3.0 Installing nexus-operator.v0.4.0 Nexus Operator 0.4.0 nexus-operator.v0.3.0 Succeeded nexus-operator.v0.4.0 Nexus Operator 0.4.0 nexus-operator.v0.3.0 Succeeded ^C% ``` So far so good. Let's instantiate a Nexus CR using the sample from OperatorHub: ``` ╰─ echo "apiVersion: apps.m88i.io/v1alpha1 kind: Nexus metadata: name: nexus3 spec: networking: expose: false persistence: persistent: false replicas: 1 resources: limits: cpu: '2' memory: 2Gi requests: cpu: '1' memory: 2Gi useRedHatImage: false" | kubectl apply -f - nexus.apps.m88i.io/nexus3 created ╰─ kubectl get pods -w NAME READY STATUS RESTARTS AGE nexus3-558f8fcd68-dmh5c 0/1 ContainerCreating 0 6s nexus3-558f8fcd68-dmh5c 0/1 Running 0 41s nexus3-558f8fcd68-dmh5c 1/1 Running 0 4m46s ^C% ``` All good. Now let's change the memory limit to 2512 MiB: ``` ╰─ echo "apiVersion: apps.m88i.io/v1alpha1 kind: Nexus metadata: name: nexus3 spec: networking: expose: false persistence: persistent: false replicas: 1 resources: limits: cpu: '2' memory: 2512Mi # notice this changed requests: cpu: '1' memory: 2Gi useRedHatImage: false" | kubectl apply -f - && kubectl get pods -w nexus.apps.m88i.io/nexus3 configured NAME READY STATUS RESTARTS AGE nexus3-558f8fcd68-dmh5c 1/1 Running 0 6m19s nexus3-6f7b5858b9-wv6wq 0/1 Pending 0 0s nexus3-6f7b5858b9-wv6wq 0/1 Pending 0 0s nexus3-6f7b5858b9-wv6wq 0/1 ContainerCreating 0 0s nexus3-6f7b5858b9-wv6wq 0/1 Running 0 3s nexus3-6f7b5858b9-wv6wq 1/1 Running 0 4m4s nexus3-558f8fcd68-dmh5c 1/1 Terminating 0 10m nexus3-558f8fcd68-dmh5c 0/1 Terminating 0 10m nexus3-558f8fcd68-dmh5c 0/1 Terminating 0 10m nexus3-558f8fcd68-dmh5c 0/1 Terminating 0 10m ^C% ╰─ kubectl get pods NAME READY STATUS RESTARTS AGE nexus3-6f7b5858b9-wv6wq 1/1 Running 0 4m37s ``` The newer deployment faced no issues and the previous was terminated as expected, when the newer was available. Of course, this is not Openshift. I can't test it myself on Openshift, but we'll see if we can test it there as well. @ricardozanini @Kaitou786 if any of you could try replicating the issue on OCP that would be great :-)

ricardozanini commented 3 years ago

@LCaparelli I believe he also has a PV already populated with some data. In that case, Nexus will lock the data directory, preventing the rolling update.

If this is the case, we must change to "Recreate", since even for updates we won't be able to roll out with the data folder locked. Or, at the least, signal the server to unlock the data directory before performing the update.

LCaparelli commented 3 years ago

@ricardozanini So if I enable persistence I should run into this issue, right? Let me give that a swing

LCaparelli commented 3 years ago

Ah yes, indeed, I have reproduced the same issue. Simply using Recreate when persistence is enabled would do the trick, but it would bring availability issues for automatic updates with persistence. I'll give it some further thought; perhaps there's a way to deal with this without negative outcomes.

At the moment no action from you is requested @bszeti, thanks again for reporting it. :-)

Click to expand Let's enable persistence and go back to 2 GiB limit: ``` ╰─ echo "apiVersion: apps.m88i.io/v1alpha1 kind: Nexus metadata: name: nexus3 spec: networking: expose: false persistence: persistent: true replicas: 1 resources: limits: cpu: '2' memory: 2Gi requests: cpu: '1' memory: 2Gi useRedHatImage: false" | kubectl apply -f - && kubectl get pods -w nexus.apps.m88i.io/nexus3 configured NAME READY STATUS RESTARTS AGE nexus3-6f7b5858b9-wv6wq 0/1 Running 1 30m nexus3-7d7d54d566-5k96q 0/1 Pending 0 0s nexus3-7d7d54d566-5k96q 0/1 Pending 0 1s nexus3-7d7d54d566-5k96q 0/1 Pending 0 1s nexus3-7d7d54d566-5k96q 0/1 ContainerCreating 0 2s nexus3-7d7d54d566-5k96q 0/1 Running 0 19s nexus3-6f7b5858b9-wv6wq 1/1 Running 1 33m nexus3-7d7d54d566-5k96q 1/1 Running 0 4m19s nexus3-6f7b5858b9-wv6wq 1/1 Terminating 1 35m nexus3-6f7b5858b9-wv6wq 0/1 Terminating 1 35m nexus3-6f7b5858b9-wv6wq 0/1 Terminating 1 35m nexus3-6f7b5858b9-wv6wq 0/1 Terminating 1 35m ^C% ╰─ kubectl get pod NAME READY STATUS RESTARTS AGE nexus3-7d7d54d566-5k96q 1/1 Running 0 8m ``` And then change the memory limit: ``` ╰─ echo "apiVersion: apps.m88i.io/v1alpha1 kind: Nexus metadata: name: nexus3 spec: networking: expose: false persistence: persistent: true replicas: 1 resources: limits: cpu: '2' memory: 2512Mi # notice this changed requests: cpu: '1' memory: 2Gi useRedHatImage: false" | kubectl apply -f - && kubectl get pods -w nexus.apps.m88i.io/nexus3 configured NAME READY STATUS RESTARTS AGE nexus3-7d7d54d566-5k96q 1/1 Running 0 8m28s nexus3-54f857b97f-vqzrj 0/1 Pending 0 1s nexus3-54f857b97f-vqzrj 0/1 Pending 0 1s nexus3-54f857b97f-vqzrj 0/1 ContainerCreating 0 1s nexus3-54f857b97f-vqzrj 0/1 Running 0 4s nexus3-54f857b97f-vqzrj 0/1 Completed 0 6s nexus3-54f857b97f-vqzrj 0/1 Running 1 7s nexus3-54f857b97f-vqzrj 0/1 Completed 1 8s nexus3-54f857b97f-vqzrj 0/1 CrashLoopBackOff 1 16s ^C% ╰─ kubectl describe pod nexus3-54f857b97f-vqzrj Name: nexus3-54f857b97f-vqzrj Namespace: default Priority: 0 Node: minikube/172.17.0.2 Start Time: Tue, 01 Dec 2020 12:28:26 -0300 Labels: app=nexus3 pod-template-hash=54f857b97f Annotations: Status: Running IP: 172.18.0.8 IPs: IP: 172.18.0.8 Controlled By: ReplicaSet/nexus3-54f857b97f Containers: nexus-server: Container ID: docker://5f230d740cf43a90c0f6d6c88d5b008f57d5c0c29230e127aa90304a9b6f2a5f Image: docker.io/sonatype/nexus3:3.28.1 Image ID: docker-pullable://sonatype/nexus3@sha256:e788154207df95a86287fc8ecae1a5f789e74d0124f2dbda846fc4c769603bdb Port: 8081/TCP Host Port: 0/TCP State: Waiting Reason: CrashLoopBackOff Last State: Terminated Reason: Completed Exit Code: 0 Started: Tue, 01 Dec 2020 12:28:31 -0300 Finished: Tue, 01 Dec 2020 12:28:32 -0300 Ready: False Restart Count: 1 Limits: cpu: 2 memory: 2512Mi Requests: cpu: 1 memory: 2Gi Liveness: http-get http://:8081/service/rest/v1/status delay=240s timeout=15s period=10s #success=1 #failure=3 Readiness: http-get http://:8081/service/rest/v1/status delay=240s timeout=15s period=10s #success=1 #failure=3 Environment: INSTALL4J_ADD_VM_PARAMS: -Djava.util.prefs.userRoot=${NEXUS_DATA}/javaprefs -Dnexus.security.randompassword=false -XX:MaxDirectMemorySize=2635m -Xms2108m -Xmx2108m Mounts: /nexus-data from nexus3-data (rw) /var/run/secrets/kubernetes.io/serviceaccount from nexus3-token-cktvh (ro) Conditions: Type Status Initialized True Ready False ContainersReady False PodScheduled True Volumes: nexus3-data: Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace) ClaimName: nexus3 ReadOnly: false 
nexus3-token-cktvh: Type: Secret (a volume populated by a Secret) SecretName: nexus3-token-cktvh Optional: false QoS Class: Burstable Node-Selectors: Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s node.kubernetes.io/unreachable:NoExecute for 300s Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal Scheduled Successfully assigned default/nexus3-54f857b97f-vqzrj to minikube Normal Pulled 21s (x2 over 24s) kubelet, minikube Container image "docker.io/sonatype/nexus3:3.28.1" already present on machine Normal Created 21s (x2 over 24s) kubelet, minikube Created container nexus-server Normal Started 21s (x2 over 24s) kubelet, minikube Started container nexus-server Warning BackOff 11s (x2 over 19s) kubelet, minikube Back-off restarting failed container ```

bszeti commented 3 years ago

Hi, thanks for looking into this.

Yes, of course the issue only shows up if you use persistence. Nexus has a lock file, so the new Pod can't start while the old one is still running. (By the way, isn't this a problem if the number of Nexus replicas is greater than one?)

Install operator:

```
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: nexus
spec:
  targetNamespaces:
  - nexus
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nexus-operator-m88i
spec:
  channel: alpha
  name: nexus-operator-m88i
  source: community-operators
  sourceNamespace: openshift-marketplace
```
Install Nexus:

```
apiVersion: apps.m88i.io/v1alpha1
kind: Nexus
metadata:
  name: nexus3
spec:
  resources:
    limits:
      cpu: '2'
      memory: 2Gi
    requests:
      cpu: 1000m
      memory: 2Gi
  useRedHatImage: true
  serverOperations:
    disableOperatorUserCreation: false
  imagePullPolicy: Always
  networking:
    expose: true
    exposeAs: Route
    tls:
      mandatory: true
  replicas: 1
  persistence:
    persistent: true
    volumeSize: 10Gi
```

ricardozanini commented 3 years ago

Hi @bszeti, yes, it's a problem: horizontal scaling is not really supported at this time, until we implement #61. I'll also take a look into the Nexus documentation to see if I can figure out another workaround for this problem.
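
Until that lands, the safe configuration with persistence enabled is a single replica, since every replica of the Deployment mounts the same PersistentVolumeClaim and would contend for the same lock file. This is only a sketch of the relevant part of the CR (field names taken from the examples above), not a rule the operator currently enforces:

```yaml
# Sketch only: with persistence enabled, all replicas share one PVC, so any
# additional replica would hit the same /nexus-data/lock conflict.
apiVersion: apps.m88i.io/v1alpha1
kind: Nexus
metadata:
  name: nexus3
spec:
  replicas: 1          # keep a single replica while persistence is on
  persistence:
    persistent: true
    volumeSize: 10Gi
```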