suxess-it / kubriX

https://kubrix.io

[argocd] virt-launcher pod stays progressing #92

Open jkleinlercher opened 7 months ago

jkleinlercher commented 7 months ago

virt-launcher pods stay in the Progressing state in Argo CD: https://argocd-metalstack.platform-engineer.cloud/applications/argocd/m-qa?view=tree&resource=&node=%2FPod%2Fqa-demo-kubevirt%2Fvirt-launcher-m-zktvn%2F0

jkleinlercher commented 7 months ago

Next step: compare how VirtualMachine pods behave with openshift-gitops on OpenShift.

phac008 commented 7 months ago

Same situation on OpenShift when the VM is applied via GitOps.

Investigating...

phac008 commented 7 months ago

OpenShift: because the conditions contained live migration messages, I changed the HCO settings evictionStrategy and workloadUpdateStrategy (https://github.com/kubevirt/hyperconverged-cluster-operator/blob/main/docs/cluster-configuration.md).

The pod is still Progressing.

phac008 commented 7 months ago

https://github.com/argoproj/argo-cd/issues/7175 -> code merged, no solution

jkleinlercher commented 7 months ago

Looks like we have the same behaviour as described in https://github.com/argoproj/argo-cd/issues/15317

Also, pods created from a KubeVirt VirtualMachineInstance have "restartPolicy: Never" set. I don't know whether we can change that or what the consequences would be. It seems to be hardcoded in https://github.com/kubevirt/kubevirt/blob/ea53cc9d444227a033c55d521979e6ccc688456f/pkg/virt-controller/services/template.go#L583

kubectl get pods -n qa-demo-kubevirt -o yaml | grep restart
    restartPolicy: Never

jkleinlercher commented 7 months ago

With that said, as long as the application state is healthy, maybe this issue is not that important?

jkleinlercher commented 7 months ago

opened kubevirt issue https://github.com/kubevirt/kubevirt/issues/11813

jkleinlercher commented 7 months ago

In the meantime we can try to write a Lua health script, scoped very narrowly to virt-launcher pods with restartPolicy Never, to avoid side effects on other pods and jobs.
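
To make the scoping idea concrete, here is a minimal Python sketch of the decision such a script would have to make. It is only a model: the real check would be written in Lua for Argo CD. Two things are assumptions and do not come from this thread: that virt-launcher pods carry the label kubevirt.io=virt-launcher, and that a pod which is Running and Ready should count as Healthy. The names virt_launcher_health and is_ready are hypothetical.

```python
from typing import Optional

# Assumption: virt-launcher pods are labelled kubevirt.io=virt-launcher.
VIRT_LAUNCHER_LABEL = ("kubevirt.io", "virt-launcher")


def is_ready(pod: dict) -> bool:
    """True if the pod reports a Ready condition with status "True"."""
    conditions = (pod.get("status") or {}).get("conditions") or []
    return any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in conditions
    )


def virt_launcher_health(pod: dict) -> Optional[str]:
    """Return "Healthy" for a running, ready virt-launcher pod.

    Returns None for every other pod, meaning "not handled here", so that
    other pods and jobs are not affected by this special case.
    """
    labels = (pod.get("metadata") or {}).get("labels") or {}
    key, value = VIRT_LAUNCHER_LABEL
    if labels.get(key) != value:
        return None  # not a virt-launcher pod
    if (pod.get("spec") or {}).get("restartPolicy") != "Never":
        return None  # only the restartPolicy: Never case is special-cased
    status = pod.get("status") or {}
    if status.get("phase") == "Running" and is_ready(pod):
        return "Healthy"
    return None
```

Everything that does not match the label and restart policy is deliberately left alone. How "not handled" would be expressed in an actual Lua health check is left open here.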

jkleinlercher commented 7 months ago

The problem is that the health logic for pods is quite complex: https://github.com/argoproj/gitops-engine/blob/fbecbb86e41254a75a59943b5eb43ed55d21cdb9/pkg/health/health_pod.go#L29

On Slack I found someone who also tried to add some health logic for a Deployment without rewriting the whole health check, see https://cloud-native.slack.com/archives/D0720GKMCS1/p1714978483287869. Maybe he has some tips on how to write it.
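
For orientation, here is a simplified Python model of the part of that pod health check that seems to matter here, as I read the linked health_pod.go: the branch for pods in the Running phase. It is a sketch, not the real implementation. It ignores the other phases and the container waiting-state checks, and the function name running_pod_health is hypothetical.

```python
def running_pod_health(pod: dict) -> str:
    """Simplified model of the health status for a pod in phase Running."""
    spec = pod.get("spec") or {}
    status = pod.get("status") or {}
    restart_policy = spec.get("restartPolicy", "Always")

    if restart_policy == "Always":
        # A ready pod with restartPolicy Always is healthy.
        conditions = status.get("conditions") or []
        if any(c.get("type") == "Ready" and c.get("status") == "True"
               for c in conditions):
            return "Healthy"
        # Not ready and a container terminated before: degraded.
        for cs in status.get("containerStatuses") or []:
            if (cs.get("lastState") or {}).get("terminated"):
                return "Degraded"
        # Otherwise the pod is still on its way to becoming ready.
        return "Progressing"

    # restartPolicy OnFailure or Never: the pod is treated as having a
    # finite life (like a hook or job), so it is reported as Progressing
    # for as long as it runs, even when it is ready.
    return "Progressing"
```

In this model a virt-launcher pod with restartPolicy Never can never reach Healthy while it is running, which matches what we see in the Argo CD tree view.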