utkuozdemir / pv-migrate

CLI tool to easily migrate Kubernetes persistent volumes
Apache License 2.0

Receiving "Deployment is not ready" error while the deployment is ready actually #264

Open MurzNN opened 9 months ago

MurzNN commented 9 months ago

Describe the bug

When I start pv-migrate, it creates the deployment, but in the debug log I see errors like:

🚁 Attempting strategy: lbsvc
🔑 Generating SSH key pair
creating 4 resource(s)
beginning wait for 4 resources with timeout of 1m0s
Deployment is not ready: korepov/pv-migrate-dbabc-src-sshd. 0 out of 1 expected pods are ready
Deployment is not ready: korepov/pv-migrate-dbabc-src-sshd. 0 out of 1 expected pods are ready
Deployment is not ready: korepov/pv-migrate-dbabc-src-sshd. 0 out of 1 expected pods are ready
Deployment is not ready: korepov/pv-migrate-dbabc-src-sshd. 0 out of 1 expected pods are ready

But at the same time, via kubectl I see that the deployment is ready:

$ kubectl -n korepov get deployment pv-migrate-dbabc-src-sshd 
NAME                        READY   UP-TO-DATE   AVAILABLE   AGE
pv-migrate-dbabc-src-sshd   1/1     1            1           43s

The log level is debug, and no additional messages were displayed.

So, any ideas on what can cause this problem?

How can I enable more verbose logging to understand what's happening and why it is not detecting the ready status?

Console output

🚁 Attempting strategy: lbsvc
🔑 Generating SSH key pair
creating 4 resource(s)
beginning wait for 4 resources with timeout of 1m0s
Deployment is not ready: korepov/pv-migrate-dbabc-src-sshd. 0 out of 1 expected pods are ready
Deployment is not ready: korepov/pv-migrate-dbabc-src-sshd. 0 out of 1 expected pods are ready
Deployment is not ready: korepov/pv-migrate-dbabc-src-sshd. 0 out of 1 expected pods are ready
Deployment is not ready: korepov/pv-migrate-dbabc-src-sshd. 0 out of 1 expected pods are ready
🧹 Cleaning up
uninstall: Deleting pv-migrate-dbabc-src
uninstall: given cascade value: , defaulting to delete propagation background
Starting delete for "pv-migrate-dbabc-src-sshd" Service
Starting delete for "pv-migrate-dbabc-src-sshd" Deployment
Starting delete for "pv-migrate-dbabc-src-sshd" Secret
Starting delete for "pv-migrate-dbabc-src-sshd" ServiceAccount
beginning wait for 4 resources to be deleted with timeout of 1m0s
purge requested for pv-migrate-dbabc-src
✨ Cleanup done
🔶 Migration failed with this strategy, will try with the remaining strategies
Error: migration failed: all strategies failed for this migration

 - Source and destination Kubernetes versions: source - `v1.25.6`, destination - `v1.27.7`
 - Source and destination container runtimes: source - `containerd://1.6.15`, destination - `containerd://1.7.5`
 - pv-migrate version 1.7.1 (commit: 1affa11b175d20969b9d6f2879c09dc94f0b4a0f) (build date: 2023-10-09T21:56:55Z)
 - Installation method: krew
 - Source and destination PVC type, size and accessModes: `ReadWriteMany, csi-cephfs-sc, 2G -> ReadWriteMany, local-path, 2G`

MurzNN commented 9 months ago

And here is the output for all resources related to the process, captured while I see the "Deployment is not ready" error:

$ kubectl -n korepov get all | grep pv-migrate
pod/pv-migrate-dbddb-src-sshd-cf79c787-d2nph   1/1     Running   0               18s
service/pv-migrate-dbddb-src-sshd    NodePort     <none>        22:32148/TCP                 20s
deployment.apps/pv-migrate-dbddb-src-sshd   1/1     1            1           19s
replicaset.apps/pv-migrate-dbddb-src-sshd-cf79c787   1         1         1       19s

utkuozdemir commented 9 months ago

This looks like a bug, I'll have a look. You can get more info with --log-level=debug --log-format=json, but I'm not sure it's going to help here.

MurzNN commented 9 months ago

Thanks! I already have --log-level=debug, and --log-format=json just adds more garbage to the output without any new useful information ;) Maybe you can explain how to debug this on my side? Then I will share more debugging information with you.

utkuozdemir commented 9 months ago

I had a look and noticed that this error comes from Helm's wait logic, not from our code. So I would try passing --skip-cleanup and troubleshooting it with the helm CLI to find out why it does not report as ready. You can try:

helm ls -a
helm status <name-of-the-release>

Also, note that for lbsvc, Helm would wait for the created Service to actually get an external IP (not pending). This could be the problem.
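
To make the two conditions above easier to check by hand, here is a rough Python sketch. It is not Helm's actual code, only an approximation of the kind of checks described: it reads objects as returned by `kubectl get ... -o json` and reports the ready/desired replica counts of a Deployment and whether a Service could still be waiting for an external address. The function names `deployment_ready_counts` and `service_wait_blocker` are hypothetical, and the exact rules Helm applies are an assumption here.

```python
from typing import Optional, Tuple


def deployment_ready_counts(deployment: dict) -> Tuple[int, int]:
    """Return (ready, desired) replica counts from a Deployment object.

    Uses the standard .status.readyReplicas and .spec.replicas fields;
    a missing readyReplicas is treated as 0, a missing replicas as 1.
    """
    desired = deployment.get("spec", {}).get("replicas", 1)
    ready = deployment.get("status", {}).get("readyReplicas", 0)
    return ready, desired


def service_wait_blocker(service: dict) -> Optional[str]:
    """Return a reason the Service may still count as pending, else None.

    Assumption: only LoadBalancer Services need an external address;
    NodePort and ClusterIP Services only need a cluster IP.
    """
    spec = service.get("spec", {})
    svc_type = spec.get("type", "ClusterIP")
    if svc_type == "ExternalName":
        return None
    if not spec.get("clusterIP"):
        return "no cluster IP assigned"
    if svc_type == "LoadBalancer":
        if spec.get("externalIPs"):
            return None
        status = service.get("status", {})
        if not status.get("loadBalancer", {}).get("ingress"):
            return "LoadBalancer has no external address yet"
    return None
```

Feeding it the JSON of the sshd Deployment and Service while the wait is running shows which of the two conditions, if any, is unmet at that moment.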

MurzNN commented 9 months ago

Tested. Even without --skip-cleanup, Helm shows the release as deployed, while in the terminal I keep seeing these lines:

Deployment is not ready: korepov/pv-migrate-dcada-src-sshd. 0 out of 1 expected pods are ready

Here is the output of helm:

$ helm status pv-migrate-dcada-src
NAME: pv-migrate-dcada-src
LAST DEPLOYED: Wed Dec 13 15:12:42 2023
NAMESPACE: korepov
STATUS: deployed

MurzNN commented 9 months ago

It seems this problem is related to the NodePort service type. I can't test it with the LoadBalancer type because no free IPs are available for it on the source cluster.

But I tested on the destination cluster (just copying the data back), and with LoadBalancer it works well, while with NodePort I'm receiving the same error.

While the pv-migrate waits for readiness, I see the Service in the active state, here are the details:

$ kubectl describe service pv-migrate-bdaea-src-sshd
Name:                     pv-migrate-bdaea-src-sshd
Namespace:                korepov-pro-dev
Labels:                   app.kubernetes.io/component=sshd
Annotations:              meta.helm.sh/release-name: pv-migrate-bdaea-src
                          meta.helm.sh/release-namespace: korepov-pro-dev
Selector:                 app.kubernetes.io/component=sshd,app.kubernetes.io/instance=pv-migrate-bdaea-src,app.kubernetes.io/name=pv-migrate
Type:                     NodePort
IP Family Policy:         SingleStack
IP Families:              IPv4
Port:                     ssh  22/TCP
TargetPort:               22/TCP
NodePort:                 ssh  31784/TCP
Session Affinity:         None
External Traffic Policy:  Cluster
Events:                   <none>

And I can connect to this node port on the source cluster from the destination cluster (using the external IP of any node) via telnet:

# telnet 31784
Connected to
Escape character is '^]'.

So, the network connection is not a problem.

So, could you please describe what exactly it is waiting for? And maybe add more verbose debug logging to catch it?

MurzNN commented 9 months ago

Also, specifying the source node IP address explicitly using --dest-host-override doesn't help either.

MurzNN commented 9 months ago

And it would be good to add the Helm chart deployment status to the debug logs: at least the output of helm status, but better also the pod and service status.
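
Until something like that exists in pv-migrate itself, the same information can be gathered by hand while the wait is running. Below is a minimal Python sketch, assuming `helm` and `kubectl` are on the PATH and that the chart's resources carry the `app.kubernetes.io/instance=<release>` label seen in the Service selector earlier in this thread. The helper names `diagnostic_commands` and `collect_diagnostics` are hypothetical and not part of pv-migrate.

```python
import subprocess
from typing import List


def diagnostic_commands(namespace: str, release: str) -> List[List[str]]:
    """Build the helm/kubectl commands that describe a release's state."""
    selector = f"app.kubernetes.io/instance={release}"
    return [
        ["helm", "status", release, "--namespace", namespace],
        ["kubectl", "--namespace", namespace, "get",
         "deployment,replicaset,pod,service",
         "--selector", selector, "--output", "wide"],
        ["kubectl", "--namespace", namespace, "get", "events",
         "--sort-by=.lastTimestamp"],
    ]


def collect_diagnostics(namespace: str, release: str) -> str:
    """Run each command and return its output as one text report."""
    chunks = []
    for argv in diagnostic_commands(namespace, release):
        try:
            result = subprocess.run(argv, capture_output=True, text=True, check=False)
            output = result.stdout + result.stderr
        except FileNotFoundError:
            # The CLI is not installed; record that instead of failing.
            output = f"{argv[0]} not found on PATH\n"
        chunks.append("$ " + " ".join(argv) + "\n" + output)
    return "\n".join(chunks)
```

For example, `print(collect_diagnostics("korepov", "pv-migrate-dcada-src"))` run in a second terminal during the wait (ideally together with --skip-cleanup) would capture the release, workload, and event state in one report that can be pasted into the issue.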