schrej / podmigration-operator

Part of a PoC of Pod migration in Kubernetes
Apache License 2.0

a YAML file #2

Closed · mind-mind closed this issue 3 years ago

mind-mind commented 3 years ago

Hi again,

I still have an issue with a YAML file. I can test pod migration with your YAML file and it works without any problem, but my own YAML file fails.

One difference I've noticed: your YAML file doesn't set a command, while mine does. I'm not sure whether that's the cause or something else.

I also understood that using volumes in the YAML file requires some extra configuration, but I'm not using volumes, and it still doesn't work.

After migrating, the destination pod ends up with status "CrashLoopBackOff". Any idea about this?
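
For comparison, here is a minimal sketch of a pod manifest with an explicit command. The pod name, container name, and label are taken from the logs later in this thread; the image and the counting loop are illustrative assumptions, not the actual file under discussion.

```yaml
# Hypothetical minimal pod with an explicit command. The names come
# from the logs below; the image and the loop are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: simple
  labels:
    podmig: dcn
spec:
  containers:
  - name: count
    image: busybox
    # The explicit command/args pair is the difference being tested here.
    command: ["/bin/sh", "-c"]
    args: ["i=0; while true; do echo $i; i=$((i+1)); sleep 1; done"]
```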

mind-mind commented 3 years ago

I tried your YAML file 1.yaml and it has the same problem. I suspect the command might be the issue; I'll try another YAML file without a command to verify.

schrej commented 3 years ago

Using commands did work in my tests. As far as I remember, all files in /evaluation worked fine. It might be tty: true or the security context that causes the migration to fail. Have you checked the kubelet logs on both machines?
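
For reference, one way to follow the kubelet logs on each node, assuming a systemd-managed kubelet as shown in the status output further down this thread:

```sh
# Follow the kubelet logs live on a node (systemd-managed kubelet):
journalctl -u kubelet -f
```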

mind-mind commented 3 years ago

I haven't checked the logs yet. I'll check and come back. I think I can figure it out from there. Thank you.

mind-mind commented 3 years ago

> Using commands did work in my tests. As far as I remember, all files in /evaluation worked fine. It might be tty: true or the security context that causes the migration to fail. Have you checked the kubelet logs on both machines?

I tested simple.yaml from /evaluation, but got this:

error: unable to recognize "simple.yaml": no matches for kind "MigratingPod" in version "podmig.schrej.net/v1"

I guess I need to install something to use MigratingPod, right?
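
That error means the cluster doesn't recognize the custom resource kind yet, i.e. the CRDs aren't registered. In a kubebuilder-style operator like this one they are normally installed from the repository checkout; the exact make target and CRD path below follow kubebuilder conventions and are assumptions about this repo.

```sh
# Register the operator's CRDs with the cluster. `make install` is the
# conventional kubebuilder target; the CRD directory is the usual
# generated layout and is an assumption about this repository.
make install
# or apply the generated CRD manifests directly:
kubectl apply -f config/crd/bases/
```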

mind-mind commented 3 years ago

I already checked the kubelet logs on both machines.

To me they look normal, but it still doesn't work.

This is with the 1.yaml file. 2.yaml had no problem on my side before, but now it shows the same problem too.

The source pod's logs show the counter output: 1 2 3 4 .. 16

Here is the kubelet log on the destination node:


failed to try resolving symlinks in path "/var/log/pods/default_simple-migration-28_4212e1bd-24a0-4c6f-9c0e-054d3561fef1/count/6.log": lstat /var/log/pods/default_simple-migration-28_4212e1bd-24a0-4c6f-9c0e-054d3561fef1/count/6.log: no such file or directory

Worker 1

● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: active (running) since Wed 2021-05-26 14:41:44 UTC; 56min ago
     Docs: https://kubernetes.io/docs/home/
 Main PID: 5878 (kubelet)
    Tasks: 14 (limit: 1140)
   CGroup: /system.slice/kubelet.service
           └─5878 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/e

May 26 15:37:09 w1 kubelet[5878]: I0526 15:37:09.533550    5878 kuberuntime_manager.go:841] Should we migrate?Pe
May 26 15:37:15 w1 kubelet[5878]: I0526 15:37:15.663664    5878 kuberuntime_manager.go:841] Should we migrate?Pe
May 26 15:37:16 w1 kubelet[5878]: I0526 15:37:16.668478    5878 kuberuntime_manager.go:841] Should we migrate?Pe
May 26 15:37:22 w1 kubelet[5878]: I0526 15:37:22.458500    5878 kuberuntime_manager.go:841] Should we migrate?Ru
May 26 15:37:40 w1 kubelet[5878]: I0526 15:37:40.961927    5878 kuberuntime_manager.go:841] Should we migrate?Ru
May 26 15:37:41 w1 kubelet[5878]: I0526 15:37:41.696242    5878 kubelet.go:1505] Checkpoint the firstime running
May 26 15:37:41 w1 kubelet[5878]: E0526 15:37:41.696913    5878 remote_runtime.go:289] CheckpointContainer "8390
May 26 15:37:41 w1 kubelet[5878]: I0526 15:37:41.697614    5878 kuberuntime_manager.go:841] Should we migrate?Ru
May 26 15:37:48 w1 kubelet[5878]: I0526 15:37:48.216159    5878 kubelet.go:1505] Checkpoint the firstime running
May 26 15:37:48 w1 kubelet[5878]: E0526 15:37:48.217212    5878 remote_runtime.go:289] CheckpointContainer "8390

Worker 2

● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: active (running) since Wed 2021-05-26 14:41:43 UTC; 56min ago
     Docs: https://kubernetes.io/docs/home/
 Main PID: 5993 (kubelet)
    Tasks: 13 (limit: 1140)
   CGroup: /system.slice/kubelet.service
           └─5993 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfi

May 26 15:37:41 w2 kubelet[5993]: W0526 15:37:41.058713    5993 watcher.go:87] Error while processing event 
May 26 15:37:41 w2 kubelet[5993]: W0526 15:37:41.060322    5993 watcher.go:87] Error while processing event 
May 26 15:37:41 w2 kubelet[5993]: W0526 15:37:41.060554    5993 watcher.go:87] Error while processing event 
May 26 15:37:41 w2 kubelet[5993]: W0526 15:37:41.060765    5993 watcher.go:87] Error while processing event 
May 26 15:37:41 w2 kubelet[5993]: I0526 15:37:41.565270    5993 kuberuntime_manager.go:841] Should we migrat
May 26 15:37:47 w2 kubelet[5993]: E0526 15:37:47.309993    5993 remote_runtime.go:306] RestoreContainer "b3a
May 26 15:37:47 w2 kubelet[5993]: I0526 15:37:47.915065    5993 topology_manager.go:219] [topologymanager] R
May 26 15:37:47 w2 kubelet[5993]: I0526 15:37:47.915884    5993 kuberuntime_manager.go:841] Should we migrat
May 26 15:37:49 w2 kubelet[5993]: E0526 15:37:49.964690    5993 remote_runtime.go:306] RestoreContainer "2b3
May 26 15:37:50 w2 kubelet[5993]: I0526 15:37:50.615102    5993 kuberuntime_manager.go:841] Should we migrat

But the controller output looks fine.


make run

2021-05-26T15:25:27.781Z    INFO    controller-runtime.metrics  metrics server is starting to listen    {"addr": ":8081"}
2021-05-26T15:25:27.782Z    INFO    setup   starting manager
2021-05-26T15:25:27.883Z    INFO    controller-runtime.manager  starting metrics server    {"path": "/metrics"}
2021-05-26T15:25:27.883Z    INFO    controller  Starting EventSource    {"reconcilerGroup": "podmig.dcn.ssu.ac.kr", "reconcilerKind": "Podmigration", "controller": "podmigration", "source": "kind source: /, Kind="}
2021-05-26T15:25:27.984Z    INFO    controller  Starting Controller {"reconcilerGroup": "podmig.dcn.ssu.ac.kr", "reconcilerKind": "Podmigration", "controller": "podmigration"}
2021-05-26T15:25:27.984Z    INFO    controller  Starting workers    {"reconcilerGroup": "podmig.dcn.ssu.ac.kr", "reconcilerKind": "Podmigration", "controller": "podmigration", "worker count": 1}
2021-05-26T15:37:40.947Z    INFO    controllers.Podmigration        {"podmigration": "default/simple-migration-controller-18", "print test": {"sourcePod":"simple","destHost":"w2","selector":{"matchLabels":{"podmig":"dcn"}},"template":{"metadata":{"creationTimestamp":null},"spec":{"containers":[]}},"action":"live-migration"}}
2021-05-26T15:37:40.949Z    INFO    controllers.Podmigration        {"podmigration": "default/simple-migration-controller-18", "annotations ": ""}
2021-05-26T15:37:40.949Z    INFO    controllers.Podmigration        {"podmigration": "default/simple-migration-controller-18", "number of existing pod ": 0}
2021-05-26T15:37:40.949Z    INFO    controllers.Podmigration        {"podmigration": "default/simple-migration-controller-18", "desired pod ": {"namespace": "default", "name": ""}}
2021-05-26T15:37:40.949Z    INFO    controllers.Podmigration        {"podmigration": "default/simple-migration-controller-18", "number of desired pod ": 0}
2021-05-26T15:37:40.950Z    INFO    controllers.Podmigration        {"podmigration": "default/simple-migration-controller-18", "number of actual running pod ": 0}
2021-05-26T15:37:40.974Z    INFO    controllers.Podmigration        {"podmigration": "default/simple-migration-controller-18", "Live-migration": "Step 1 - Check source pod is exist or not - completed"}
2021-05-26T15:37:40.974Z    INFO    controllers.Podmigration        {"podmigration": "default/simple-migration-controller-18", "sourcePod ok ": {"apiVersion": "v1", "kind": "Pod", "namespace": "default", "name": "simple"}}
2021-05-26T15:37:40.974Z    INFO    controllers.Podmigration        {"podmigration": "default/simple-migration-controller-18", "sourcePod status ": "Running"}
2021-05-26T15:37:40.981Z    INFO    controllers.Podmigration        {"podmigration": "default/simple-migration-controller-18", "Live-migration": "Step 2 - checkpoint source Pod - completed"}
2021-05-26T15:37:40.981Z    INFO    controllers.Podmigration        {"podmigration": "default/simple-migration-controller-18", "live-migration pod": "count"}
2021-05-26T15:37:40.981Z    INFO    controllers.Podmigration        {"podmigration": "default/simple-migration-controller-18", "Live-migration": "checkpointPath/var/lib/kubelet/migration/kkk/simple"}
2021-05-26T15:37:40.981Z    INFO    controllers.Podmigration        {"podmigration": "default/simple-migration-controller-18", "Live-migration": "Step 3 - Wait until checkpoint info are created - completed"}
2021-05-26T15:37:40.988Z    INFO    controllers.Podmigration        {"podmigration": "default/simple-migration-controller-18", "Live-migration": "Step 4 - Restore destPod from sourcePod's checkpointed info - completed"}
2021-05-26T15:37:48.210Z    INFO    controllers.Podmigration        {"podmigration": "default/simple-migration-controller-18", "Live-migration": "Step 4.1 - Check whether if newPod is Running or not - completedsimple-migration-28Running"}
2021-05-26T15:37:48.210Z    INFO    controllers.Podmigration        {"podmigration": "default/simple-migration-controller-18", "Live-migration": "Step 4.1 - Check whether if newPod is Running or not - completed"}
2021-05-26T15:37:48.216Z    INFO    controllers.Podmigration        {"podmigration": "default/simple-migration-controller-18", "Live-migration": "Step 6 - Delete the source pod - completed"}
2021-05-26T15:37:48.216Z    DEBUG   controller  Successfully Reconciled {"reconcilerGroup": "podmig.dcn.ssu.ac.kr", "reconcilerKind": "Podmigration", "controller": "podmigration", "name": "simple-migration-controller-18", "namespace": "default"}
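
For reference, the spec printed at the top of this controller log corresponds roughly to a custom resource like the following. The group, kind, and field values are read off the log lines above; the v1 version and the exact YAML layout are assumptions.

```yaml
# Rough reconstruction of the Podmigration resource behind these logs;
# values come from the controller output above, layout is an assumption.
apiVersion: podmig.dcn.ssu.ac.kr/v1
kind: Podmigration
metadata:
  name: simple-migration-controller-18
spec:
  action: live-migration
  sourcePod: simple
  destHost: w2
  selector:
    matchLabels:
      podmig: dcn
```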

go run ./api-server/cmd/main.go

2021-05-26T15:25:43.176Z    INFO    podmigration-cp.run starting api-server manager
2021-05-26T15:25:43.177Z    INFO    api-server  Starting api-server {"interface": "0.0.0.0", "port": ":5000"}
&{simple-migration-controller-18 w2 0 &LabelSelector{MatchLabels:map[string]string{podmig: dcn,},MatchExpressions:[]LabelSelectorRequirement{},} live-migration  simple {{      0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] []  []} {[] [] [] []  <nil> <nil>  map[]   <nil>  false false false <nil> nil []   nil  [] []  <nil> nil [] <nil> <nil> <nil> map[] [] <nil> }} <nil>}
simple
mind-mind commented 3 years ago

I found the problem now. Thank you!

schrej commented 3 years ago

Oh sorry, I already had an answer half typed but got distracted. Glad that you've solved it in the meantime!

mind-mind commented 3 years ago

Well, actually I ran into another problem. After fixing this, I got:


May 28 06:17:29 w2 kubelet[7436]: E0528 06:17:29.576763  7436 remote_runtime.go:306] RestoreContainer "090010a838376a329cfe2668559c46ab1d2a64306108a75f50daec136ea7efe0" from runtime service failed: rpc error: code = Unknown desc = failed to restore container: failed to start containerd task "090010a838376a329cfe2668559c46ab1d2a64306108a75f50daec136ea7efe0": OCI runtime restore failed: open /var/lib/kubelet/migration/ooo/video/vlc/descriptors.json: no such file or directory: unknown
May 28 06:17:30 w2 kubelet[7436]: I0528 06:17:30.452018  7436 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: 6b2d7355cd9271dec30c75d75b344edf519f11c04d6da69f4e15142daeaac79b
May 28 06:17:30 w2 kubelet[7436]: I0528 06:17:30.452769  7436 topology_manager.go:219] [topologymanager] RemoveContainer - Container ID: 090010a838376a329cfe2668559c46ab1d2a64306108a75f50daec136ea7efe0
May 28 06:17:30 w2 kubelet[7436]: I0528 06:17:30.453081  7436 kuberuntime_manager.go:841] Should we migrate?Runningtrue
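
The restore error above is literal: runc cannot open the checkpoint files at that path on the destination node. A quick way to confirm whether the checkpoint data actually reached that node (the path is copied verbatim from the error; nothing else is assumed):

```sh
# Check whether the checkpoint directory exists on the destination node;
# the path is taken from the RestoreContainer error above.
ls -l /var/lib/kubelet/migration/ooo/video/vlc/
```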
mind-mind commented 3 years ago

Also, I talked with Tuong. He recommends checkpointing with

$ kubectl checkpoint simple /var/lib/kubelet/migration/xxx

and I got this:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x122f8da]

goroutine 1 [running]:
k8s.io/client-go/kubernetes.NewForConfig(0x0, 0x0, 0x14f5141, 0x58)
    /home/ubuntu/kubernetes/staging/src/k8s.io/client-go/kubernetes/clientset.go:371 +0x3a
main.(*MigrateArgs).Run(0xc000361230, 0xc00035ea00, 0xc000355020)
    /home/ubuntu/podmigration-operator/kubectl-plugin/checkpoint-command/checkpoint_command.go:88 +0x73
main.NewPluginCmd.func1(0xc00035ea00, 0xc000355020, 0x2, 0x2)
    /home/ubuntu/podmigration-operator/kubectl-plugin/checkpoint-command/checkpoint_command.go:61 +0xd3
github.com/spf13/cobra.(*Command).execute(0xc00035ea00, 0xc000114160, 0x2, 0x2, 0xc00035ea00, 0xc000114160)
    /home/ubuntu/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830 +0x2c2
github.com/spf13/cobra.(*Command).ExecuteC(0xc00035ea00, 0x0, 0xffffffff, 0xc000102058)
    /home/ubuntu/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914 +0x30b
github.com/spf13/cobra.(*Command).Execute(...)
    /home/ubuntu/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864
main.main()
    /home/ubuntu/podmigration-operator/kubectl-plugin/checkpoint-command/checkpoint_command.go:130 +0x2a

I don't know what I should fix now.
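
The trace shows NewForConfig being called with a nil *rest.Config (the 0x0 first argument), which usually means the plugin failed to load a kubeconfig before building the clientset. One thing to try, assuming the plugin honors the standard KUBECONFIG environment variable (an assumption; it may take a flag instead):

```sh
# Point the plugin at a valid kubeconfig before running it; the path
# below is the usual default location and is an assumption here.
export KUBECONFIG=$HOME/.kube/config
kubectl checkpoint simple /var/lib/kubelet/migration/xxx
```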

May 28 14:49:35 w1 kubelet[5195]: E0528 14:49:35.089879  5195 remote_runtime.go:289] CheckpointContainer "b0541936954521367fdcd022b54e9e44e2350469daf549616841bbf2263173c5" from runtime service failed: rpc error: code = Unknown desc = failed to checkpoint container: /usr/local/bin/runc did not terminate sucessfully: criu failed: type NOTIFY errno 0 path= /run/containerd/io.containerd.runtime.v1.linux/k8s.io/b0541936954521367fdcd022b54e9e44e2350469daf549616841bbf2263173c5/criu-dump.log: unknown

Seems like a problem with CRIU.
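
Two things that often narrow down a CRIU failure like this, with the dump-log path copied from the error message above; `criu check` is a stock CRIU command:

```sh
# Read the CRIU dump log that runc points at (path from the error above):
sudo cat /run/containerd/io.containerd.runtime.v1.linux/k8s.io/b0541936954521367fdcd022b54e9e44e2350469daf549616841bbf2263173c5/criu-dump.log

# Verify that the kernel supports everything CRIU needs:
sudo criu check
```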

mind-mind commented 3 years ago

I don't know what to say. I think Ubuntu still has a problem with CRIU; I don't know why it works for you. I tested your work on Debian 10 and it works without any problems.