migtools / mig-controller

OpenShift Migration Controller

staging pod is not created in migmigration #694

Open amruhela opened 3 years ago

amruhela commented 3 years ago

Describe the bug
I have a 6-node OpenShift cluster with 3 availability zones in IBM Cloud.

Issue: While running a migmigration in CAM, the staging pod goes into Pending state with the following events:

Events:
  Type     Reason            Age   From                Message
  Warning  FailedScheduling        default-scheduler   error while running "VolumeBinding" filter plugin for pod "stage-mysql-1-ftcz4-wwf9x": pod has unbound immediate PersistentVolumeClaims
  Warning  FailedScheduling        default-scheduler   error while running "VolumeBinding" filter plugin for pod "stage-mysql-1-ftcz4-wwf9x": pod has unbound immediate PersistentVolumeClaims
  Warning  FailedScheduling        default-scheduler   0/6 nodes are available: 6 node(s) had volume node affinity conflict.

Mig-Migration yaml:

cat .\mig-migration.yaml

apiVersion: migration.openshift.io/v1alpha1
kind: MigMigration
metadata:
  labels:
    controller-tools.k8s.io: "1.0"
  name: migplan-robotshop
  namespace: openshift-migration
spec:
  stage: true
  quiescePods: false
  keepAnnotations: true
  migPlanRef:
    name: migplan-robotshop
    namespace: openshift-migration

Describe of mig-migration:

Spec:
  Keep Annotations:  true
  Mig Plan Ref:
    Name:       migplan-robotshop
    Namespace:  openshift-migration
  Stage:        true
Status:
  Conditions:
    Category:              Advisory
    Last Transition Time:  2020-09-28T09:52:01Z
    Message:               Step: 18/23        <-- stuck on step 18
    Reason:                StageRestoreCreated
    Status:                True
    Type:                  Running

    Category:              Required
    Last Transition Time:  2020-09-28T09:50:31Z
    Message:               The migration is ready.
    Status:                True
    Type:                  Ready

    Category:              Advisory
    Durable:               true
    Last Transition Time:  2020-09-28T09:50:40Z
    Message:               [3] Stage pods created.
    Status:                True
    Type:                  StagePodsCreated
  Itinerary:        Stage
  Observed Digest:  55d57c4a98f9890261888763f80f61b73304fc871f8d54060f249dd25a517aaa
  Phase:            StageRestoreCreated
  Start Timestamp:  2020-09-28T09:50:31Z
Events:

Please advise here on the fix for this issue.

pranavgaikwad commented 3 years ago

Same symptoms as #657

The controller currently only checks whether the PVCs are bound for any non-running pod. Ideally, the check should be on the pod's Running condition: do not proceed until all stage pods are found Running, whatever the reason they are not.

@alaypatel07 @sseago

ganesan-cmd commented 3 years ago

@pranavgaikwad, thanks for your response. Aman and I are working on the same project. In our case the staging PVCs are in Bound state but one pod is in Pending state, and the MigMigration status is not moving past "StageRestoreCreated". Is there any dependency we are missing on the target for this application migration?

[root@oc8660015353 ~]# oc get pvc -n robot-shop
NAME                         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS        AGE
mongodb-volume-claim         Bound    pvc-418aefe8-0787-4f3d-b2ca-d7f84a677fea   20Gi       RWO            ibmc-block-bronze   17m
mysql-data-volume-claim      Bound    pvc-8e866fb9-b236-4f5b-aac8-fe4446f6950b   20Gi       RWO            ibmc-block-bronze   17m
mysql-scripts-volume-claim   Bound    pvc-75d3a0bb-0ad2-4c9f-98a2-d86a0d0b1435   20Gi       RWO            ibmc-block-bronze   17m
redis-volume-claim           Bound    pvc-cf78a01d-f694-4b59-b3de-83b2b5a31da8   20Gi       RWO            ibmc-block-bronze   17m

[root@oc8660015353 ~]# oc get pods -n robot-shop
NAME                                   READY   STATUS    RESTARTS   AGE
stage-mongodb-6b47bd444f-zh7dg-r7gq2   1/1     Running   0          17m
stage-mysql-1-ftcz4-krq9t              0/2     Pending   0          17m
stage-redis-b4cb56cbb-2bctn-h4kkp      1/1     Running   0          17m
[root@oc8660015353 ~]#

MIGMIGRATION STATUS:-

Status:
  Conditions:
    Category:              Advisory
    Last Transition Time:  2020-09-28T15:47:38Z
    Message:               Step: 21/33
    Reason:                StageRestoreCreated
    Status:                True
    Type:                  Running

    Category:              Required
    Last Transition Time:  2020-09-28T15:45:20Z
    Message:               The migration is ready.
    Status:                True
    Type:                  Ready

    Category:              Advisory
    Durable:               true
    Last Transition Time:  2020-09-28T15:46:17Z
    Message:               [3] Stage pods created.
    Status:                True
    Type:                  StagePodsCreated
  Itinerary:        Final
  Observed Digest:  9e1ff10150d7f87422ce54094a203e59d56da061c17105282683a39fa7b113d4
  Phase:            StageRestoreCreated
  Start Timestamp:  2020-09-28T15:45:20Z
Events:
[root@oc8660015353 ~]#

@alaypatel07 @sseago

amruhela commented 3 years ago

@pranavgaikwad Can you please advise why the stage pods are not getting created in our scenario? Let me know if you need further logs or anything else.

pranavgaikwad commented 3 years ago

@amruhela Sorry for the delayed response. I can certainly help you figure this one out. It would be helpful to paste the YAML definitions of the stage pods created on the source cluster and of the ones created on the destination cluster. The events log you pasted shows a volume node affinity conflict. I do not have much knowledge of the storage class you're using, but for gp2 (AWS EBS volumes) this error usually means the pod requesting the volume was scheduled on a node that is not in the same availability zone as the volume.
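
For illustration, a zone-bound volume's PV usually carries a node affinity stanza like the sketch below (a generic example, not taken from your cluster; the volume source, zone key, and values are placeholders). The scheduler reports "volume node affinity conflict" when none of the candidate nodes carry a matching zone label:

```yaml
# Illustrative only: a PV pinned to a single zone (gp2-style example).
# Placeholders: volumeID, storage class name, and the zone value.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-zoned-pv
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteOnce
  storageClassName: gp2
  awsElasticBlockStore:
    volumeID: vol-0123456789abcdef0   # placeholder volume ID
    fsType: ext4
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: failure-domain.beta.kubernetes.io/zone   # topology.kubernetes.io/zone on newer clusters
              operator: In
              values:
                - us-east-1a          # pods using this PV must land on a node in this zone
```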

ganesan-cmd commented 3 years ago

@pranavgaikwad

Thanks for your response. We are getting the same error while installing and migrating robot-shop, and we attached the detailed logs in https://github.com/konveyor/mig-demo-apps/issues/24. Please check and help us resolve this issue, and let us know if you need any further information.

pranavgaikwad commented 3 years ago

@ganesan-cmd @amruhela

I found the root cause of the failing stage pods. Here are my findings:

Stage pods are temporary pods created by mig-controller. The volumes to be migrated are attached to the stage pods, so a stage pod and the original pod share the same volume. Stage pods are scheduled on the same node as the original pod that uses the volume. The assumption here is that volumes can be shared between multiple pods when those pods are scheduled on the same physical node. This assumption holds true for most storage providers we have seen, but it is not true for IBM Block storage. I confirmed this by testing it on IBM Cloud: in a ROKS environment, volumes cannot be shared between multiple pods even when they are scheduled on the same node.

Here's a small gist I created for you to confirm this issue: https://gist.github.com/pranavgaikwad/a03c7ba84594621f914eccf6362a3231 It has a sample manifest that launches two pods sharing the same PVC. You need to update the nodeSelector on both pods with values applicable to your environment, to force-schedule them onto the same node.
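
To make that easier to try, here is a minimal sketch along the same lines (an approximation, not the gist's exact contents; the PVC name, image, storage class, and the nodeSelector value are placeholders to adjust for your environment):

```yaml
# Two pods pinned to the same node and mounting the same RWO PVC.
# On most providers both pods start; with IBM Block storage the second
# pod is expected to stay stuck because the volume cannot be shared.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-claim                       # placeholder name
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
  storageClassName: ibmc-block-bronze      # the class used in the report above
---
apiVersion: v1
kind: Pod
metadata:
  name: volume-user-1
spec:
  nodeSelector:
    kubernetes.io/hostname: "<node-name>"  # replace with a node from your cluster
  containers:
    - name: sleep
      image: registry.access.redhat.com/ubi8/ubi
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: shared-claim
---
apiVersion: v1
kind: Pod
metadata:
  name: volume-user-2
spec:
  nodeSelector:
    kubernetes.io/hostname: "<node-name>"  # the same node as volume-user-1
  containers:
    - name: sleep
      image: registry.access.redhat.com/ubi8/ubi
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: shared-claim
```

If both pods reach Running, volume sharing works on your storage class; if the second pod never starts, you are hitting the limitation described above.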

Unfortunately, this problem cannot be solved with the current mig-controller implementation, and I cannot offer a workaround either. It may require a redesign of the stage pod implementation in mig-controller.

pranavgaikwad commented 3 years ago

@ganesan-cmd @amruhela

I confirmed that the ibmc-file-<tier> volumes do not have the issue the block volumes have. For now, you can use file volumes to proceed with MTC testing.

ashoks27 commented 3 years ago

@pranavgaikwad Thanks for your update. We have tried migrating both sock-shop and robot-shop using the storage class "ibmc-file-gold" and they are still failing. The sock-shop application migrates without any issue in MTC, but the application is not running, whereas the robot-shop migration is still hanging at the stage pod creation step. Attached below is the error for your reference; please have a look and advise us further.

Sock-shop Application

[root@oc8660015353 ~]# oc get pods -n sock-shop
NAME                            READY   STATUS             RESTARTS   AGE
carts-7454b9cd59-4r8xf          1/1     Running            0          3m45s
carts-db-db6484b95-psh2v        1/1     Running            0          3m45s
catalogue-7df8c66889-67gkp      0/1     Running            0          3m45s
catalogue-db-1-deploy           0/1     Error              0          3m43s
catalogue-db-1-hook-post        0/1     Error              0          3m37s
front-end-b9547686-qcz9g        1/1     Running            0          3m44s
orders-7566476bd4-qdq5v         1/1     Running            0          3m44s
orders-db-6c6d95dbd5-mbfng      1/1     Running            0          3m44s
payment-64865ff756-5jxz6        1/1     Running            0          3m44s
queue-master-677cf64f87-v7xjx   1/1     Running            0          3m44s
rabbitmq-59948ddcbf-zft5x       2/2     Running            0          3m43s
session-db-55ff99c9d6-l6cgl     1/1     Running            0          3m42s
shipping-6489d78c74-lg2gs       1/1     Running            0          3m42s
user-cbcdf4964-rszt9            0/1     Running            0          3m42s
user-db-6bb96fb99b-kxvnj        0/1     CrashLoopBackOff   5          3m42s

Robot-shop Application

[root@oc4377604745 ~]# oc project robot-shop
Now using project "robot-shop" on server "https://c107-e.us-south.containers.cloud.ibm.com:31573".
[root@oc4377604745 ~]# oc get all
NAME                                       READY   STATUS    RESTARTS   AGE
pod/stage-mongodb-6b47bd444f-mhzzn-k2nt4   1/1     Running   0          10m
pod/stage-mysql-1-7dc2g-9pwxm              0/2     Pending   0          10m
pod/stage-redis-b4cb56cbb-qstvw-7rpl8      1/1     Running   0          10m

Sock-Shop-Migartion-october12.txt robot-shop-Migration-october12.txt

jwmatthews commented 3 years ago

Note, filed a downstream BZ to track this: https://bugzilla.redhat.com/show_bug.cgi?id=1887526

ashoks27 commented 3 years ago

@jwmatthews Thanks for filing the BZ for the robot-shop application. Can you please help us fix the sock-shop application issues as well? @pranavgaikwad

jwmatthews commented 3 years ago

@ashoks27 I'll add this to our backlog to consider for an upcoming sprint. I don't believe @pranavgaikwad will have bandwidth to address this in the short term, but supporting IBM ROKS is in our plans; we just need to balance a few competing priorities.

Note we are tracking the larger goal of supporting IBM ROKS for MTC migrations in this JIRA: https://issues.redhat.com/browse/MIG-337

Please reach out to me by email (jmatthew@redhat.com) if you'd like to discuss when we could consider working on this in a sprint.

ashoks27 commented 3 years ago

@jwmatthews Thank you for your update. I will write you an email to discuss the timing for the sprint. For now, we would like to know which storage class worked for @pranavgaikwad on IBM ROKS (https://github.com/konveyor/mig-controller/issues/694#issuecomment-706380862); it would help us move forward with our testing quickly. We have tested some of the file-based storage classes and they are not working for us.

shawn-hurley commented 3 years ago

Hello,

There are a couple of issues at play here, AFAICT.

  1. If you have a multi-AZ setup, you cannot use classic block/file storage. You can read more about that here: https://cloud.ibm.com/docs/openshift?topic=openshift-storage_planning

    • What ends up happening is that the volumes are in two different zones, so the scheduler cannot place the pod. That is the node affinity error referenced above.
  2. If you are using the same AZ, you need to make sure that the container user has access to the volume or is root. You can read more about that here: https://cloud.ibm.com/docs/openshift?topic=openshift-cs_troubleshoot_storage#cs_storage_nonroot

If you really want to set this up with a multi-az cluster, you will have to:

  1. Follow the guide to create new storage classes
  2. Pre-create the PVCs, setting the region and az labels to force them to be colocated (see the sketch at the end of this comment)
  3. Update the Robot Shop to set the correct context when deployed

I think that is what you need to do, and please let me know if this works for you.
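
As a rough sketch of steps 1 and 2 (the class name, region, zone, and size below are examples modeled on the ibmc-file classes, not values specific to your cluster; see the IBM storage planning doc linked above for the exact parameters):

```yaml
# Step 1 (sketch): a single-zone file storage class. All parameter values
# here are examples; adjust region/zone/IOPS to your cluster.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ibmc-file-bronze-dal10             # placeholder name
provisioner: ibm.io/ibmc-file
reclaimPolicy: Delete
parameters:
  billingType: hourly
  classVersion: "2"
  iopsPerGB: "2"
  type: Endurance
  sizeRange: "[20-12000]Gi"
  region: us-south                          # example region
  zone: dal10                               # example zone
---
# Step 2 (sketch): pre-create the PVC with region/zone labels so the volume
# is provisioned in the zone chosen above (one PVC per claim the app needs).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mysql-data-volume-claim
  labels:
    region: us-south                        # example region
    zone: dal10                             # example zone
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
  storageClassName: ibmc-file-bronze-dal10
```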

ashoks27 commented 3 years ago

@shawn-hurley Thanks for your detailed response. As our IBM ROKS cluster is a multi-AZ cluster, I have created a new storage class specifying the zone and have updated the PVCs with the region and az labels. I have also applied the SCC with runAsAny user, but the robot-shop application is still NOT running.

Attached below is the manifest and logs for your reference. Please check and guide me if I am missing anything here.

[root@oc4377604745 robot-shop]# oc get pods
NAME                         READY   STATUS             RESTARTS   AGE
cart-699567cb75-mrchm        1/1     Running            0          2m34s
catalogue-7476fdc8f4-gw5vd   1/1     Running            0          2m33s
dispatch-85ff479cdd-grvnr    1/1     Running            0          2m31s
mongodb-59b795bd6b-dnr89     0/1     CrashLoopBackOff   3          2m29s
mysql-1-deploy               1/1     Running            0          2m26s
mysql-1-jz7kw                0/2     CrashLoopBackOff   4          2m25s
payment-695779d6bb-wt94t     1/1     Running            0          2m23s
rabbitmq-7b44f9c449-27nxr    1/1     Running            0          2m21s
ratings-c45895c99-m2286      1/1     Running            0          2m20s
redis-78dd47d8c-9bft7        1/1     Running            0          2m18s
shipping-6fbbc47b67-c7gwh    1/1     Running            0          2m16s
user-56987c97fd-zqzwt        1/1     Running            0          2m14s
web-ffbbb68c7-kzjrh          1/1     Running            0          2m12s

[root@oc4377604745 robot-shop]# oc logs -f mysql-1-jz7kw -c post-hook
cp: cannot create regular file '/tmp/mysql-init-scripts/10-dump.sql.gz': Permission denied
cp: cannot create regular file '/tmp/mysql-init-scripts/20-ratings.sql': Permission denied

[root@oc4377604745 robot-shop]# oc logs -f mongodb-59b795bd6b-dnr89
about to fork child process, waiting until server is ready for connections.
forked process: 18
2020-11-11T15:44:51.648+0000 I CONTROL  [main] SERVER RESTARTED
2020-11-11T15:44:51.740+0000 I CONTROL  [initandlisten] MongoDB starting : pid=18 port=27017 dbpath=/data/db 64-bit host=mongodb-59b795bd6b-dnr89
2020-11-11T15:44:51.741+0000 I CONTROL  [initandlisten] db version v3.6.1
2020-11-11T15:44:51.741+0000 I CONTROL  [initandlisten] git version: 025d4f4fe61efd1fb6f0005be20cb45a004093d1
2020-11-11T15:44:51.741+0000 I CONTROL  [initandlisten] OpenSSL version: OpenSSL 1.0.1t  3 May 2016
2020-11-11T15:44:51.741+0000 I CONTROL  [initandlisten] allocator: tcmalloc
2020-11-11T15:44:51.741+0000 I CONTROL  [initandlisten] modules: none
2020-11-11T15:44:51.741+0000 I CONTROL  [initandlisten] build environment:
2020-11-11T15:44:51.741+0000 I CONTROL  [initandlisten]     distmod: debian81
2020-11-11T15:44:51.741+0000 I CONTROL  [initandlisten]     distarch: x86_64
2020-11-11T15:44:51.741+0000 I CONTROL  [initandlisten]     target_arch: x86_64
2020-11-11T15:44:51.741+0000 I CONTROL  [initandlisten] options: { net: { bindIp: "127.0.0.1", port: 27017, ssl: { mode: "disabled" } }, processManagement: { fork: true, pidFilePath: "/tmp/docker-entrypoint-temp-mongod.pid" }, systemLog: { destination: "file", logAppend: true, path: "/proc/1/fd/1" } }
2020-11-11T15:44:51.743+0000 I STORAGE  [initandlisten] exception in initAndListen: IllegalOperation: Attempted to create a lock file on a read-only directory: /data/db, terminating
2020-11-11T15:44:51.743+0000 I CONTROL  [initandlisten] now exiting
2020-11-11T15:44:51.743+0000 I CONTROL  [initandlisten] shutting down with code:100
ERROR: child process failed, exited with error number 100
To see additional information in this output, start without the "--fork" option.
[root@oc4377604745 robot-shop]#

[root@oc4377604745 robot-shop_yamls]# oc get sc ibmc-file-bronze-custom-test -o yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"storage.k8s.io/v1","kind":"StorageClass","metadata":{"annotations":{},"name":"ibmc-file-bronze-custom-test"},"parameters":{"billingType":"hourly","classVersion":"2","iopsPerGB":"2","region":"us-south","sizeRange":"[20-12000]Gi","type":"Endurance","zone":"dal10"},"provisioner":"ibm.io/ibmc-file","reclaimPolicy":"Delete","volumeBindingMode":"WaitForFirstConsumer"}
  creationTimestamp: "2020-11-06T13:46:46Z"
  labels:
    billingType: hourly
    region: us-south
    zone: dal10
  name: ibmc-file-bronze-custom-test
  resourceVersion: "60391436"
  selfLink: /apis/storage.k8s.io/v1/storageclasses/ibmc-file-bronze-custom-test
  uid: dee2ce76-73fa-44d1-b458-9cc2c64d41b5
parameters:
  billingType: hourly
  classVersion: "2"
  iopsPerGB: "2"
  region: us-south
  sizeRange: '[20-12000]Gi'
  type: Endurance
  zone: dal10
provisioner: ibm.io/ibmc-file
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
[root@oc4377604745 robot-shop_yamls]#

robot-shop-manifest.txt robot-shop-scc.txt

shawn-hurley commented 3 years ago

Hello,

I believe @jmontleon made a PR to fix up the robot shop to run in this environment. I think we found that, even with the path I laid out, you end up having an issue with the GID for the volume.

https://github.com/konveyor/mig-demo-apps/pull/26