migtools / mig-controller

OpenShift Migration Controller

DVM fails if the project contains a PVC that consumes more than 50% of the project quota. #1215

Open JoaoBraveCoding opened 2 years ago

JoaoBraveCoding commented 2 years ago

Describe the bug

DVM fails if we try, for instance, to stage a project a second time when the project contains a PVC whose size consumes more than 50% of the project quota.

To Reproduce Steps to reproduce the behavior:

  1. Create a project with a 100Gb storage quota
  2. Create a 60Gb PVC in that project (a minimal setup sketch follows this list)
  3. Stage the project for migration
  4. Stage the project for migration a second time
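
For anyone who prefers to script the setup, here is a minimal sketch using client-go. The namespace, object names, and the use of `Gi` quantities are assumptions, and it assumes a client-go release contemporary with MTC 1.5, where the PVC spec still uses `corev1.ResourceRequirements`:

```go
package repro

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// reproduceSetup creates the quota and the oversized PVC from steps 1-2.
// The clientset is assumed to point at the source cluster.
func reproduceSetup(ctx context.Context, client kubernetes.Interface, namespace string) error {
	// Step 1: cap total requested storage in the project at 100Gi.
	quota := &corev1.ResourceQuota{
		ObjectMeta: metav1.ObjectMeta{Name: "storage-quota", Namespace: namespace},
		Spec: corev1.ResourceQuotaSpec{
			Hard: corev1.ResourceList{
				corev1.ResourceRequestsStorage: resource.MustParse("100Gi"),
			},
		},
	}
	if _, err := client.CoreV1().ResourceQuotas(namespace).Create(ctx, quota, metav1.CreateOptions{}); err != nil {
		return err
	}

	// Step 2: a single PVC that consumes more than half of that quota.
	pvc := &corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{Name: "big-claim", Namespace: namespace},
		Spec: corev1.PersistentVolumeClaimSpec{
			AccessModes: []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			Resources: corev1.ResourceRequirements{
				Requests: corev1.ResourceList{
					corev1.ResourceStorage: resource.MustParse("60Gi"),
				},
			},
		},
	}
	_, err := client.CoreV1().PersistentVolumeClaims(namespace).Create(ctx, pvc, metav1.CreateOptions{})
	return err
}
```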

Expected behavior

Staging should succeed a second time without DVM failing.

Screenshots & Snippets

(screenshot attached in the original issue)

Additional context

We are running MIG operator version 1.5.0. Since I cannot find release notes, I'm not sure if the problem has been addressed in more recent releases.

Log line from oc logs migration-log-reader-657486d85d-mbtd9 -c plain -n openshift-migration | grep '"dvm":"edms-search-dev-staging-23788-8qfgf"':

openshift-migration migration-controller-5ffdb47b68-w9gc2 mtc {"level":"info","ts":1634046512.042396,"logger":"directvolume","msg":"","dvm":"edms-search-dev-staging-23788-8qfgf","migMigration":"edms-search-dev-staging-23788","error":"persistentvolumeclaims \"records-files-claim\" is forbidden: exceeded quota: edms-search-dev, requested: requests.storage=2Ti, used: requests.storage=2Ti, limited: requests.storage=2Ti","stacktrace":"\ngithub.com/konveyor/mig-controller/pkg/controller/directvolumemigration.(*Task).Run()\n\t/opt/app-root/src/github.com/konveyor/mig-controller/pkg/controller/directvolumemigration/task.go:249\ngithub.com/konveyor/mig-controller/pkg/controller/directvolumemigration.(*ReconcileDirectVolumeMigration).migrate()\n\t/opt/app-root/src/github.com/konveyor/mig-controller/pkg/controller/directvolumemigration/migrate.go:39\ngithub.com/konveyor/mig-controller/pkg/controller/directvolumemigration.(*ReconcileDirectVolumeMigration).Reconcile()\n\t/opt/app-root/src/github.com/konveyor/mig-controller/pkg/controller/directvolumemigration/directvolumemigration_controller.go:144\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler()\n\t/opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.1-0.20201215171748-096b2e07c091/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem()\n\t/opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.1-0.20201215171748-096b2e07c091/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1()\n\t/opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.1-0.20201215171748-096b2e07c091/pkg/internal/controller/controller.go:198\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1()\n\t/opt/app-root/src/go/pkg/mod/k8s.io/apimachinery@v0.20.0/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1()\n\t/opt/app-root/src/go/pkg/mod/k8s.io/apimachinery@v0.20.0/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil()\n\t/opt/app-root/src/go/pkg/mod/k8s.io/apimachinery@v0.20.0/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil()\n\t/opt/app-root/src/go/pkg/mod/k8s.io/apimachinery@v0.20.0/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext()\n\t/opt/app-root/src/go/pkg/mod/k8s.io/apimachinery@v0.20.0/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext()\n\t/opt/app-root/src/go/pkg/mod/k8s.io/apimachinery@v0.20.0/pkg/util/wait/wait.go:99\nruntime.goexit()\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1373"}
alaypatel07 commented 2 years ago

Thanks for the report. This is an open issue that we need to handle.

For anyone curious about the root cause, here is why this is happening:

  1. The DVM controller tries to create the PVC object on the destination cluster.
  2. If the PVC object already exists, the api-server normally returns an AlreadyExists error.
  3. However, when a quota is in place, the api-server validates the incoming create request against the quota first. The quota check fails before the name collision is reported, so instead of AlreadyExists the api-server returns forbidden: exceeded quota.
  4. The DVM controller is not wired to handle any error other than AlreadyExists, so it fails as reported here (see the sketch after this list).
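
A minimal sketch of this failure mode (illustrative only, not the actual mig-controller code; the function and variable names are made up):

```go
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	k8serrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ensurePVC mirrors the behaviour described in the list above: create the PVC
// on the destination and treat only AlreadyExists as "already migrated".
func ensurePVC(ctx context.Context, dest kubernetes.Interface, pvc *corev1.PersistentVolumeClaim) error {
	_, err := dest.CoreV1().PersistentVolumeClaims(pvc.Namespace).Create(ctx, pvc, metav1.CreateOptions{})
	if k8serrors.IsAlreadyExists(err) {
		return nil // benign: the PVC was created on a previous stage run
	}
	// With a tight quota, the quota admission plugin rejects the request with
	// 403 Forbidden ("exceeded quota") before the name collision is ever
	// reported, so a second stage run lands here and the DVM fails.
	return err
}
```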

Workaround:

  1. Lift the quota temporarily.
  2. Delete the PVCs in the destination namespace (see the sketch after this list). If the quota is large enough for all of the PVCs, the new PVCs will be created and the data copied over again. This does mean all of the data is transferred a second time, but it can be an option when lifting the quota is not.
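
The destination PVCs can be removed with oc delete pvc in the target namespace. For completeness, a hedged client-go equivalent (the namespace and clientset are assumptions, and this deletes every PVC in the namespace, so use with care):

```go
package workaround

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteDestinationPVCs removes all PVCs in the given destination namespace so
// the next stage run can recreate them (and re-copy the data) within quota.
func deleteDestinationPVCs(ctx context.Context, dest kubernetes.Interface, namespace string) error {
	return dest.CoreV1().PersistentVolumeClaims(namespace).
		DeleteCollection(ctx, metav1.DeleteOptions{}, metav1.ListOptions{})
}
```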

Bugfix proposal:

Instead of depending on the api-server error to assert whether the PVC exists, the DVM controller needs to make an explicit Get call to check for the PVC. The part of the error handling that leads to this failure is here.
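
A sketch of that proposed flow (again illustrative, not a patch against the actual task code; the helper name is made up):

```go
package proposal

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	k8serrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ensurePVCExplicit checks for the destination PVC first, so a quota rejection
// on Create can no longer mask the fact that the PVC is already there.
func ensurePVCExplicit(ctx context.Context, dest kubernetes.Interface, pvc *corev1.PersistentVolumeClaim) error {
	_, err := dest.CoreV1().PersistentVolumeClaims(pvc.Namespace).Get(ctx, pvc.Name, metav1.GetOptions{})
	if err == nil {
		return nil // already staged/migrated; nothing to create
	}
	if !k8serrors.IsNotFound(err) {
		return err
	}
	_, err = dest.CoreV1().PersistentVolumeClaims(pvc.Namespace).Create(ctx, pvc, metav1.CreateOptions{})
	if k8serrors.IsAlreadyExists(err) {
		return nil // created concurrently between the Get and the Create
	}
	return err
}
```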