migtools / mig-controller

OpenShift Migration Controller

Halting and starting cluster nodes produces multiple errors in all migration pods #431

Open Danil-Grigorev opened 4 years ago

Danil-Grigorev commented 4 years ago

Here is the log output, which occurs in both the discovery and cam containers:

This started to appear on OpenTLC clusters overnight, after the nodes were stopped and then restarted the next morning. It seems like the scheme registration in the migController didn't pick up the core resources during controller setup.

{"level":"error","ts":1582712218.5711539,"logger":"plan|xjnxs","msg":"","error":"no matches for kind \"ImageStream\" in version \"image.openshift.io/v1\"","stacktrace":"github.com/fusor/mig-controller/vendor/github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/fusor/mig-controller/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/fusor/mig-controller/pkg/logging.Logger.Error\n\t/go/src/github.com/fusor/mig-controller/pkg/logging/logger.go:75\ngithub.com/fusor/mig-controller/pkg/logging.Logger.Trace\n\t/go/src/github.com/fusor/mig-controller/pkg/logging/logger.go:81\ngithub.com/fusor/mig-controller/pkg/controller/migplan.ReconcileMigPlan.ensureRegistryImageStream\n\t/go/src/github.com/fusor/mig-controller/pkg/controller/migplan/registry.go:127\ngithub.com/fusor/mig-controller/pkg/controller/migplan.ReconcileMigPlan.ensureMigRegistries\n\t/go/src/github.com/fusor/mig-controller/pkg/controller/migplan/registry.go:52\ngithub.com/fusor/mig-controller/pkg/controller/migplan.(*ReconcileMigPlan).Reconcile\n\t/go/src/github.com/fusor/mig-controller/pkg/controller/migplan/migplan_controller.go:243\ngithub.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:215\ngithub.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/src/github.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\ngithub.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\ngithub.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\ngithub.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}

Danil-Grigorev commented 4 years ago

Some errors also appear in the mig-operator pods:

Failed to find exact match for migration.openshift.io/v1alpha1.MigStorage by [kind, name, singularName, shortNames]
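
This operator-side error looks like the same class of failure: a lookup for migration.openshift.io/v1alpha1.MigStorage in the cluster's discovered API resources found no match. A hedged diagnostic sketch (hypothetical, not part of mig-operator) that checks whether that group/version is actually being served:

// Hedged sketch: query the discovery endpoint directly for the
// migration.openshift.io/v1alpha1 group/version.
package main

import (
	"fmt"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// ServerResourcesForGroupVersion returns an error if the group/version is
	// not served, which corresponds to the "no exact match" lookup failure.
	list, err := dc.ServerResourcesForGroupVersion("migration.openshift.io/v1alpha1")
	if err != nil {
		fmt.Println("group/version not served:", err)
		return
	}
	for _, r := range list.APIResources {
		fmt.Println(r.Kind, r.Name)
	}
}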

Velero:

time="2020-02-26T16:23:32Z" level=error msg="Error checking repository for stale locks" controller=restic-repository error="backupstoragelocation.velero.io \"backup-z64cs\" not found" logSource="pkg/controller/restic_repository_controller.go:142" name=ci-backup-z64cs-cf2ds namespace=openshift-migration

Looking into the events, it is apparent that the cluster network was not in a ready state at the moment the pods were coming up:

113m        Warning   NetworkNotReady        pod/migration-controller-6d59b8c4c6-twczk   network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network