opendatahub-io / kubeflow

Machine Learning Toolkit for Kubernetes
Apache License 2.0

TODO: unclean shutdown of envtest e2e tests #429

Open jiridanek opened 3 weeks ago

jiridanek commented 3 weeks ago

Harmless, but annoying: the specs pass, but the manager logs certwatcher errors while shutting down.

2024-10-22T13:38:47+02:00   INFO    Stopping and waiting for webhooks
2024-10-22T13:38:47+02:00   INFO    controller-runtime.webhook  Shutting down webhook server with timeout of 1 minute
2024-10-22T13:38:47+02:00   INFO    Stopping and waiting for HTTP servers
2024-10-22T13:38:47+02:00   INFO    Wait completed, proceeding to shutdown the manager
2024-10-22T13:38:47+02:00   DEBUG   controller-runtime.certwatcher  certificate event   {"event": "REMOVE        \"/var/folders/f1/3m518k5d34l72v_9nqyjzqm80000gn/T/envtest-serving-certs-2153001584/tls.key\""}
2024-10-22T13:38:47+02:00   ERROR   controller-runtime.certwatcher  error re-watching file  {"error": "fsnotify: watcher already closed"}
sigs.k8s.io/controller-runtime/pkg/certwatcher.(*CertWatcher).handleEvent
    /Users/jdanek/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.0/pkg/certwatcher/certwatcher.go:185
sigs.k8s.io/controller-runtime/pkg/certwatcher.(*CertWatcher).Watch
    /Users/jdanek/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.0/pkg/certwatcher/certwatcher.go:133
2024-10-22T13:38:47+02:00   ERROR   controller-runtime.certwatcher  error re-reading certificate    {"error": "open /var/folders/f1/3m518k5d34l72v_9nqyjzqm80000gn/T/envtest-serving-certs-2153001584/tls.crt: no such file or directory"}
sigs.k8s.io/controller-runtime/pkg/certwatcher.(*CertWatcher).handleEvent
    /Users/jdanek/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.0/pkg/certwatcher/certwatcher.go:190
sigs.k8s.io/controller-runtime/pkg/certwatcher.(*CertWatcher).Watch
    /Users/jdanek/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.0/pkg/certwatcher/certwatcher.go:133

Ran 2 of 2 Specs in 10.003 seconds
SUCCESS! -- 2 Passed | 0 Failed | 0 Pending | 0 Skipped

This is something people in general have struggled with, and we have a TODO comment for it:

https://github.com/opendatahub-io/kubeflow/blob/0743e0e444f236b75d90eb2053ac0e79ead5950f/components/odh-notebook-controller/controllers/suite_test.go#L189-L195

The solution seems to be to create a nested context for the manager and cancel it when the tests are done:

subctx, cancel := context.WithCancel(ctx)

...

    // Start the manager
    go func() {
        defer GinkgoRecover()
        // Pass the nested context so that cancel() actually stops the manager
        err := mgr.Start(subctx)
        Expect(err).ToNot(HaveOccurred(), "Failed to run manager")
    }()

...

// when we're done testing
cancel()

but it needs to be checked.

jiridanek commented 3 weeks ago

^^ that nested context does not seem to be good enough; I still see unclean shutdowns
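
One thing that might be worth trying next, sketched here without controller-runtime and not verified against v0.17.0: after `cancel()`, wait for the manager goroutine to actually return before tearing down the test environment (which is presumably what removes the serving certs seen in the log). All names below (`shutdownInOrder`, `managerDone`, `stopEnv`, `demo`) are hypothetical, and whether this ordering silences the certwatcher errors is an open assumption.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// shutdownInOrder cancels the manager's context, waits (up to timeout) for
// the manager goroutine to signal that it has returned, and only then calls
// stopEnv. stopEnv is a hypothetical stand-in for the environment teardown.
func shutdownInOrder(cancel context.CancelFunc, managerDone <-chan struct{}, stopEnv func() error, timeout time.Duration) error {
	cancel()
	select {
	case <-managerDone:
	case <-time.After(timeout):
		return errors.New("manager did not stop before the timeout")
	}
	return stopEnv()
}

// demo wires the helper to a fake manager and returns the observed order.
func demo() ([]string, error) {
	var steps []string
	ctx, cancel := context.WithCancel(context.Background())
	managerDone := make(chan struct{})

	go func() {
		defer close(managerDone) // runs after the append below
		<-ctx.Done()             // stand-in for a blocking mgr.Start(ctx)
		steps = append(steps, "manager stopped")
	}()

	err := shutdownInOrder(cancel, managerDone, func() error {
		steps = append(steps, "environment stopped")
		return nil
	}, 5*time.Second)
	return steps, err
}

func main() {
	steps, err := demo()
	fmt.Println(steps, err) // prints "[manager stopped environment stopped] <nil>"
}
```

In a Ginkgo suite the equivalent would be closing a channel in the goroutine that runs the manager and receiving from it in the teardown step before stopping the environment. Background goroutines that outlive `Start` would not be covered by this.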