opendatahub-io / kubeflow

Machine Learning Toolkit for Kubernetes
Apache License 2.0
10 stars 35 forks source link

TODO: unclean shutdown of envtest e2e tests #429

Open jiridanek opened 1 month ago

jiridanek commented 1 month ago

Harmless, but annoying

2024-10-22T13:38:47+02:00   INFO    Stopping and waiting for webhooks
2024-10-22T13:38:47+02:00   INFO    controller-runtime.webhook  Shutting down webhook server with timeout of 1 minute
2024-10-22T13:38:47+02:00   INFO    Stopping and waiting for HTTP servers
2024-10-22T13:38:47+02:00   INFO    Wait completed, proceeding to shutdown the manager
2024-10-22T13:38:47+02:00   DEBUG   controller-runtime.certwatcher  certificate event   {"event": "REMOVE        \"/var/folders/f1/3m518k5d34l72v_9nqyjzqm80000gn/T/envtest-serving-certs-2153001584/tls.key\""}
2024-10-22T13:38:47+02:00   ERROR   controller-runtime.certwatcher  error re-watching file  {"error": "fsnotify: watcher already closed"}
sigs.k8s.io/controller-runtime/pkg/certwatcher.(*CertWatcher).handleEvent
    /Users/jdanek/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.0/pkg/certwatcher/certwatcher.go:185
sigs.k8s.io/controller-runtime/pkg/certwatcher.(*CertWatcher).Watch
    /Users/jdanek/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.0/pkg/certwatcher/certwatcher.go:133
2024-10-22T13:38:47+02:00   ERROR   controller-runtime.certwatcher  error re-reading certificate    {"error": "open /var/folders/f1/3m518k5d34l72v_9nqyjzqm80000gn/T/envtest-serving-certs-2153001584/tls.crt: no such file or directory"}
sigs.k8s.io/controller-runtime/pkg/certwatcher.(*CertWatcher).handleEvent
    /Users/jdanek/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.0/pkg/certwatcher/certwatcher.go:190
sigs.k8s.io/controller-runtime/pkg/certwatcher.(*CertWatcher).Watch
    /Users/jdanek/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.0/pkg/certwatcher/certwatcher.go:133

Ran 2 of 2 Specs in 10.003 seconds
SUCCESS! -- 2 Passed | 0 Failed | 0 Pending | 0 Skipped

This is something that people in general struggled with

and we have a TODO comment for it

https://github.com/opendatahub-io/kubeflow/blob/0743e0e444f236b75d90eb2053ac0e79ead5950f/components/odh-notebook-controller/controllers/suite_test.go#L189-L195

The solution seems to be to create a nested context for

subctx, cancel := context.WithCancel(ctx)

...

    // Start the manager
    go func() {
        defer GinkgoRecover()
        err := mgr.Start(ctx)
        Expect(err).ToNot(HaveOccurred(), "Failed to run manager")
    }()

...

// when we're done testing
cancel()

but it needs to be checked.

jiridanek commented 1 month ago

^^ that nested context does not seem to be good enough, I still see unclean shutdowns

jiridanek commented 5 days ago

resolved by suppressing DEBUG logs

jiridanek commented 5 days ago

I like the suggestion of using

// Set the GracefulShutdownTimeout to 0 to prevent errors: // failed waiting for all runnables to end within grace period of 30s: context deadline exceeded // Ref: https://github.com/kubernetes-sigs/controller-runtime/blob/e59161ee/pkg/manager/internal.go#L543-L548 cfg.GracefulShutdownTimeout = lo.ToPtr(time.Duration(0))

jiridanek commented 5 days ago

The webhook timeout cannot be changed

idleConnsClosed := make(chan struct{})
go func() {
    <-ctx.Done()
    log.Info("Shutting down webhook server with timeout of 1 minute")

    ctx, cancel := context.WithTimeout(context.Background(), 1*time.Minute)
    defer cancel()
    if err := srv.Shutdown(ctx); err != nil {
        // Error from closing listeners, or context timeout
        log.Error(err, "error shutting down the HTTP server")
    }
    close(idleConnsClosed)
}()
jiridanek commented 5 days ago

One more advice, similarly to the previous, I found one other likely option in the struct docs

    BaseContext: func() context.Context {
        return ctx
    },
jiridanek commented 5 days ago

Waiting for

err = mgr.Start(ctx)

to return (after ctx is cancelled) and only then calling

err := envTest.Stop()

seems to be effective against

2024-11-17T20:19:57+01:00       ERROR   controller-runtime.certwatcher  error re-reading certificate    {"error": "open /var/folders/f1/3m518k5d34l72v_9nqyjzqm80000gn/T/envtest-serving-certs-737480808/tls.crt: no such file or directory"}