kubernetes-sigs / cluster-api

Home for Cluster API, a subproject of sig-cluster-lifecycle
https://cluster-api.sigs.k8s.io
Apache License 2.0

Log panics with current log context during reconcile #5709

Closed: sbueringer closed this issue 2 years ago

sbueringer commented 2 years ago

User Story

As an operator, it would be great if Cluster API automatically recovered panics during reconciles and logged the panic instead of failing entirely.

Detailed Description

Currently, if there is a panic in a CAPI controller, the whole controller fails and shuts down. This means that a problem during the reconciliation of a single resource takes down the entire controller and renders it useless. It would be better to recover that panic and just return an error from that specific Reconcile call instead.

Furthermore, it's hard to debug panics because there is currently no way to correlate a panic to a specific resource.

Cases where recover would have helped (just from yesterday):

Anything else you would like to add:

I would propose adding the following statement at the top of all our Reconcile funcs (it relies on the func declaring a named error return value, reterr):

    // Assumes imports of "runtime/debug", "github.com/pkg/errors" and
    // kerrors "k8s.io/apimachinery/pkg/util/errors".
    log := ctrl.LoggerFrom(ctx)
    defer func() {
        if r := recover(); r != nil {
            // Fold the panic and its stack trace into the returned error.
            reterr = kerrors.NewAggregate([]error{reterr, errors.Errorf("panic during reconcile: %s\n%s", r, string(debug.Stack()))})
        }
    }()
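
For illustration, here is a self-contained sketch of how that snippet sits in a full Reconcile func (the ClusterReconciler type and the body are placeholders; the important detail is the named reterr return value, which is what lets the deferred closure turn the panic into a returned error):

    package controllers

    import (
        "context"
        "runtime/debug"

        "github.com/pkg/errors"
        kerrors "k8s.io/apimachinery/pkg/util/errors"
        ctrl "sigs.k8s.io/controller-runtime"
    )

    type ClusterReconciler struct{}

    // Reconcile names its error return value (reterr) so that the deferred
    // recover handler can overwrite it when a panic occurs.
    func (c *ClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (_ ctrl.Result, reterr error) {
        log := ctrl.LoggerFrom(ctx)
        defer func() {
            if r := recover(); r != nil {
                // Fold the panic and stack trace into the returned error;
                // controller-runtime then logs it as a "Reconciler error"
                // together with the request's name/namespace context.
                reterr = kerrors.NewAggregate([]error{reterr, errors.Errorf("panic during reconcile: %s\n%s", r, string(debug.Stack()))})
            }
        }()

        log.V(4).Info("Reconciling Cluster")
        // ... actual reconciliation logic ...
        return ctrl.Result{}, nil
    }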

This leads to the following log message, and the controller does not fail:

E1119 11:44:34.429760     105 controller.go:317] controller/cluster "msg"="Reconciler error" "error"="panic during reconcile: runtime error: invalid memory address or nil pointer dereference\ngoroutine 713 [running]:\nruntime/debug.Stack()\n\t/usr/local/Cellar/go/1.17.2/libexec/src/runtime/debug/stack.go:24 +0x7a\nsigs.k8s.io/cluster-api/controllers.(*ClusterReconciler).Reconcile.func1()\n\t/Users/buringerst/code/src/sigs.k8s.io/cluster-api/controllers/cluster_controller.go:102 +0x5e\npanic({0x22e95c0, 0x3535d60})\n\t/usr/local/Cellar/go/1.17.2/libexec/src/runtime/panic.go:1047 +0x262\nsigs.k8s.io/cluster-api/controllers.(*ClusterReconciler).Reconcile(0xc000278af0, {0x26cdd98, 0xc000cda960}, {{{0xc000572870, 0x12}, {0xc000572858, 0x12}}})\n\t/Users/buringerst/code/src/sigs.k8s.io/cluster-api/controllers/cluster_controller.go:109 +0x1aa\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0xc000786370, {0x26cdd98, 0xc000cda960}, {{{0xc000572870, 0x12}, {0xc000572858, 0x12}}})\n\t/Users/buringerst/code/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0-beta.0.0.20211110210527-619e6b92dab9/pkg/internal/controller/controller.go:114 +0x370\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000786370, {0x26cdd98, 0xc000cda930}, {0x2362060, 0xc000bd1200})\n\t/Users/buringerst/code/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0-beta.0.0.20211110210527-619e6b92dab9/pkg/internal/controller/controller.go:311 +0x430\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000786370, {0x26cdcf0, 0xc000793140})\n\t/Users/buringerst/code/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0-beta.0.0.20211110210527-619e6b92dab9/pkg/internal/controller/controller.go:266 +0x405\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()\n\t/Users/buringerst/code/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0-beta.0.0.20211110210527-619e6b92dab9/pkg/internal/controller/controller.go:227 +0xe8\ncreated by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2\n\t/Users/buringerst/code/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0-beta.0.0.20211110210527-619e6b92dab9/pkg/internal/controller/controller.go:223 +0x565\n" "name"="quick-start-bpjf7t" "namespace"="quick-start-y8cg2o" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Cluster" 

That looks hard to read, but with a proper logging UI it's easy to extract the following information (which would have been hard to infer otherwise):

- the name and namespace of the resource whose reconciliation panicked ("name"="quick-start-bpjf7t", "namespace"="quick-start-y8cg2o")
- the reconciler group and kind ("reconciler group"="cluster.x-k8s.io", "reconciler kind"="Cluster")
- the full stack trace of the panic

This is potentially something that should be addressed in controller-runtime instead.
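
As a sketch of what a controller-runtime-level solution could look like: recent controller-runtime releases expose an opt-in, per-controller RecoverPanic option (a *bool field on controller.Options in current versions), which recovers the panic and surfaces it as a reconcile error. The ClusterReconciler type and wiring below are illustrative:

    package controllers

    import (
        "context"

        ctrl "sigs.k8s.io/controller-runtime"
        "sigs.k8s.io/controller-runtime/pkg/controller"

        clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
    )

    type ClusterReconciler struct{}

    func (r *ClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
        // ... reconciliation logic ...
        return ctrl.Result{}, nil
    }

    // SetupWithManager wires the reconciler up with panic recovery enabled.
    func (r *ClusterReconciler) SetupWithManager(mgr ctrl.Manager) error {
        recoverPanic := true
        return ctrl.NewControllerManagedBy(mgr).
            For(&clusterv1.Cluster{}).
            WithOptions(controller.Options{RecoverPanic: &recoverPanic}).
            Complete(r)
    }

Leaving RecoverPanic unset keeps the default behavior of letting the panic propagate and crash the process.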

/kind feature
/area health

sbueringer commented 2 years ago

@fabriziopandini @vincepri @randomvariable Opinions?

mkumatag commented 2 years ago

/assign @Karthik-K-N

fabriziopandini commented 2 years ago

/milestone v1.1

sbueringer commented 2 years ago

Based on the discussion on https://github.com/kubernetes-sigs/cluster-api/pull/5966, we don't want to recover panics: the controller crashlooping makes a panic impossible for users to miss. To improve the logging of panics in the cases where they should not be recovered, I've opened an issue in controller-runtime: https://github.com/kubernetes-sigs/controller-runtime/issues/1793
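
For illustration, a minimal sketch (not the controller-runtime implementation) of the idea behind that issue: log the panic with the request-scoped logger so it carries the resource context, then re-panic so the controller still crashloops and the failure cannot be missed:

    package controllers

    import (
        "context"
        "runtime/debug"

        ctrl "sigs.k8s.io/controller-runtime"
    )

    type ClusterReconciler struct{}

    func (c *ClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
        log := ctrl.LoggerFrom(ctx)
        defer func() {
            if p := recover(); p != nil {
                // The request-scoped logger already carries name, namespace
                // and reconciler group/kind, so the panic is correlated to
                // the affected resource.
                log.Error(nil, "Panic during reconcile", "panic", p, "stacktrace", string(debug.Stack()))
                // Re-panic so the controller still crashes visibly.
                panic(p)
            }
        }()

        // ... reconciliation logic ...
        return ctrl.Result{}, nil
    }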

sbueringer commented 2 years ago

/retitle Log panics with current log context during reconcile