kubernetes / ingress-nginx

Ingress-NGINX Controller for Kubernetes
https://kubernetes.github.io/ingress-nginx/
Apache License 2.0

panic: interface conversion: interface {} is cache.DeletedFinalStateUnknown, not *v1.Service #10015

Open mintcckey opened 1 year ago

mintcckey commented 1 year ago

What happened:

We have noticed several times that pods panic with "interface conversion: interface {} is cache.DeletedFinalStateUnknown, not *v1.Service". Checking the original source code, it seems the type assertion doesn't handle the case where obj can be a cache.DeletedFinalStateUnknown. I am sorry if this issue has already been reported by others.

Thanks.

E0525 04:03:38.226398       7 runtime.go:79] Observed a panic: &runtime.TypeAssertionError{_interface:(*runtime._type)(0x17c10e0), concrete:(*runtime._type)(0x18a3e00), asserted:(*runtime._type)(0x1a23800), missingMethod:""} (interface conversion: interface {} is cache.DeletedFinalStateUnknown, not *v1.Service)
goroutine 125 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1822d40?, 0xc003232660})
    k8s.io/apimachinery@v0.25.2/pkg/util/runtime/runtime.go:75 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x203020203030202?})
    k8s.io/apimachinery@v0.25.2/pkg/util/runtime/runtime.go:49 +0x75
panic({0x1822d40, 0xc003232660})
    runtime/panic.go:884 +0x212
k8s.io/ingress-nginx/internal/ingress/controller/store.New.func21({0x18a3e00?, 0xc001d820e0?})
    k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:772 +0xde
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnDelete(...)
    k8s.io/client-go@v0.25.2/tools/cache/controller.go:246
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
    k8s.io/client-go@v0.25.2/tools/cache/shared_informer.go:820 +0xaf
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x0?)
    k8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:157 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000659f38?, {0x1ccc140, 0xc00056ea20}, 0x1, 0xc000520cc0)
    k8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:158 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x3b9aca00, 0x0, 0x0?, 0xc000659f88?)
    k8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:135 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(...)
    k8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:92
k8s.io/client-go/tools/cache.(*processorListener).run(0xc0007b3a80?)
    k8s.io/client-go@v0.25.2/tools/cache/shared_informer.go:812 +0x6b
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
    k8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:75 +0x5a
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start
    k8s.io/apimachinery@v0.25.2/pkg/util/wait/wait.go:73 +0x85
panic: interface conversion: interface {} is cache.DeletedFinalStateUnknown, not *v1.Service [recovered]
    panic: interface conversion: interface {} is cache.DeletedFinalStateUnknown, not *v1.Service

What you expected to happen:

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.): controller-v1.4.0 tag

Kubernetes version (use kubectl version): v1.25.7

Environment:

We didn't use Helm to install the chart; instead we applied a YAML manifest.

Below is the code snippet from ingress-nginx/internal/ingress/controller/store/store.go at tag controller-v1.4.0 where I think the panic happens.

serviceHandler := cache.ResourceEventHandlerFuncs{
    AddFunc: func(obj interface{}) {
        svc := obj.(*corev1.Service)
        if svc.Spec.Type == corev1.ServiceTypeExternalName {
            updateCh.In() <- Event{
                Type: CreateEvent,
                Obj:  obj,
            }
        }
    },
    DeleteFunc: func(obj interface{}) {
        // When the informer misses the final delete event it hands the handler a
        // cache.DeletedFinalStateUnknown tombstone instead of a *corev1.Service,
        // so this bare type assertion panics.
        svc := obj.(*corev1.Service)
        if svc.Spec.Type == corev1.ServiceTypeExternalName {
            updateCh.In() <- Event{
                Type: DeleteEvent,
                Obj:  obj,
            }
        }
    },
    UpdateFunc: func(old, cur interface{}) {
        oldSvc := old.(*corev1.Service)
        curSvc := cur.(*corev1.Service)

        if reflect.DeepEqual(oldSvc, curSvc) {
            return
        }

        updateCh.In() <- Event{
            Type: UpdateEvent,
            Obj:  cur,
        }
    },
}
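
For context: client-go's DeleteFunc hands the handler the object's last known state when available, otherwise a cache.DeletedFinalStateUnknown tombstone (for example after a missed delete during a watch disruption). Below is a minimal sketch of the usual tombstone-unwrap guard, using a hypothetical helper name and the same imports as the snippet above; it is not the actual ingress-nginx fix, just an illustration:

// svcFromDeleteEvent is a hypothetical helper: it unwraps a delete
// notification into a *corev1.Service, handling the tombstone case
// instead of panicking on a bare type assertion.
func svcFromDeleteEvent(obj interface{}) (*corev1.Service, bool) {
    if svc, ok := obj.(*corev1.Service); ok {
        return svc, true
    }
    tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
    if !ok {
        return nil, false
    }
    svc, ok := tombstone.Obj.(*corev1.Service)
    return svc, ok
}

// DeleteFunc rewritten to use the guard:
DeleteFunc: func(obj interface{}) {
    svc, ok := svcFromDeleteEvent(obj)
    if !ok {
        return
    }
    if svc.Spec.Type == corev1.ServiceTypeExternalName {
        updateCh.In() <- Event{
            Type: DeleteEvent,
            Obj:  svc, // pass the unwrapped Service rather than the tombstone
        }
    }
},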

How to reproduce this issue:

Anything else we need to know:

It's not a critical issue for us, as the pod recovers after crashing.

k8s-ci-robot commented 1 year ago

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
longwuyuan commented 1 year ago

/remove-kind bug

Hi, the information you provided is limited to some log messages, and that is not enough to reproduce the problem for analysis.

Firstly, can you kindly answer all the questions that are visible in the template but were skipped? That info may provide context & details that will help the analysis.

Importantly, if you can kindly write a step-by-step guide that anyone can copy/paste on a kind cluster or a minikube cluster, it will help make some progress on this issue. Thanks

mintcckey commented 1 year ago

> /remove-kind bug
>
> Hi, the information you provided is limited to some log messages, and that is not enough to reproduce the problem for analysis.
>
> Firstly, can you kindly answer all the questions that are visible in the template but were skipped? That info may provide context & details that will help the analysis.
>
> Importantly, if you can kindly write a step-by-step guide that anyone can copy/paste on a kind cluster or a minikube cluster, it will help make some progress on this issue. Thanks

Hi there, I tried to provide as much information as I have, and I will keep updating it as more becomes available to me. Thanks.

tao12345666333 commented 1 year ago

Could you please try to upgrade your ingress-nginx controller to the latest version?

mintcckey commented 1 year ago

> Could you please try to upgrade your ingress-nginx controller to the latest version?

Hi there, we do plan to upgrade, but we would also like to understand the root cause to make sure it has been fixed in more recent versions. Thank you.

tao12345666333 commented 1 year ago

In fact, version 1.4 is no longer supported, so I hope you can upgrade before we look into this again. https://github.com/kubernetes/ingress-nginx#supported-versions-table

github-actions[bot] commented 1 year ago

This is stale, but we won't close it automatically; just bear in mind the maintainers may be busy with other tasks and will reach your issue ASAP. If you have any question or request to prioritize this, please reach out on #ingress-nginx-dev on Kubernetes Slack.

dvaldivia commented 6 months ago

I think this problem is related to the informer shutting down for some reason and pushing this type of message.

Scusemua commented 6 months ago

I've also experienced this issue, though I don't have a good way to reproduce it.

geberl commented 5 months ago

I encountered something similar when my Pod's ClusterRole was missing the watch permission on Pods (here probably Services). The cache.DeletedFinalStateUnknown objects are gone once this permission is present.
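
For anyone checking the RBAC angle: the controller's ServiceAccount needs at least get/list/watch on Services for the informer to work. An illustrative ClusterRole rule (placeholder name, not the shipped ingress-nginx manifest) would look like:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: ingress-nginx-watch-services   # placeholder name, illustrative only
rules:
  - apiGroups: [""]
    resources: ["services"]
    verbs: ["get", "list", "watch"]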

fredleger commented 2 months ago

Maybe something similar in 1.10:

    k8s.io/apimachinery@v0.29.3/pkg/util/runtime/runtime.go:49 +0x6b
panic({0x19c4120?, 0x2ced7f0?})
    runtime/panic.go:770 +0x132
k8s.io/ingress-nginx/internal/ingress/controller/store.New.func21({0x1a77ae0?, 0xc02539d5a0?})
    k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:819 +0x102
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnDelete(...)
    k8s.io/client-go@v0.29.3/tools/cache/controller.go:253
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
    k8s.io/client-go@v0.29.3/tools/cache/shared_informer.go:977 +0x9f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
    k8s.io/apimachinery@v0.29.3/pkg/util/wait/backoff.go:226 +0x33
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000193f70, {0x1f08040, 0xc00030b680}, 0x1, 0xc0001219e0)
    k8s.io/apimachinery@v0.29.3/pkg/util/wait/backoff.go:227 +0xaf
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00052bf70, 0x3b9aca00, 0x0, 0x1, 0xc0001219e0)
    k8s.io/apimachinery@v0.29.3/pkg/util/wait/backoff.go:204 +0x7f
k8s.io/apimachinery/pkg/util/wait.Until(...)
    k8s.io/apimachinery@v0.29.3/pkg/util/wait/backoff.go:161
k8s.io/client-go/tools/cache.(*processorListener).run(0xc000324ab0)
    k8s.io/client-go@v0.29.3/tools/cache/shared_informer.go:966 +0x69
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
    k8s.io/apimachinery@v0.29.3/pkg/util/wait/wait.go:72 +0x52
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 41
    k8s.io/apimachinery@v0.29.3/pkg/util/wait/wait.go:70 +0x73
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
    panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x158 pc=0x17013c2]

2 pods were running as a DaemonSet on dedicated hardware with Proxy Protocol v2 activated, and both crashed at the exact same time.

tao12345666333 commented 2 months ago

Does anyone have steps to reproduce this issue?

fredleger commented 2 months ago

For me it seems to happen randomly: I hadn't seen it in the past and it occurred only once in a few months. The thing is we have a lot of ingresses (~3500) and ingress-nginx has 15Gi RAM allocated. I noticed some kind of GC that frees some RAM on a schedule (my guess, and only a guess: Lua shared objects, since multiple pods show the exact same behavior). Maybe this GC is collecting something it should not?

TwoStone commented 1 week ago

We experienced this issue with version 1.11.2. We are running 4 independent ingress controllers and saw this error on 3 of them within 10 seconds.

E0912 11:44:11.621441       7 store.go:817] unexpected type: cache.DeletedFinalStateUnknown
E0912 11:44:11.621521       7 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 145 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1a64b60, 0x2e19780})
    k8s.io/apimachinery@v0.30.3/pkg/util/runtime/runtime.go:75 +0x85
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc0007aa700?})
    k8s.io/apimachinery@v0.30.3/pkg/util/runtime/runtime.go:49 +0x6b
panic({0x1a64b60?, 0x2e19780?})
    runtime/panic.go:770 +0x132
k8s.io/ingress-nginx/internal/ingress/controller/store.New.func21({0x1b1ef60?, 0xc00d1eb320?})
    k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:819 +0x102
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnDelete(...)
    k8s.io/client-go@v0.30.3/tools/cache/controller.go:253
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
    k8s.io/client-go@v0.30.3/tools/cache/shared_informer.go:983 +0x9f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
    k8s.io/apimachinery@v0.30.3/pkg/util/wait/backoff.go:226 +0x33
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc010eb9f70, {0x1fce640, 0xc000168ff0}, 0x1, 0xc000340c00)
    k8s.io/apimachinery@v0.30.3/pkg/util/wait/backoff.go:227 +0xaf
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000738f70, 0x3b9aca00, 0x0, 0x1, 0xc000340c00)
    k8s.io/apimachinery@v0.30.3/pkg/util/wait/backoff.go:204 +0x7f
k8s.io/apimachinery/pkg/util/wait.Until(...)
    k8s.io/apimachinery@v0.30.3/pkg/util/wait/backoff.go:161
k8s.io/client-go/tools/cache.(*processorListener).run(0xc000134a20)
    k8s.io/client-go@v0.30.3/tools/cache/shared_informer.go:972 +0x69
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
    k8s.io/apimachinery@v0.30.3/pkg/util/wait/wait.go:72 +0x52
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 108
    k8s.io/apimachinery@v0.30.3/pkg/util/wait/wait.go:70 +0x73
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
    panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x158 pc=0x1790242]
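
If I read that trace right, store.go logs "unexpected type: cache.DeletedFinalStateUnknown" at line 817, but execution continues and a nil *Service is dereferenced two lines later. A hypothetical sketch of the kind of pattern that would produce exactly this pair of messages (not the actual store.go code):

svc, ok := obj.(*corev1.Service)
if !ok {
    // yields the "unexpected type: cache.DeletedFinalStateUnknown" log line
    klog.Errorf("unexpected type: %v", obj)
    // missing: an early return or tombstone unwrap, so execution falls through
}
// svc is nil when the assertion failed, so this dereference panics with SIGSEGV
if svc.Spec.Type == corev1.ServiceTypeExternalName {
    // ...
}
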
fredleger commented 1 week ago

@TwoStone do you have something like Prometheus metrics / a Grafana dashboard from the time of the error?

TwoStone commented 1 week ago

> @TwoStone do you have something like Prometheus metrics / a Grafana dashboard from the time of the error?

@fredleger Yes, we have Prometheus metrics for the time of the error. What are you interested in?

fredleger commented 1 week ago

> > @TwoStone do you have something like Prometheus metrics / a Grafana dashboard from the time of the error?
>
> @fredleger Yes, we have Prometheus metrics for the time of the error. What are you interested in?

Whether some GC occurred at that time. I guess you can deduce it from memory consumption, or better from the size of the Lua cache, if I'm not mistaken.

TwoStone commented 1 week ago

[Screenshot 2024-09-17 at 13 05 03]

Unfortunately we don't have the size of the Lua cache, but if you look at the virtual memory, we see a drop on all controllers that showed the panic. So it looks like the same GC operation occurred around this time.

fredleger commented 1 week ago

So that was the track I was also following. It looks like there is some GC involved before the crash. Now it's up to someone who has skills on this repo (which is not my case, unfortunately).