Closed: xigang closed this 5 months ago
All modified and coverable lines are covered by tests.
Comparison is base (1b2c6ed) 52.78% compared to head (9c9c52c) 52.77%.
/cc @ikaven1024 @RainbowMango @XiShanYongYe-Chang
Thanks @xigang /assign
@xigang Do you know how to reproduce it?
When the result channel of watchMux is blocked, it will cause goroutine leak and lock race.
Since the result channel is not buffered, I can understand that the goroutine will be blocked when trying to send an event to it: https://github.com/karmada-io/karmada/blob/d6e28817c381b7524ac0cfabd077dd25226f9410/pkg/search/proxy/store/util.go#L259
Can you remind me why it will cause a goroutine leak and lock race?
Currently, our federated cluster has 18,000+ nodes (and the node count will keep growing). The client uses a NodeInformer to ListWatch the Nodes resource; after running for a period of time (more than 10 hours; I guess the lock contention has intensified by then), the client's watch interface hangs.
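For context, the client setup is roughly the standard client-go node informer pattern. A minimal sketch (the kubeconfig path, which here would point at the karmada-search proxy endpoint, and the zero resync period are illustrative assumptions, not part of the report):

import (
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/klog/v2"
)

func runNodeInformer(stopCh <-chan struct{}) {
	// Kubeconfig pointing at the karmada-search proxy (path is illustrative).
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/karmada-proxy.kubeconfig")
	if err != nil {
		klog.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	factory := informers.NewSharedInformerFactory(client, 0)
	nodeInformer := factory.Core().V1().Nodes().Informer()
	nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { /* handle node add */ },
		UpdateFunc: func(oldObj, newObj interface{}) { /* handle node update */ },
	})

	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)
	<-stopCh
}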
[client side]
First, we added debug logs to the watchHandler of the client-go reflector and found that the client was stuck on w.ResultChan().
loop:
	for {
		select {
		case <-stopCh:
			klog.Infof("reflector stopped")
			return errorStopRequested
		case err := <-errc:
			klog.Infof("watch handler received error: %v", err.Error())
			return err
		case event, ok := <-w.ResultChan():
			if !ok {
				klog.Infof("watch server result chan is closed. break loop")
				break loop
			}
			klog.Infof("consume event success")
			// ... event processing elided ...
		}
	}
reproduce log:
[server side]
Then we added debug logs to serveWatch and WatchServer's ServeHTTP on the server side, and found that when the watch HTTP/2 connection closed, the handler got stuck while executing defer watcher.Stop().
ServeHTTP:
	for {
		select {
		case <-done:
			klog.Infof("client %v watch server shutdown.", userAgent)
			return
		case <-timeoutCh:
			klog.Infof("client %v watch server timeout.", userAgent)
			return
		case event, ok := <-ch:
			if !ok {
				// End of results.
				klog.Infof("client %v watch server closed.", userAgent)
				return
			}
			// ... event serialization elided ...
		}
	}
serveWatch:
// serveWatch will serve a watch response.
// TODO: the functionality in this method and in WatchServer.Serve is not cleanly decoupled.
func serveWatch(watcher watch.Interface, scope *RequestScope, mediaTypeOptions negotiation.MediaTypeOptions, req *http.Request, w http.ResponseWriter, timeout time.Duration) {
	klog.Infof("watcher %v connecting", req.UserAgent())
	defer func() {
		klog.Infof("watcher %v stop done in serveWatch", req.UserAgent()) // never executed
	}()
	defer watcher.Stop() // blocks here
	defer func() {
		klog.Infof("watcher %v http2 closed.", req.UserAgent()) // this defer is executed
	}()
	// ...
}
reproduce log:
Finally, through the goroutine profile, we found that a large number of goroutines were stuck in runtime_SemacquireMutex.
1892 @ 0x439936 0x44a80c 0x44a7e6 0x466ba5 0x1a1a5fd 0x1a1a5de 0x1a1a32a 0x46acc1
# 0x466ba4 sync.runtime_SemacquireMutex+0x24 /usr/lib/golang/src/runtime/sema.go:71
# 0x1a1a5fc sync.(*RWMutex).RLock+0x7c /usr/lib/golang/src/sync/rwmutex.go:63
# 0x1a1a5dd go/src/github.com/karmada/pkg/search/proxy/store.(*watchMux).startWatchSource.func1+0x5d /go/src/go/src/github.com/karmada/pkg/search/proxy/store/util.go:229
# 0x1a1a329 go/src/github.com/karmada/pkg/search/proxy/store.(*watchMux).startWatchSource+0xe9 /go/src/go/src/github.com/karmada/pkg/search/proxy/store/util.go:237
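For reference, a goroutine dump in that aggregated format can be produced with Go's runtime/pprof; when the standard net/http/pprof handler is enabled in the process, the same dump is also served at /debug/pprof/goroutine?debug=1. A generic sketch, not karmada-specific code:

import (
	"os"
	"runtime/pprof"
)

// dumpGoroutines writes the goroutine profile to stdout. debug=1 aggregates
// identical stacks with a count, producing lines like "1892 @ 0x..." above.
func dumpGoroutines() {
	pprof.Lookup("goroutine").WriteTo(os.Stdout, 1)
}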
Why does it block at the fourth renew, not before?
@ikaven1024 In the log, the hang does not appear on the fourth watch specifically, but on the Nth renewal. Because the server periodically closes the watch API connection, the client renews the watch; this prevents the client from always watching a single kube-apiserver.
The leak comes from here: watchMux receives an event from the Cacher, takes the read lock, enters the default case, and waits to send to the result chan.
https://github.com/karmada-io/karmada/blob/67351f48c8e27f635757e3cd00fa9c577f6ee5d9/pkg/search/proxy/store/util.go#L252-L262
Meanwhile the client does not receive this event and closes the watch, calling https://github.com/karmada-io/karmada/blob/3f5c9073ab197dbc83f217448215a767f7fd0e94/vendor/k8s.io/apiserver/pkg/endpoints/handlers/watch.go#L66-L67
Close also wants the lock, so a deadlock occurs.
https://github.com/karmada-io/karmada/blob/67351f48c8e27f635757e3cd00fa9c577f6ee5d9/pkg/search/proxy/store/util.go#L209-L225
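For readers without the linked code open, a minimal sketch of the pattern described above (the type shape follows util.go, but the function names and bodies here are simplified illustrations, not verbatim copies):

import (
	"sync"

	"k8s.io/apimachinery/pkg/watch"
)

// Simplified shape of watchMux, for illustration only.
type watchMux struct {
	lock   sync.RWMutex
	done   chan struct{}
	result chan watch.Event // unbuffered
}

// Source goroutine: holds the read lock while sending on the unbuffered
// result channel; if the client never reads, it parks here forever.
func (w *watchMux) forward(copyEvent watch.Event) {
	w.lock.RLock()
	defer w.lock.RUnlock()
	select {
	case <-w.done:
	default:
		w.result <- copyEvent // blocked: the client is gone
	}
}

// Stop (reached via watcher.Stop() in serveWatch): needs the write lock,
// which can never be granted while forward() is parked holding the read
// lock, so Stop blocks as well. That is the deadlock.
func (w *watchMux) Stop() {
	w.lock.Lock()
	defer w.lock.Unlock()
	select {
	case <-w.done:
	default:
		close(w.done)
		close(w.result)
	}
}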
This can be tested with:
import (
	"testing"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/watch"
)

func TestName(t *testing.T) {
	// watcher from cacher
	cacheW := watch.NewFakeWithChanSize(1, false)

	// client starts watching, and does not receive the event
	clientW := newWatchMux()
	clientW.AddSource(cacheW, func(event watch.Event) {})
	clientW.Start()

	// receive an object from the cacher,
	// while the client does not consume it.
	cacheW.Add(&corev1.Pod{})
	time.Sleep(time.Second)

	// client closes.
	clientW.Stop()
	// DEADLOCK!!!
}
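Running this with a short timeout (e.g. go test -run TestName -timeout 30s) should make the deadlock visible: the test fails with a timeout, and the goroutine dump in the failure output shows the test goroutine parked in watchMux.Stop waiting for the write lock while the source goroutine is parked sending on the unbuffered result channel.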
Does the code below work, without adding a timeout? @xigang
func() {
	// w.lock.RLock()
	// defer w.lock.RUnlock()
	select {
	case <-w.done:
		return
	case w.result <- copyEvent:
	}
}()
@ikaven1024 Thanks for the analysis. Is there any problem with removing RLock when multiple goroutines access w.done and w.result?
If RLock is retained, is it better to use a timeout to reduce lock contention?
Is there any problem with removing RLock when multiple goroutines access w.done and w.result?
It's safe to read, write and close a chan from multiple goroutines.
If RLock is retained, is it better to use a timeout to reduce lock contention?
It will cause the client to break the watch and send a new watch request, increasing the burden on the server.
@ikaven1024 I think it's ok, I resubmitted the code. To avoid the exception, I added an additional recover() in startWatchSource.
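For illustration, the recover() guard on the send path might look roughly like this (a sketch with a hypothetical helper name; the exact committed code may differ). The guard matters because, with the RLock removed, a send can race with Stop() closing the result channel, and an unguarded send on a closed channel panics:

import (
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/klog/v2"
)

// sendWithRecover is a hypothetical helper, not the exact committed code,
// showing how the send in startWatchSource can be guarded with recover().
func sendWithRecover(done <-chan struct{}, result chan<- watch.Event, ev watch.Event) {
	defer func() {
		if r := recover(); r != nil {
			// A send may race with Stop() closing the result channel; swallow
			// the "send on closed channel" panic instead of killing the goroutine.
			klog.Errorf("recovered from panic while sending watch event: %v", r)
		}
	}()
	select {
	case <-done:
	case result <- ev:
	}
}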
Hi @xigang, is the current PR based on fixing #4221?
@XiShanYongYe-Chang Yes, @ikaven1024 opened a new PR #4221 to fix this bug.
Thanks for your quick response @xigang
We usually don't open another PR to replace the current in-review PR. I don't know which PR we should move forward with.
@xigang What do you think?
@RainbowMango Agreed. In order to track this issue, let's continue to fix it in this PR.
Yes, we can focus on fixing this issue and keep the change minimal (so that we can easily backport it to the release branches), and let #4221 focus on additional improvements.
Does that make sense to you? @xigang @ikaven1024
@ikaven1024 Should we migrate the #4221 code to this PR, or use a timeout to temporarily fix this issue?
You can migrate #4221 here, and I will close it.
Done.
/lgtm /approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: ikaven1024
Hi @xigang do you think we need to cherry-pick this patch to the previous branch?
@XiShanYongYe-Chang I think we should cherry-pick this to the previous branch; let me handle it.
Thanks~
What type of PR is this?
/kind bug
What this PR does / why we need it: When the result channel of watchMux is blocked, it will cause a goroutine leak and lock race. This problem causes the client's watch API interface to hang when consuming the watch ResultChan.
fetch goroutine stack:
Through the goroutine stack below, there are 1892 goroutines blocked in runtime_SemacquireMutex.
Refer to how cacheWatcher sends events to its input channel to implement adding a timeout:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/storage/cacher/cache_watcher.go#L216
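For comparison, the timeout approach borrowed from cacheWatcher would look roughly like this (a sketch only; the function name, timeout value and surrounding structure are illustrative assumptions, not the committed code):

import (
	"time"

	"k8s.io/apimachinery/pkg/watch"
)

// trySendWithTimeout attempts a non-blocking send first, then waits up to
// `timeout` before giving up, so the sending goroutine (and any lock it
// holds) is never blocked forever. Modeled on cacheWatcher's add().
func trySendWithTimeout(result chan<- watch.Event, ev watch.Event, timeout time.Duration) bool {
	// Fast path: the receiver is ready.
	select {
	case result <- ev:
		return true
	default:
	}
	// Slow path: bounded wait, then report failure to the caller.
	t := time.NewTimer(timeout)
	defer t.Stop()
	select {
	case result <- ev:
		return true
	case <-t.C:
		return false
	}
}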
Which issue(s) this PR fixes: Fixes #
Special notes for your reviewer: @ikaven1024 @RainbowMango @XiShanYongYe-Chang
Does this PR introduce a user-facing change?: