Closed: xigang closed this 5 months ago
All modified and coverable lines are covered by tests.
Comparison is base (1b2c6ed) 52.78% compared to head (9c9c52c) 52.77%.
/cc @ikaven1024 @RainbowMango @XiShanYongYe-Chang
Thanks @xigang /assign
@xigang Do you know how to reproduce it?
When the result channel of watchMux is blocked, it will cause goroutine leak and lock race.
Since the result channel is not buffered, I can understand that the goroutine will be blocked when trying to send an event to it: https://github.com/karmada-io/karmada/blob/d6e28817c381b7524ac0cfabd077dd25226f9410/pkg/search/proxy/store/util.go#L259
Can you remind me why it will cause a goroutine leak and lock race?
Currently, our federated cluster has 18,000+ nodes (and the node count will keep growing). The client uses a NodeInformer to ListWatch the Nodes resource; after running for a period of time (more than 10 hours; I guess the lock contention has intensified by then), the client's watch interface hangs.
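For context, the client setup is roughly the standard client-go node informer pattern. A minimal sketch (the kubeconfig path, which here would point at the karmada-search proxy endpoint, and the zero resync period are illustrative assumptions, not part of the report):

import (
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/klog/v2"
)

func runNodeInformer(stopCh <-chan struct{}) {
	// Kubeconfig pointing at the karmada-search proxy (path is illustrative).
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/karmada-proxy.kubeconfig")
	if err != nil {
		klog.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	factory := informers.NewSharedInformerFactory(client, 0)
	nodeInformer := factory.Core().V1().Nodes().Informer()
	nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { /* handle node add */ },
		UpdateFunc: func(oldObj, newObj interface{}) { /* handle node update */ },
	})

	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)
	<-stopCh
}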
[client side]
First, we added debug logs to the watchHandler of the client-go reflector and found that the client was stuck on w.ResultChan().
loop:
	for {
		select {
		case <-stopCh:
			klog.Infof("reflector stopped")
			return errorStopRequested
		case err := <-errc:
			klog.Infof("watch handler received error: %v", err.Error())
			return err
		case event, ok := <-w.ResultChan():
			if !ok {
				klog.Infof("watch server result chan is closed. break loop")
				break loop
			}
			klog.Infof("consume event success")
			// ... event processing elided ...
		}
	}
reproduce log:
[server side]
Then we added debug logs to serveWatch and WatchServer's ServeHTTP on the server side, and found that when the watch HTTP/2 connection closed, the handler got stuck while executing defer watcher.Stop().
ServeHTTP:
	for {
		select {
		case <-done:
			klog.Infof("client %v watch server shutdown.", userAgent)
			return
		case <-timeoutCh:
			klog.Infof("client %v watch server timeout.", userAgent)
			return
		case event, ok := <-ch:
			if !ok {
				// End of results.
				klog.Infof("client %v watch server closed.", userAgent)
				return
			}
			// ... event serialization elided ...
		}
	}
serveWatch:
// serveWatch will serve a watch response.
// TODO: the functionality in this method and in WatchServer.Serve is not cleanly decoupled.
func serveWatch(watcher watch.Interface, scope *RequestScope, mediaTypeOptions negotiation.MediaTypeOptions, req *http.Request, w http.ResponseWriter, timeout time.Duration) {
	klog.Infof("watcher %v connecting", req.UserAgent())
	defer func() {
		klog.Infof("watcher %v stop done in serveWatch", req.UserAgent()) // never executed
	}()
	defer watcher.Stop() // blocks here
	defer func() {
		klog.Infof("watcher %v http2 closed.", req.UserAgent()) // this defer is executed
	}()
	// ...
}
reproduce log:
Finally, through the goroutine profile, we found that a large number of goroutines were stuck in runtime_SemacquireMutex.
1892 @ 0x439936 0x44a80c 0x44a7e6 0x466ba5 0x1a1a5fd 0x1a1a5de 0x1a1a32a 0x46acc1
# 0x466ba4 sync.runtime_SemacquireMutex+0x24 /usr/lib/golang/src/runtime/sema.go:71
# 0x1a1a5fc sync.(*RWMutex).RLock+0x7c /usr/lib/golang/src/sync/rwmutex.go:63
# 0x1a1a5dd go/src/github.com/karmada/pkg/search/proxy/store.(*watchMux).startWatchSource.func1+0x5d /go/src/go/src/github.com/karmada/pkg/search/proxy/store/util.go:229
# 0x1a1a329 go/src/github.com/karmada/pkg/search/proxy/store.(*watchMux).startWatchSource+0xe9 /go/src/go/src/github.com/karmada/pkg/search/proxy/store/util.go:237
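For reference, a goroutine dump in that aggregated format can be produced with Go's runtime/pprof; when the standard net/http/pprof handler is enabled in the process, the same dump is also served at /debug/pprof/goroutine?debug=1. A generic sketch, not karmada-specific code:

import (
	"os"
	"runtime/pprof"
)

// dumpGoroutines writes the goroutine profile to stdout. debug=1 aggregates
// identical stacks with a count, producing lines like "1892 @ 0x..." above.
func dumpGoroutines() {
	pprof.Lookup("goroutine").WriteTo(os.Stdout, 1)
}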
Why does it block at the fourth renew, not before?
@ikaven1024 In the log, the hang does not appear on the fourth watch specifically, but on the Nth renewal. Because the server periodically closes the watch API connection, the client renews the watch; this prevents the client from always watching a single kube-apiserver.
The leak comes from here: watchMux receives an event from the Cacher, takes the read lock, enters the default case, and waits to send to the result chan.
https://github.com/karmada-io/karmada/blob/67351f48c8e27f635757e3cd00fa9c577f6ee5d9/pkg/search/proxy/store/util.go#L252-L262
Meanwhile the client does not receive this event and closes the watch, calling https://github.com/karmada-io/karmada/blob/3f5c9073ab197dbc83f217448215a767f7fd0e94/vendor/k8s.io/apiserver/pkg/endpoints/handlers/watch.go#L66-L67
Close also wants the lock, so a deadlock occurs.
https://github.com/karmada-io/karmada/blob/67351f48c8e27f635757e3cd00fa9c577f6ee5d9/pkg/search/proxy/store/util.go#L209-L225
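For readers without the linked code open, a minimal sketch of the pattern described above (the type shape follows util.go, but the function names and bodies here are simplified illustrations, not verbatim copies):

import (
	"sync"

	"k8s.io/apimachinery/pkg/watch"
)

// Simplified shape of watchMux, for illustration only.
type watchMux struct {
	lock   sync.RWMutex
	done   chan struct{}
	result chan watch.Event // unbuffered
}

// Source goroutine: holds the read lock while sending on the unbuffered
// result channel; if the client never reads, it parks here forever.
func (w *watchMux) forward(copyEvent watch.Event) {
	w.lock.RLock()
	defer w.lock.RUnlock()
	select {
	case <-w.done:
	default:
		w.result <- copyEvent // blocked: the client is gone
	}
}

// Stop (reached via watcher.Stop() in serveWatch): needs the write lock,
// which can never be granted while forward() is parked holding the read
// lock, so Stop blocks as well. That is the deadlock.
func (w *watchMux) Stop() {
	w.lock.Lock()
	defer w.lock.Unlock()
	select {
	case <-w.done:
	default:
		close(w.done)
		close(w.result)
	}
}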
This can be tested with:
import (
	"testing"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/watch"
)

func TestName(t *testing.T) {
	// watcher from cacher
	cacheW := watch.NewFakeWithChanSize(1, false)

	// client starts watching, and does not receive the event
	clientW := newWatchMux()
	clientW.AddSource(cacheW, func(event watch.Event) {})
	clientW.Start()

	// receive an object from the cacher,
	// while the client does not consume it.
	cacheW.Add(&corev1.Pod{})
	time.Sleep(time.Second)

	// client closes.
	clientW.Stop()
	// DEADLOCK!!!
}
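Running this with a short timeout (e.g. go test -run TestName -timeout 30s) should make the deadlock visible: the test fails with a timeout, and the goroutine dump in the failure output shows the test goroutine parked in watchMux.Stop waiting for the write lock while the source goroutine is parked sending on the unbuffered result channel.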
Does the code below work, without adding a timeout? @xigang
func() {
	// w.lock.RLock()
	// defer w.lock.RUnlock()
	select {
	case <-w.done:
		return
	case w.result <- copyEvent:
	}
}()
@ikaven1024 Thanks for the analysis. Is there any problem with removing RLock when multiple goroutines access w.done and w.result?
If RLock is retained, is it better to use a timeout to reduce lock contention?
Is there any problem with removing RLock when multiple goroutines access w.done and w.result?
It's safe to read, write and close a chan from multiple goroutines.
If RLock is retained, is it better to use a timeout to reduce lock contention?
It will cause the client to break the watch and send a new watch request, increasing the burden on the server.
@ikaven1024 I think it's ok, I resubmitted the code. To avoid the exception, I added an additional recover() in startWatchSource.
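For illustration, the recover() guard on the send path might look roughly like this (a sketch with a hypothetical helper name; the exact committed code may differ). The guard matters because, with the RLock removed, a send can race with Stop() closing the result channel, and an unguarded send on a closed channel panics:

import (
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/klog/v2"
)

// sendWithRecover is a hypothetical helper, not the exact committed code,
// showing how the send in startWatchSource can be guarded with recover().
func sendWithRecover(done <-chan struct{}, result chan<- watch.Event, ev watch.Event) {
	defer func() {
		if r := recover(); r != nil {
			// A send may race with Stop() closing the result channel; swallow
			// the "send on closed channel" panic instead of killing the goroutine.
			klog.Errorf("recovered from panic while sending watch event: %v", r)
		}
	}()
	select {
	case <-done:
	case result <- ev:
	}
}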
Hi @xigang, is the current PR based on fixing #4221?
@XiShanYongYe-Chang Yes, @ikaven1024 opened a new PR #4221 to fix this bug.
Thanks for your quick response @xigang
We usually don't open another PR to replace the current in-review PR. I don't know which PR we should move forward with.
@xigang What do you think?
@RainbowMango Agreed. In order to track this issue, let's continue to fix it in this PR.
Yes, we can focus on fixing this issue and keep the change minimal (so that we can easily backport it to the release branches), and let #4221 focus on additional improvements.
Does that make sense to you? @xigang @ikaven1024
@ikaven1024 Should we migrate the #4221 code to this PR, or use a timeout to temporarily fix this issue?
You can migrate #4221 here, and I will close it.
Done.
/lgtm /approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: ikaven1024
Hi @xigang do you think we need to cherry-pick this patch to the previous branch?
@XiShanYongYe-Chang I think we should cherry-pick this to the previous branch; let me handle it.
Thanks~
What type of PR is this?
/kind bug
What this PR does / why we need it: When the result channel of watchMux is blocked, it will cause a goroutine leak and lock race. This problem causes the client's watch API interface to hang when consuming the watch ResultChan.
fetch goroutine stack:
Through the goroutine stack below, there are 1892 goroutines blocked in runtime_SemacquireMutex.
Refer to how cacheWatcher sends events to its input channel to implement adding a timeout:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/storage/cacher/cache_watcher.go#L216
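For comparison, the timeout approach borrowed from cacheWatcher would look roughly like this (a sketch only; the function name, timeout value and surrounding structure are illustrative assumptions, not the committed code):

import (
	"time"

	"k8s.io/apimachinery/pkg/watch"
)

// trySendWithTimeout attempts a non-blocking send first, then waits up to
// `timeout` before giving up, so the sending goroutine (and any lock it
// holds) is never blocked forever. Modeled on cacheWatcher's add().
func trySendWithTimeout(result chan<- watch.Event, ev watch.Event, timeout time.Duration) bool {
	// Fast path: the receiver is ready.
	select {
	case result <- ev:
		return true
	default:
	}
	// Slow path: bounded wait, then report failure to the caller.
	t := time.NewTimer(timeout)
	defer t.Stop()
	select {
	case result <- ev:
		return true
	case <-t.C:
		return false
	}
}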
Which issue(s) this PR fixes: Fixes #
Special notes for your reviewer: @ikaven1024 @RainbowMango @XiShanYongYe-Chang
Does this PR introduce a user-facing change?: