mailgun / gubernator

High Performance Rate Limiting MicroService and Library
Apache License 2.0
964 stars 99 forks source link

fix: mutex deadlocks in PeerClient #223

Closed miparnisari closed 8 months ago

miparnisari commented 8 months ago

Description

Close https://github.com/mailgun/gubernator/issues/222.

The mutex deadlock is proven by the flaky TestPeerClientShutdown:

[25/02/24 7:09:39] ~/GitHub/gubernator (master) $ go test ./... -run=TestPeerClientShutdown -count=1 -race -timeout=10s
?       github.com/mailgun/gubernator/v2/cmd/gubernator [no test files]
?       github.com/mailgun/gubernator/v2/cmd/gubernator-cli [no test files]
?       github.com/mailgun/gubernator/v2/cmd/gubernator-cluster [no test files]
?       github.com/mailgun/gubernator/v2/cmd/healthcheck    [no test files]
time="2024-02-25T19:09:44-03:00" level=info msg="Starting 12 Gubernator workers..." instance="127.0.0.1:9990"
time="2024-02-25T19:09:44-03:00" level=info msg="GRPC Listening on 127.0.0.1:9990 ..." instance="127.0.0.1:9990"
time="2024-02-25T19:09:44-03:00" level=info msg="HTTP Gateway Listening on 127.0.0.1:9980 ..." instance="127.0.0.1:9990"
time="2024-02-25T19:09:44-03:00" level=info msg="Starting 12 Gubernator workers..." instance="127.0.0.1:9991"
time="2024-02-25T19:09:44-03:00" level=info msg="GRPC Listening on 127.0.0.1:9991 ..." instance="127.0.0.1:9991"
time="2024-02-25T19:09:44-03:00" level=info msg="HTTP Gateway Listening on 127.0.0.1:9981 ..." instance="127.0.0.1:9991"
time="2024-02-25T19:09:44-03:00" level=info msg="Starting 12 Gubernator workers..." instance="127.0.0.1:9992"
time="2024-02-25T19:09:44-03:00" level=info msg="GRPC Listening on 127.0.0.1:9992 ..." instance="127.0.0.1:9992"
time="2024-02-25T19:09:44-03:00" level=info msg="HTTP Gateway Listening on 127.0.0.1:9982 ..." instance="127.0.0.1:9992"
time="2024-02-25T19:09:44-03:00" level=info msg="Starting 12 Gubernator workers..." instance="127.0.0.1:9993"
time="2024-02-25T19:09:44-03:00" level=info msg="GRPC Listening on 127.0.0.1:9993 ..." instance="127.0.0.1:9993"
time="2024-02-25T19:09:44-03:00" level=info msg="HTTP Gateway Listening on 127.0.0.1:9983 ..." instance="127.0.0.1:9993"
time="2024-02-25T19:09:44-03:00" level=info msg="Starting 12 Gubernator workers..." instance="127.0.0.1:9994"
time="2024-02-25T19:09:44-03:00" level=info msg="GRPC Listening on 127.0.0.1:9994 ..." instance="127.0.0.1:9994"
time="2024-02-25T19:09:44-03:00" level=info msg="HTTP Gateway Listening on 127.0.0.1:9984 ..." instance="127.0.0.1:9994"
time="2024-02-25T19:09:44-03:00" level=info msg="Starting 12 Gubernator workers..." instance="127.0.0.1:9995"
time="2024-02-25T19:09:44-03:00" level=info msg="GRPC Listening on 127.0.0.1:9995 ..." instance="127.0.0.1:9995"
time="2024-02-25T19:09:44-03:00" level=info msg="HTTP Gateway Listening on 127.0.0.1:9985 ..." instance="127.0.0.1:9995"
time="2024-02-25T19:09:44-03:00" level=info msg="Starting 12 Gubernator workers..." instance="127.0.0.1:9890"
time="2024-02-25T19:09:44-03:00" level=info msg="GRPC Listening on 127.0.0.1:9890 ..." instance="127.0.0.1:9890"
time="2024-02-25T19:09:44-03:00" level=info msg="HTTP Gateway Listening on 127.0.0.1:9880 ..." instance="127.0.0.1:9890"
time="2024-02-25T19:09:44-03:00" level=info msg="Starting 12 Gubernator workers..." instance="127.0.0.1:9891"
time="2024-02-25T19:09:44-03:00" level=info msg="GRPC Listening on 127.0.0.1:9891 ..." instance="127.0.0.1:9891"
time="2024-02-25T19:09:44-03:00" level=info msg="HTTP Gateway Listening on 127.0.0.1:9881 ..." instance="127.0.0.1:9891"
time="2024-02-25T19:09:44-03:00" level=info msg="Starting 12 Gubernator workers..." instance="127.0.0.1:9892"
time="2024-02-25T19:09:44-03:00" level=info msg="GRPC Listening on 127.0.0.1:9892 ..." instance="127.0.0.1:9892"
time="2024-02-25T19:09:44-03:00" level=info msg="HTTP Gateway Listening on 127.0.0.1:9882 ..." instance="127.0.0.1:9892"
time="2024-02-25T19:09:44-03:00" level=info msg="Starting 12 Gubernator workers..." instance="127.0.0.1:9893"
time="2024-02-25T19:09:44-03:00" level=info msg="GRPC Listening on 127.0.0.1:9893 ..." instance="127.0.0.1:9893"
time="2024-02-25T19:09:44-03:00" level=info msg="HTTP Gateway Listening on 127.0.0.1:9883 ..." instance="127.0.0.1:9893"
panic: test timed out after 10s
running tests:
    TestPeerClientShutdown (10s)
    TestPeerClientShutdown/Global (10s)

goroutine 566 [sync.RWMutex.Lock]:
sync.runtime_SemacquireRWMutex(0xc000fefedc?, 0xbc?, 0xc000fefec8?)
    /opt/homebrew/Cellar/go/1.21.6/libexec/src/runtime/sema.go:87 +0x28
sync.(*RWMutex).Lock(0xc000fefec8)
    /opt/homebrew/Cellar/go/1.21.6/libexec/src/sync/rwmutex.go:152 +0x88
github.com/mailgun/gubernator/v2.(*PeerClient).connect(0xc000fefe00, {0xc00150a1a0?, 0x1?})
    /Users/maria.inesparnisari/GitHub/gubernator/peer_client.go:123 +0x3b4
github.com/mailgun/gubernator/v2.(*PeerClient).getPeerRateLimitsBatch(0xc000fefe00, {0x1054e7e50, 0x1063ddd60}, 0xc00086a200)
    /Users/maria.inesparnisari/GitHub/gubernator/peer_client.go:299 +0x170
github.com/mailgun/gubernator/v2.(*PeerClient).GetPeerRateLimit(0xc000fefe00, {0x1054e7e50, 0x1063ddd60}, 0xc00086a200)
    /Users/maria.inesparnisari/GitHub/gubernator/peer_client.go:199 +0x4fc
github.com/mailgun/gubernator/v2_test.TestPeerClientShutdown.func1.1()
    /Users/maria.inesparnisari/GitHub/gubernator/peer_client_test.go:71 +0x12c
created by github.com/mailgun/gubernator/v2_test.TestPeerClientShutdown.func1 in goroutine 565
    /Users/maria.inesparnisari/GitHub/gubernator/peer_client_test.go:68 +0x3cc

Sometimes this test also gives "panic: send on close channel". It happens because when you call shutdown, we are closing the channel, but we are not preventing any further requests, which will then try to write to that channel.

Testing

With this PR:

25/02/24 7:11:21] ~/GitHub/gubernator (fix-peer-client) $ go test ./... -run=TestPeerClientShutdown -count=10 -race -timeout=5s
?       github.com/mailgun/gubernator/v2/cmd/gubernator [no test files]
?       github.com/mailgun/gubernator/v2/cmd/gubernator-cli [no test files]
?       github.com/mailgun/gubernator/v2/cmd/gubernator-cluster [no test files]
?       github.com/mailgun/gubernator/v2/cmd/healthcheck    [no test files]
ok      github.com/mailgun/gubernator/v2    7.453s
ok      github.com/mailgun/gubernator/v2/cluster    2.966s [no tests to run]
[25/02/24 7:11:37] ~/GitHub/gubernator (fix-peer-client) $
thrawn01 commented 8 months ago

This code is overly complex and I loath changing it. 😭

I'm probably missing something obvious, but where are we demonstrating there is a a deadlock and where is the deadlock?

miparnisari commented 8 months ago

This code is overly complex and I loath changing it. 😭

I'm probably missing something obvious, but where are we demonstrating there is a a deadlock and where is the deadlock?

The deadlock is proven by the flaky test. (If you run go test with -count=100 you will see it)

Sometimes I get "panic: send on close channel". It happens because when you call shutdown, we are closing the channel, but we are not preventing any further requests, which will then try to write to that channel

And sometimes the test times out, it's because one of the lines of code never finishes acquiring the mutex. I forget which line it was

Baliedge commented 8 months ago

@miparnisari Is this PR ready for merge? Not concerned about the go-bench error. But the PR is still in draft mode.

miparnisari commented 8 months ago

@Baliedge i don't know what to do about the newly failing TestLeakyBucketDivBug. It seems to be failing because of the immediately preceding test TestHealthCheck. When I run it in isolation, it passes:

[26/02/24 12:36:35] ~/GitHub/gubernator (fix-peer-client) $ go test ./... -run=TestLeakyBucketDivBug -race -count=1 -p=1
ok      github.com/mailgun/gubernator/v2        2.411s
ok      github.com/mailgun/gubernator/v2/cluster        2.014s [no tests to run]
?       github.com/mailgun/gubernator/v2/cmd/gubernator [no test files]
?       github.com/mailgun/gubernator/v2/cmd/gubernator-cli     [no test files]
?       github.com/mailgun/gubernator/v2/cmd/gubernator-cluster [no test files]
?       github.com/mailgun/gubernator/v2/cmd/healthcheck        [no test files]
[26/02/24 12:36:47] ~/GitHub/gubernator (fix-peer-client) $ 

I have the suspicion that TestHealthCheck is setting lastErr and TestLeakyBucketDivBug is regurgitating that.

Baliedge commented 8 months ago

@Baliedge i don't know what to do about the newly failing TestLeakyBucketDivBug. It seems to be failing because of the immediately preceding test TestHealthCheck.

Since TestHealthCheck causes a server side shutdown and restart, I wonder if it's not ready in time for the next test.

miparnisari commented 8 months ago

@Baliedge make test still fails locally 😭

image
Baliedge commented 8 months ago

@miparnisari I noticed after merging. It does work on retry. If you have an idea how to make it better, let's try it.