redis / rueidis

A fast Golang Redis client that supports Client Side Caching, Auto Pipelining, Generics OM, RedisJSON, RedisBloom, RediSearch, etc.
Apache License 2.0
2.47k stars 158 forks source link

rueidislock CPU highload #602

Closed UncleVic closed 3 months ago

UncleVic commented 3 months ago

When I try to lock a context with WithContext I have an extremely high loading of CPU >100%

I tried to refactor my code and replace WithContext with TryWithContext and a ticker. Thereafter, the CPU load has dropped to 1%

I create a locker with those parameters

    locker, err := rueidislock.NewLocker(rueidislock.LockerOption{
        ClientOption:   rueidis.ClientOption{InitAddress: []string{redisHost}, Password: Password},
        KeyPrefix:      "lock:my_locker",
        KeyValidity:    10 * time.Minute,
        ExtendInterval: 30 * time.Second,
        NoLoopTracking: true,
    })

I attempted to play with different parameters, but unfortunately haven't had a positive result :(

Redis 7.2.4

rueian commented 3 months ago

Hi @UncleVic, which rueidis version are you using?

rueian commented 3 months ago

Besides WithContext, what other methods are you using?

UncleVic commented 3 months ago

Hi @UncleVic, which rueidis version are you using?

v1.0.39

Besides WithContext, what other methods are you using?

I'm using ForceWithContext and now I'm using TryWithContext. They work quickly enough.

UncleVic commented 3 months ago

Now I upgraded the lib to v1.0.43. I have the same result.

rueian commented 3 months ago

Yes, this is a bug and I have found the root cause. I will release a fix soon. Thank you for reporting.

dangngoctam00 commented 3 months ago

hello @rueian ~, could you briefly talk about the root cause? I occasionally could not acquire lock when using WithContext with timeout 10 seconds, here is some information:

I am still trying to find the root cause based on some information such as metrics from the redis cluster and client code. Thank you.

rueian commented 3 months ago

Hi @UncleVic, @dangngoctam00

There was a race in the WithContext which could lead to two issues:

  1. Busy retries, which put a high load on the CPU.
  2. Among these retries, the client-side caching event watcher kept being removed and added, which can lead to missing events and exceeding the deadline.

A fix is out https://github.com/redis/rueidis/pull/604 and there is a pre-release https://github.com/redis/rueidis/releases/tag/v1.0.44-alpha.1 based on the fix.

Please let me know if the fix solves your problems.

dangngoctam00 commented 3 months ago

hello @rueian , about item 2, do you mean monitoring function? I just want to know more and explain the problem I'm facing. Thank you very much.

rueian commented 3 months ago

Hi @dangngoctam00,

Yes, previously at the end of the monitoring function, we removed the watcher anyway if its reference count (g.w) decreased to 0. That could cause we missed some events from client side caching. The new implementation will not remove the watcher until either there is no WithContext or an acquired lock.

dangngoctam00 commented 3 months ago

Hi @rueian , after checking your message and reading the code again, I have some thought and could you help me review it?

The client get error context deadline, based on the code at version v1.0.37, this error is only returned from waitgate function. There are 2 cases:

Do you have any idea about it. Thank you.

rueian commented 3 months ago

Hi @dangngoctam00,

In the v1.0.37, WithContext would return a deadline exceeded error only when the following line https://github.com/redis/rueidis/blob/b514a567d88619c445c188248bae2515d166b0e4/rueidislock/lock.go#L212 was not triggered.

So, the problem was actually two questions:

  1. Did it make sense that the waitgate entered the select case section in your scenario? The first waitgate call shouldn't go into that section but other concurrent waitgate calls should.
  2. Why was the <-g.ch case not triggered? This must be a bug.

Note that we can only fix the bug in the new version.

dangngoctam00 commented 3 months ago

Hi @rueian , I will try to check if there is any concurrency based on your item 1.

Note that we can only fix the bug in the new version.

Yes, I understand, thank you.

rueian commented 3 months ago

Hi @dangngoctam00,

the original CPU highload issue has been solved by the #604. If you have new discoveries, please let me know in a new issue.