sewenew / redis-plus-plus

Redis client written in C++
Apache License 2.0

[QUESTION] Redlock seems to be invalid #388

Closed shihao-thx closed 2 years ago

shihao-thx commented 2 years ago

Describe the problem
I used redlock, but it seems to be invalid. In my scenario there are four servers; each one acquires the lock, increments the revision, and releases the lock after publishing the previous revision. Yet the same revision shows up twice in the subscriber. The log:

in watchHandler, published rev = 405, handler_rev = 405
in watchHandler, published rev = 405, handler_rev = 405
in watchHandler, published rev = 405, handler_rev = 405
in watchHandler, published rev = 405, handler_rev = 405
handled_rev_.store(406)
handled_rev_.store(406)
handled_rev_.store(406)
handled_rev_.store(406)
in watchHandler, published rev = 405, handler_rev = 406
in watchHandler, published rev = 405, handler_rev = 406
in watchHandler, published rev = 405, handler_rev = 406
in watchHandler, published rev = 405, handler_rev = 406
in watchHandler, published rev = 407, handler_rev = 406
in watchHandler, published rev = 407, handler_rev = 406
in watchHandler, published rev = 407, handler_rev = 406
in watchHandler, published rev = 407, handler_rev = 406

As you can see, published rev = 405 shows up a second time, when it should have been 406. The way I use redlock:

1. Initialize

redis::ConnectionOptions opts;
opts.host = endpoint_host_;
opts.port = endpoint_port_;

redis::ConnectionPoolOptions pool_opts;
pool_opts.size = 3;
pool_opts.wait_timeout = std::chrono::seconds(600);
// async redis client
redis_client.reset(new redis::AsyncRedis(opts, pool_opts));
// sync redis client
syncredis_client.reset(new redis::Redis(opts, pool_opts));
// watch sync redis client
watch_client.reset(new redis::Redis(opts, pool_opts));

mtx.reset(new redis::RedMutex(*syncredis_client, "resource"));
redlock.reset(new redis::RedLock<redis::RedMutex>(*mtx, std::defer_lock));

2. Usage

Only one PUBLISH and one INCR command are called in the commitUpdates function below.

void RedisMetaService::commitUpdates(
    const std::vector<op_t>& changes,
    callback_t<unsigned> callback_after_updated) {
  // If rev or op_prefix is not initialized, errors have already happened.
  std::string rev;
  std::string op_prefix;
  unsigned irev = 0;
  redlock_->try_lock(std::chrono::seconds(600));
  // the operation number is ready to publish
  rev = *redis_->get("redis_revision").get();
  redis_->incr("redis_revision").get();
  op_prefix = "op" + rev;
  irev = static_cast<unsigned>(std::stol(rev));

  ......

  redis_->command<long long>(
                              "PUBLISH", "operations", irev,
                              [self, callback_after_updated,
                              irev](redis::Future<long long>&& resp) {
  });

  ......

  redlock_->try_unlock();
}

I've simplified the usage here; if you have more time, please check the source code in my project: requestLock, commitUpdates, and the call to requestLock. Thank you very much.

Regards

Environment:
- OS: WSL (Ubuntu 20.04) in Docker
- Compiler: GCC 9.4.0
- hiredis version: the latest version
- redis-plus-plus version: the latest version

sewenew commented 2 years ago

Sorry, but I cannot reproduce your problem with the following code, which implements the logic you described above.

Your code might have a problem: you didn't check how much time is still left for your operations. You can try the following code; if the problem remains, feel free to let me know.

#include <pthread.h>

#include <cstdio>
#include <iostream>
#include <string>
#include <thread>
#include <vector>

#include <sw/redis++/redis++.h>
// Depending on your redis-plus-plus version, RedMutex and RedLock may also
// need the redlock pattern header, e.g. <sw/redis++/patterns/redlock.h>.

using namespace std;
using namespace sw::redis;

void worker_func() {
        ConnectionOptions opts;
        opts.host = "127.0.0.1";
        opts.port = 6379;
        ConnectionPoolOptions pool_opts;
        pool_opts.size = 3;
        auto r = Redis(opts, pool_opts);
        for (auto idx = 0; idx < 20; ++idx) {
                try {
                        RedMutex mtx(r, "rev-lock");
                        RedLock<RedMutex> lock(mtx, std::defer_lock);
                        lock.try_lock(std::chrono::seconds(30));

                        // Ensure we have enough time to do the following operations.
                        // The timeout depends on the status of your network.
                        auto time_left = lock.ttl();
                        if (time_left > std::chrono::seconds(1)) {
                                auto rev = r.get("rev");
                                r.incr("rev");
                                fprintf(stderr, "tid: %s, get rev %s\n", to_string(pthread_self()).data(), rev->data());
                                r.publish("operations", "incr rev: " + *rev);
                        } else {
                                fprintf(stderr, "tid: %s, time left: %ld\n", to_string(pthread_self()).data(), time_left.count());
                        }
                        // lock will be released automatically.
                } catch (const Error &e) {
                        fprintf(stderr, "tid: %s, failed to lock: %s\n", to_string(pthread_self()).data(), e.what());
                }
        }
}

int main() {
        try {
                ConnectionOptions opts;
                opts.host = "127.0.0.1";
                opts.port = 6379;
                ConnectionPoolOptions pool_opts;
                pool_opts.size = 3;
                auto r = Redis(opts, pool_opts);
                // Ensure the rev number exists and is reset to 0.
                r.set("rev", "0");
                auto sub = r.subscriber();
                sub.on_message([](string channel, string msg) {
                                        fprintf(stderr, "tid: %s, channel: %s, msg: %s\n", to_string(pthread_self()).data(), channel.data(), msg.data());
                                });
                sub.subscribe("operations");
                vector<thread> workers;
                for (auto idx = 0; idx < 4; ++idx) {
                        workers.emplace_back(worker_func);
                }

                while (true) {
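                        // consume() blocks waiting for published messages, so this loop never returns.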
                        sub.consume();
                }

                for (auto &worker : workers) {
                        worker.join();
                }
        } catch (const Error &e) {
                cerr << "failed to run app: " << e.what() << endl;
        }
        return 0;
}

The output should be something like the following:

tid: 139665058457344, get rev 0
tid: 139665075246912, channel: operations, msg: incr rev: 0
tid: 139665066850048, time left: -5552361
tid: 139665050064640, time left: -5552363
tid: 139665058457344, get rev 1
tid: 139665066850048, time left: -5552368
tid: 139665075242752, time left: -5552369
tid: 139665075246912, channel: operations, msg: incr rev: 1
tid: 139665050064640, time left: -5552371
tid: 139665066850048, time left: -5552372
tid: 139665075242752, time left: -5552372
tid: 139665058457344, get rev 2
tid: 139665066850048, time left: -5552377
tid: 139665075242752, time left: -5552378
tid: 139665075246912, channel: operations, msg: incr rev: 2
tid: 139665066850048, time left: -5552380
tid: 139665075242752, time left: -5552381
tid: 139665075242752, time left: -5552387
tid: 139665058457344, get rev 3
tid: 139665066850048, time left: -5552388
tid: 139665075246912, channel: operations, msg: incr rev: 3
tid: 139665050064640, time left: -5552395
tid: 139665066850048, time left: -5552395
tid: 139665075242752, get rev 4
tid: 139665075246912, channel: operations, msg: incr rev: 4
tid: 139665058457344, time left: -5552398
tid: 139665058457344, get rev 5
tid: 139665075242752, time left: -5552402
tid: 139665066850048, time left: -5552403
tid: 139665075246912, channel: operations, msg: incr rev: 5
tid: 139665075242752, time left: -5552405
tid: 139665058457344, get rev 6
tid: 139665075242752, time left: -5552414
tid: 139665050064640, time left: -5552416
tid: 139665075246912, channel: operations, msg: incr rev: 6
tid: 139665075242752, time left: -5552418
tid: 139665050064640, time left: -5552418
tid: 139665050064640, get rev 7
tid: 139665075246912, channel: operations, msg: incr rev: 7
tid: 139665050064640, get rev 8
tid: 139665075246912, channel: operations, msg: incr rev: 8
tid: 139665058457344, time left: -5552427
tid: 139665050064640, get rev 9
tid: 139665075246912, channel: operations, msg: incr rev: 9
tid: 139665050064640, get rev 10
.......

Regards

shihao-thx commented 2 years ago

OK, thanks a lot!

Regards

shihao-thx commented 2 years ago
            try {
                    RedMutex mtx(r, "rev-lock");
                    RedLock<RedMutex> lock(mtx, std::defer_lock);
                    lock.try_lock(std::chrono::seconds(30));

                    // Ensure we have enough time to do the following operations.
                    // The timeout depends on the status of your network.
                    auto time_left = lock.ttl();
                    if (time_left > std::chrono::seconds(1)) {

Hi, @sewenew. I don't use the check above for how much time is still left for my operations, because I lock with a long timeout, redlock_->try_lock(std::chrono::seconds(6000));, and my program doesn't run that long. Is that OK?

To verify my guess that the redlock is invalid, I deleted the redlock_->unlock(); at the end of the commitUpdates function, and I found the program still ran as before. But I'd argue it should have blocked, because there is no unlock and the lock time is so long. What happened is strange; maybe it's due to bugs that are hard to find in my program, which uses a lot of boost::io_context::post to achieve concurrency, although I stripped a lot of that away to verify the redlock.

I notice that you use the destructor to unlock implicitly, but to extend the redlock's lifetime to the whole program, I hold it in a std::shared_ptr<redis::RedLock<redis::RedMutex>> redlock_; so no unlock should happen until the program finishes, right?

mtx.reset(new redis::RedMutex(syncredis_client, "resource"));
redlock.reset(new redis::RedLock(mtx, std::defer_lock));

Could you give me more suggestions to debug or something else? Thanks a lot.

Regards

shihao-thx commented 2 years ago

Here we can clearly see that two processes (from the PIDs) got the same revision. The redlock didn't prevent the other process from getting the revision after one process had INCRed it.

handled_rev_.store(402)
     in watchHandler, published rev = 18950:402, handler_rev = 402
     in watchHandler, published rev = 18950:402, handler_rev = 402
     in watchHandler, published rev = 18950:402, handler_rev = 402
     in watchHandler, published rev = 18950:402, handler_rev = 402
handled_rev_.store(403)
     in watchHandler, published rev = 18949:402, handler_rev = 403
handled_rev_.store(403)
     in watchHandler, published rev = 18949:402, handler_rev = 403
handled_rev_.store(403)
     in watchHandler, published rev = 18949:402, handler_rev = 403
handled_rev_.store(403)
     in watchHandler, published rev = 18949:402, handler_rev = 403

shihao-thx commented 2 years ago

My OS environment is WSL (Ubuntu 20.04) running in Docker.

Regards

sewenew commented 2 years ago

My bad! I double-checked the code, and the documentation is incorrect!

RedLock::try_lock returns true if it locks successfully; otherwise, it returns false (instead of throwing an exception). So you always need to check the return value of try_lock.

if (lock.try_lock(chrono::seconds(60))) {
    // lock OK
} else {
    // failed to lock.
}
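
Applied to your commitUpdates snippet, the check would be roughly this (just a sketch, reusing the names from your code):

if (!redlock_->try_lock(std::chrono::seconds(600))) {
    // Failed to acquire the lock: abort instead of reading and
    // INCRing redis_revision without mutual exclusion.
    return;
}
// It's now safe to read and INCR redis_revision.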

So sorry about the misleading comments above...

Regards

shihao-thx commented 2 years ago

RedLock::try_lock returns true if it locks successfully; otherwise, it returns false (instead of throwing an exception). So you always need to check the return value of try_lock.

Thanks for your help. Yes, the lock failed in my program, but why did it fail, and how can I solve it? That may be the key to fixing my program's bug.

Regards

sewenew commented 2 years ago

As I mentioned, you need to test whether try_lock returns true (see the code I posted above). If it returns true, you locked successfully, and you can continue your work. Otherwise, you failed to get the lock, and should abort your work.

Regards

sewenew commented 2 years ago

So far, there's no way to block until you get the lock, unless you do that in your application code:

while (!lock.try_lock(chrono::seconds(60))) {
    // continue to try. maybe sleep a while before retrying.
}

// Finally, we got the lock.
do_work();
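
If you don't want to spin forever, you can also bound the retries with a small helper. Just a sketch (lock_with_retry is an illustrative name; the retry count and sleep interval are arbitrary):

#include <chrono>
#include <thread>

bool lock_with_retry(RedLock<RedMutex> &lock, int max_retries) {
    for (auto i = 0; i < max_retries; ++i) {
        if (lock.try_lock(std::chrono::seconds(60))) {
            return true;    // got the lock
        }
        // Back off briefly before the next attempt.
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
    return false;   // give up after max_retries attempts
}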

Regards

shihao-thx commented 2 years ago

OK, thanks a lot! But for unlock, should there also be a while loop to block?

Regards

shihao-thx commented 2 years ago

I notice unlock returns void. Maybe only handling the exception is OK?

sewenew commented 2 years ago

Yes, no need to do that for unlock.
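
Something like this should be enough (Error is the base exception type of redis-plus-plus; even if unlock fails, the lock will expire on its own once its TTL is reached):

try {
    lock.unlock();
} catch (const Error &e) {
    fprintf(stderr, "failed to unlock: %s\n", e.what());
}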

Regards

sewenew commented 2 years ago

Since there's no update, I'll close this issue.

Regards