python-zk / kazoo

Kazoo is a high-level Python library that makes it easier to use Apache Zookeeper.
https://kazoo.readthedocs.io
Apache License 2.0
1.3k stars 387 forks source link

Zookeeper sequence node overflow causes perpetual re-election #730

Open angxiang opened 1 year ago

angxiang commented 1 year ago

Expected Behavior

When the sequence zookeeper node counter overflows (as documented in: https://zookeeper.apache.org/doc/r3.6.0/zookeeperProgrammers.html#Sequence+Nodes+--+Unique+Naming), it would be expected that there is one re-election from 2147483647 to -2147483648. Afterwards, incrementing in the negative space should be stable again and not causing unexpected re-elections.

Actual Behavior

When overflowing and reaching the negative space starting from -2147483648, we observed perpetual leader re-elections. With each new added sequence number, the new instance would register itself as the new leader.

When digging into the kazoo code, it seems to be a typing issue, as the sequence is matched using a regex but not cast to an integer, meaning checking for the smallest sequence number is a String comparison, i.e. "-2147483648" > "-2147483647" is True.

Snippet to Reproduce the Problem

N/A

Logs with logging in DEBUG mode

N/A

Specifications

angxiang commented 1 year ago

I have created a pull request but it is currently marked with failing checks due to:

Please have a look at the PR anyway :) Thanks!

ceache commented 1 year ago

Hi Angie,

Thank you for reporting the issue. You are however incorrect in your approach to the resolution. Serial Number Arithmetics ( https://en.wikipedia.org/wiki/Serial_number_arithmetic) should be used to manage sequence numbers.

For instance, with your PR, once the sequence numbers are negative, the new contenders would always "win" since they would have a smaller integer number than the last positive sequence.

I have/had a draft solution for the issue I had reported here https://github.com/python-zk/kazoo/issues/606. Would you be interested in testing it out?

Cheers,

On Mon, Oct 9, 2023 at 11:02 AM Angie Xiang @.***> wrote:

I have created a pull request but it is currently marked with failing checks due to:

Please have a look at the PR anyway :) Thanks!

— Reply to this email directly, view it on GitHub https://github.com/python-zk/kazoo/issues/730#issuecomment-1753181530, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAIFTHXGJQO2EOPKWGPTTALX6QGZZAVCNFSM6AAAAAA5YPTXY6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTONJTGE4DCNJTGA . You are receiving this because you are subscribed to this thread.Message ID: @.***>

-- Charles-Henri de Boysson

angxiang commented 1 year ago

Hi @ceache ,

thank you for the quick reply! I have a question regarding:

once the sequence numbers are negative, the new contenders would always "win" since they would have a smaller integer number than the last positive sequence

Please correct me if I am wrong, but this should only happen exactly once when the wrap around happens. Our use case in which we have observed the issue is electing a new leader instance. We hit the overflow case and observed that when >1 instance was running, they were stuck in a perpetual loop of "new leader is forcibly elected -> old leader is killed -> new instance comes online and registers as new leader".

The PR I opened should work as a bug fix for our case, because we would then have exactly one forced re-election when wrapping around from 2147483647 to -2147483648. But after that, with every increment the new negative sequence number only gets larger, and the smallest sequence number still remains at -2147483648. In a concrete example min(2147483646, 2147483647, -2147483648, -2147483647) = -2147483648.

Am I missing a different use case in which my change could become an issue?

ceache commented 1 year ago

Hi,

You keep saying "leader election" but you mean lock owner ("leader election" has special meaning in Zookeeper).

So, in your example, let's say we have 3 clients contending for a lock, A, B and C.

A gets there first, the sequence number is 2147483646, and he obtains the lock (enters critical section). B, in turn, tries to acquire the lock, its sequence number is 2147483647, it can't since A has a lower sequence number so it waits. C also tries, its sequence number is now, -2147483648, it thinks it has the lowest number so it "obtains" the lock. We now have A and C in the critical section we were trying to protect.

Your patch improves the situation because it considers them as integers so it at least has the "concept" of negative number increase right.

min(2147483646, 2147483647, -2147483648, -2147483647) -2147483648

Without it every new contender is the "lowest" once the sequence numbers wrap into negative.

min("2147483646", "2147483647", "-2147483648", "-2147483647") '-2147483647' But the issue, even if it is a "once in a wrap" issue, remains with two contenders entering the critical section.

I hope this helps explain what I was trying to say.

Cheers,

On Mon, Oct 9, 2023 at 11:45 AM Angie Xiang @.***> wrote:

Hi @ceache https://github.com/ceache ,

thank you for the quick reply! I have a question regarding:

once the sequence numbers are negative, the new contenders would always "win" since they would have a smaller integer number than the last positive sequence

Please correct me if I am wrong, but this should only happen exactly once when the wrap around happens. Our use case in which we have observed the issue is electing a new leader instance. We hit the overflow case and observed that when >1 instance was running, they were stuck in a perpetual loop of "new leader is forcibly elected -> old leader is killed -> new instance comes online and registers as new leader".

The PR I opened should work as a bug fix for our case, because we would then have exactly one forced re-election when wrapping around from 2147483647 to -2147483648. But after that, with every increment the new negative sequence number only gets larger, and the smallest sequence number still remains at -2147483648. In a concrete example min(2147483646, 2147483647, -2147483648, -2147483647) = -2147483648.

Am I missing a different use case in which my change could become an issue?

— Reply to this email directly, view it on GitHub https://github.com/python-zk/kazoo/issues/730#issuecomment-1753254236, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAIFTHTSQAMFI5NHQY4Z6O3X6QL2BAVCNFSM6AAAAAA5YPTXY6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTONJTGI2TIMRTGY . You are receiving this because you were mentioned.Message ID: @.***>

-- Charles-Henri de Boysson

angxiang commented 1 year ago

Hi,

yes, sorry for the confusing terminology, I should be saying "lock owner" 👍 I kept saying "leader election" because we use it to elect a leader in our system. I think I understand what you mean, we may be looking at the problem from different angles?

If I am not mistaken, what you are saying is that my change does not produce the correct result in this critical section. With the example you gave above: A = 2147483646, B = 2147483647 and C = -2147483648, the correct owner of the lock is A, but the actual owner is C. Even continuing with this thought, if A is killed and there is a new instance A_new trying to acquire the lock, it would be A_new = -2147483647 and the lock owner would still be C, which would still be an incorrect state because now B should be the owner.

However my approach is from a user perspective, because my goal was to stabilise the whole system's state once this overflow is reached. Without the proposed change, the lock owner is perpetually changed as with every single new instance, the lock is handed over, since we are comparing Strings. This caused our system to become unusable a few days ago until we deleted and recreated the znode, thus resetting the sequence. The change would prevent what we encountered and prevent a major incident, even though the result of who gets to acquire the lock is technically incorrect while in this critical section.

I agree that the serial number arithmetics which you have linked would probably be the better and more correct solution. From our side, this is not a critical fix anymore as it took us 6 years to reach the point of overflow, so I expect that we will have another few years again before it becomes a problem. Nevertheless it could be a good intermediate fix, depending on how long / difficult it is to implement the sequence number arithmetics, as it does improve the situation, does not worsen it in any way and may prevent a major incident in someone else's system.

Best regards, Angie

PS: There will be a delay in response until tomorrow as it is already past EOB in Central Europe, sorry about that! If there is a new reply, I will get back to you tomorrow 👋 PPS: issue #606 has no linked change, but I would be interested in seeing your draft! Is there a link to it?

angxiang commented 1 year ago

Hi,

I have taken another look at this issue and have a few more questions.

A gets there first, the sequence number is 2147483646, and he obtains the lock (enters critical section). B, in turn, tries to acquire the lock, its sequence number is 2147483647, it can't since A has a lower sequence number so it waits. C also tries, its sequence number is now, -2147483648, it thinks it has the lowest number so it "obtains" the lock. We now have A and C in the critical section we were trying to protect. [...] But the issue, even if it is a "once in a wrap" issue, remains with two contenders entering the critical section.

Just to confirm: Your issue with my change casting the sequence numbers to an integer is that the result as to who gets to acquire the lock is wrong: The sequence number that is still at 2147483646 should be the lock owner but -2147483648 is the actual lock owner. Or is there another issue that I am overlooking?

I have read the Sequence Number Arithmetic article that you linked, but if I am not mistaken, it doesn't describe a full solution either; it ends with:

All sequence number arithmetic must deal with "wrapping" of sequence numbers; the number 2N−1 is equidistant in both directions, in RFC 1982 sequence number terms. In our math, they are both considered to be "less than" each other:

distance1 = (signed)(0x8000 - 0x0)    == (signed)0x8000 == -32768 < 0
distance2 = (signed)(0x0    - 0x8000) == (signed)0x8000 == -32768 < 0

This is obviously true for any two sequence numbers with distance of 0x8000 between them.

And then does not propose a solution for this edge case. So in any case, even with sequence number arithmetics, we still have an edge case that produces strange results.

When looking at it from a different angle:

I have a proposal based on this comment taken from kazoo/recipe/lock.py

# Only consider contenders with a smaller sequence number. # A contender with a smaller sequence number has a higher # priority.

With the integer cast in my PR, this statement is mathematically correct when looking at the number in the lock contender name. From a human sense standpoint, I would expect 2147483647 to be followed by 2147483648, but this is not the case for integers constrained by 32 bit. Taking the string number and then casting to int still fulfils the "contract" stated in this comment.

What do you think?

Best regards, Angie

angxiang commented 1 month ago

Hello,

we have encountered this issue again. If my understanding is correct, the proposed pull request is not being accepted as it is behaving incorrectly during the overflow period, though it would be stable.

The outlined solution in https://en.wikipedia.org/wiki/Serial_number_arithmetic mentions a remaining edge case regarding equidistant numbers, as well as a difference in the implementation depending on the machine's integer size. We're unsure how zookeeper behaves on different machines (i.e. 32-bit vs 64-bit). Is there already a fix in the works on your end? You mentioned having a draft change in relation to #606, but it isn't linked there.

Best regards, Angie