python-zk / kazoo

Kazoo is a high-level Python library that makes it easier to use Apache Zookeeper.
https://kazoo.readthedocs.io
Apache License 2.0
1.3k stars 387 forks source link

fix(core): resolve race in IAsyncResult.wait() #487

Closed jgh-ds closed 7 years ago

jgh-ds commented 7 years ago

closes #485

jgh-ds commented 7 years ago

I thought about adding a test, but I didn't do it because a failure would cause the test code to hang forever. However, it would be very simple to do. Should I add one?

jgh-ds commented 7 years ago

I didn't mean to close it.

jgh-ds commented 7 years ago

Updated jgh-ds:master with a new test case for the issue: test_wait_race()

jgh-ds commented 7 years ago

It should be ready to merge now, including a test case.

Thanks.

jgh-ds commented 7 years ago

Added the doc string and removed the unnecessary try/except from the test case. Also verified that failing the test case will not hang the testing process.

jeffwidman commented 7 years ago

I'm going to go ahead and merge. The test is the only part I'm slightly unsure of, but I am confident the race condition exists and your fix looks good.

jgh-ds commented 7 years ago

I've determined that the part of the test which is:

        if not cv.is_set():
            async_result.set("kick")

is not needed (but is harmless). I was thinking that it would be necessary in the failure case, but the assertion failure from eq() keeps it from being run in that case anyway. Since the thread has a timeout anyway, it cannot hang the testing process.

Would it be better to remove it?

jeffwidman commented 7 years ago

Yes, please remove it. No one later will realize it can be removed.

Re: the merge, I accidentally clicked the wrong button, then had to wait til tests passed... once you make this change and tests pass, I will merge.

jgh-ds commented 7 years ago

The dead code in the test is removed now.

jgh-ds commented 7 years ago

Apparently the removed code is needed in Python 3.5 and 3.6 to avoid hanging.

jgh-ds commented 7 years ago

The test failures are not on the new test case. The actual IAsyncResult.wait() code is the same as the first commit, where all the tests pass.

Could other test cases depend on the old (broken) behavior? The latest errors are on test_race_condition_new_partitioner_steals_the_lock

jgh-ds commented 7 years ago

I've verified that the failing test case in test_race_condition_new_partitioner_steals_the_lock does not use the modified IAsyncResult.wait() code. It appears that test_race_condition_new_partitioner_steals_the_lock is sensitive to execution performance (using 0.1 second and 0.2 second time constants). Perhaps the Travis environment is slower now.

jgh-ds commented 7 years ago

It should be good to go now.