google / seesaw

Seesaw v2 is a Linux Virtual Server (LVS) based load balancing platform.
Apache License 2.0
5.63k stars 511 forks source link

Fix the race condition in TestSyncHeartbeats test case #93

Closed unicell closed 4 years ago

unicell commented 4 years ago

Context: It is found in CI pipeline that TestSyncHeartbeats case could fail occasionally, with the following error message.

--- FAIL: TestSyncHeartbeats (6.53s)
    sync_test.go:156: Got Desynchronisation sync note
    sync_test.go:154: Got sync note Desynchronisation, want Heartbeat
    sync_test.go:156: Got Desynchronisation sync note
    sync_test.go:156: Got Heartbeat sync note

And it is caused by the 3 lines this change is trying to remove.

Analysis: syncClient.poll() triggers an asynchronous rpc call on Poll method, which may (rarely happen) or may not (most cases) completes before syncClient got disabled. Should Poll actually succeeded, sync client will then receive an extra SNTDesync note, and fail the test case with the above message.

This change removes the racy lines from the test case and let the rest code in TestSyncHeartbeats cover the flow of heartbeat note.

DrJosh9000 commented 4 years ago

I haven't looked closely at all this, but looking through history, this part of the test was apparently added to detect a potential deadlock caused by holding the sync client lock while writing to the start or quit channels. One of our changes that hasn't been upstreamed yet added a bunch of helpful comments, and made it more precise with dispatcher.nextNote (e.g. calling dispatcher.nextNote between the client enable/disable here, and checking it is SNTDesync), as opposed to removing this.

unicell commented 4 years ago

@DrJosh9000 That makes sense. Thanks for sharing! Cause in its current form, it was not clear to me what it is testing against.

So would like to hear your recommendation here? I can either add a similar logic here to test SNTDesync, or hold on this one and wait for the internal patch to be upstreamed.

DrJosh9000 commented 4 years ago

I'll try to have a PR upstreaming the patch today, but in the meantime merging this LGTM.