tikv / pd

Placement driver for TiKV
Apache License 2.0

TestGetTSOImmediately is flaky #8533

Closed lhy1024 closed 1 month ago

lhy1024 commented 2 months ago

Flaky Test

Which jobs are failing

2024-08-15T06:17:50.3934438Z     testutil.go:67: 
2024-08-15T06:17:50.3935833Z            Error Trace:    /home/runner/work/pd/pd/pkg/utils/testutil/testutil.go:67
2024-08-15T06:17:50.3938318Z                                        /home/runner/work/pd/pd/tests/integrations/mcs/tso/keyspace_group_manager_test.go:772
2024-08-15T06:17:50.3939789Z            Error:          Condition never satisfied
2024-08-15T06:17:50.3940870Z            Test:           TestGetTSOImmediately

CI link

https://github.com/tikv/pd/actions/runs/10399520880/job/28798931505

Reason for failure (if possible)

Anything else

rleungx commented 1 month ago
[Screenshot 2024-09-05 at 14 53 13]

One participant seems to keep resetting because of the priority check, while the other participant keeps skipping the campaign because the expected primary is set.

HuSharp commented 1 month ago

There are two TSOs, 36295 and 40455.

  1. 36295 is selected as primary.
  2. 40455's priority is set higher; once 40455's priority check notices this, it starts an election.
  3. 40455 skips the campaign because the Expected Primary has a value.
  4. At this point, 36295 also resigns as primary and deletes the Expected Primary.
  5. Then 36295 campaigns faster and is elected again (which amplifies the time difference mentioned above). The cycle then repeats.
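
The cycle in the steps above can be replayed with a small toy model. This is only a sketch under stated assumptions, not PD's actual election code: `member`, `group`, `retryInterval`, and `priorityCheckRound` are hypothetical names, and the delays are made-up numbers. It assumes that a campaign skipped because the Expected Primary is set costs the preferred node one retry interval, which is enough for the old primary to win the race again.

```go
package main

import (
	"fmt"
	"time"
)

// retryInterval is a hypothetical delay before a node retries a skipped campaign.
const retryInterval = 50 * time.Millisecond

// member is a hypothetical stand-in for one TSO node.
type member struct {
	name          string
	campaignDelay time.Duration // time the node needs before it can campaign
}

// group is a toy model of the election state, not PD's real data structure.
type group struct {
	primary            string
	expectedPrimarySet bool
}

// priorityCheckRound replays steps 3-5 once and returns the name of the winner.
func priorityCheckRound(g *group, old, preferred member) string {
	preferredReady := preferred.campaignDelay
	if g.expectedPrimarySet {
		// Step 3: the preferred node skips its first attempt and retries later.
		preferredReady += retryInterval
	}
	// Step 4: the old primary resigns and deletes the Expected Primary.
	g.primary = ""
	g.expectedPrimarySet = false
	// Step 5: whoever is ready to campaign first wins the race.
	winner := old
	if preferredReady <= old.campaignDelay {
		winner = preferred
	}
	g.primary = winner.name
	return winner.name
}

func main() {
	old := member{name: "36295", campaignDelay: 10 * time.Millisecond}
	preferred := member{name: "40455", campaignDelay: 10 * time.Millisecond}
	g := &group{primary: old.name}
	for round := 1; round <= 3; round++ {
		g.expectedPrimarySet = true // step 2: the priority check starts an election
		fmt.Printf("round %d: %s elected\n", round, priorityCheckRound(g, old, preferred))
	}
}
```

With these made-up delays, 36295 wins every round, so the priority check never converges; in this model the preferred node only wins if it is not penalized by the skipped attempt.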

HuSharp commented 1 month ago

Root cause

Furthermore, the root cause is that the TSO priority check uses ResetLeader instead of moveLeader.

The tso priority checker process is:

The secondary gets elected as the new primary only thanks to a time gap, which is not stable!!!

For example, if the secondary hits I/O jitter and is not elected as the new primary, the old primary is elected again, and the priority check logic runs through its loop once more.
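
To make the instability concrete, here is a minimal sketch that counts how many reset-and-race rounds pass before the secondary wins. It is an illustration only: `roundsUntilElected` is a hypothetical helper, the delays are invented millisecond values, and the model assumes the node with the shorter campaign delay wins each round.

```go
package main

import "fmt"

// roundsUntilElected counts how many reset-and-race rounds pass before the
// secondary wins. secondaryDelaysMs[i] is the secondary's hypothetical
// campaign delay in round i; the old primary always needs primaryDelayMs.
// It returns -1 if the secondary never wins within the given rounds.
func roundsUntilElected(primaryDelayMs int, secondaryDelaysMs []int) int {
	for i, d := range secondaryDelaysMs {
		if d < primaryDelayMs {
			return i + 1
		}
	}
	return -1
}

func main() {
	// The secondary suffers jitter in the first two rounds and is fast in the third.
	fmt.Println("rounds needed:", roundsUntilElected(10, []int{30, 25, 5}))
	// With persistent jitter the secondary is never elected in this model.
	fmt.Println("rounds needed:", roundsUntilElected(10, []int{30, 25, 40}))
}
```

The number of rounds depends entirely on timing, which is why a test that waits a fixed time for the higher-priority node to become primary can end with "Condition never satisfied".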

The better solution is