support using failpoints from unit tests without requiring serial test execution

rodesai commented 3 weeks ago

Is your feature request related to a problem? Please describe.

Currently, failpoints cannot be used from tests unless all tests that hit failpoints are executed serially. This essentially prevents us from using failpoints in our unit tests without forcing them to run on a single test thread. The suggested approach is to place failpoint tests under the tests directory so that they are executed as Rust integration tests. However, this means the tests cannot use any interfaces that the crate does not expose, which makes it difficult to write most of our test cases.

Describe the solution you'd like

The reason for this restriction is that failpoints uses a global failpoint registry to control fault injection. It would be nice if there were a way to set up failpoints so that they use a test-specific registry. One approach is to support specifying the failpoint registry in calls to the fail_point! macros, e.g.

let registry = /* construct or accept a passed-in failpoint registry */;
fail_point!(&registry, "fail-a-thing", |_| std::io::Error::new(...));
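
To make the proposal concrete, here is a minimal, self-contained sketch of what a test-specific registry could look like. It does not use the fail-rs API: `FailPointRegistry`, `enable`, `is_enabled`, and `write_block` are hypothetical names invented for illustration, and the failpoint check is a plain method call rather than the `fail_point!` macro.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

/// Hypothetical per-test failpoint registry (an assumption, not the fail-rs API).
#[derive(Clone, Default)]
pub struct FailPointRegistry {
    // Names of the failpoints that are currently armed in this registry.
    armed: Arc<Mutex<HashSet<String>>>,
}

impl FailPointRegistry {
    pub fn new() -> Self {
        Self::default()
    }

    /// Arms the named failpoint in this registry only.
    pub fn enable(&self, name: &str) {
        self.armed.lock().unwrap().insert(name.to_string());
    }

    /// Returns true if the named failpoint is armed in this registry.
    pub fn is_enabled(&self, name: &str) -> bool {
        self.armed.lock().unwrap().contains(name)
    }
}

/// Hypothetical code under test: it receives the registry as a parameter
/// instead of consulting process-wide state.
pub fn write_block(registry: &FailPointRegistry, data: &[u8]) -> std::io::Result<usize> {
    if registry.is_enabled("fail-a-thing") {
        return Err(std::io::Error::new(
            std::io::ErrorKind::Other,
            "injected failure",
        ));
    }
    Ok(data.len())
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn write_fails_when_failpoint_is_armed() {
        let registry = FailPointRegistry::new();
        registry.enable("fail-a-thing");
        assert!(write_block(&registry, b"abc").is_err());
    }

    #[test]
    fn write_succeeds_with_a_fresh_registry() {
        // Each test owns its registry, so this test is unaffected by the one above.
        let registry = FailPointRegistry::new();
        assert_eq!(write_block(&registry, b"abc").unwrap(), 3);
    }
}
```

Because each test constructs its own registry, the two tests share no failpoint state and can run under the default parallel test runner. They also live in the same file as the code under test, so no private interfaces need to be exposed.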

Describe alternatives you've considered

We considered running tests serially and moving our fault tests to the tests directory. Running tests serially might work for now, but could lead to longer build times later. The bigger problem is that tests fail when run with the default (parallel) settings, so developers in our project would always need to remember to run tests with a single thread and configure their IDE to do the same, which is painful. Putting tests in the tests directory is not ideal because it requires us to expose a lot of interfaces from our crate that we would rather keep private, just to write the tests we want to write.

BusyJay commented 3 weeks ago

See also #51. In TiKV, we solve this by splitting test cases into different groups and launching multiple processes to run those test groups concurrently. Test cases in the same group run sequentially.
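
As a rough illustration of the grouping idea, the sketch below distributes test case names across a fixed number of groups. The thread does not say how TiKV forms its groups, so the round-robin policy and the `split_into_groups` helper are assumptions made purely for illustration. Launching one process per group is left out.

```rust
/// Hypothetical helper: distributes test case names round-robin into
/// `groups` buckets. Each bucket would then be handed to its own process,
/// which runs the cases in that bucket one after another.
pub fn split_into_groups(cases: &[&str], groups: usize) -> Vec<Vec<String>> {
    assert!(groups > 0, "need at least one group");
    let mut out: Vec<Vec<String>> = vec![Vec::new(); groups];
    for (i, case) in cases.iter().enumerate() {
        out[i % groups].push(case.to_string());
    }
    out
}
```

Separate processes each get their own copy of the global failpoint registry, which is why groups can run concurrently while the cases inside one group still have to run sequentially.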

criccomini commented 3 weeks ago

@BusyJay I see #51 is still open. Two questions:

1. Any docs or pointers on how to split into groups?
2. Will this work for module unit tests in the same .rs file, not just integration tests in tests?

The reason I ask (2) is that we have some APIs we want to test but not expose publicly.

BusyJay commented 3 weeks ago

Any docs or pointers on how to split into groups?

No, it's just some scripts.

Will this work for module unit tests in the same .rs file, not just integration tests in tests?

Yes. We used to manually collect the built test binaries and list all available cases. Now we use cargo nextest to make the process more maintainable. You can find all the steps in https://do.pingcap.net/jenkins/blue/organizations/jenkins/tikv%2Ftikv%2Fpull_unit_test/detail/pull_unit_test/995/pipeline/71, specifically the build step and the test step.

criccomini commented 2 weeks ago

Any interest in taking a PR that allows users to pass a failpoint registry in rather than use the global?

criccomini commented 2 weeks ago

Here is our forked repo, for reference:

https://github.com/slatedb/fail-parallel

And here's the actual commit that adds registry parameter support:

https://github.com/slatedb/fail-parallel/commit/f7f020aceecf3eff6794a149ba7297de4010d1e3

Now users may supply a registry when they call the various fail_point/cfg/etc. methods. Doing so uses the supplied registry instead of the static REGISTRY.
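
The fallback behaviour described here can be sketched as follows. This is not the fork's actual code or signatures: the `Registry` type and the `failpoint_hit` function are hypothetical, and only the idea of "use the supplied registry, otherwise the static REGISTRY" is taken from the comment above.

```rust
use std::collections::BTreeSet;
use std::sync::Mutex;

/// Hypothetical registry type: a set of armed failpoint names.
pub struct Registry {
    armed: Mutex<BTreeSet<String>>,
}

impl Registry {
    pub const fn new() -> Self {
        Registry {
            armed: Mutex::new(BTreeSet::new()),
        }
    }

    /// Arms the named failpoint in this registry.
    pub fn enable(&self, name: &str) {
        self.armed.lock().unwrap().insert(name.to_string());
    }

    /// Returns true if the named failpoint is armed in this registry.
    pub fn is_enabled(&self, name: &str) -> bool {
        self.armed.lock().unwrap().contains(name)
    }
}

/// Process-wide default, standing in for the static REGISTRY.
static REGISTRY: Registry = Registry::new();

/// Checks the supplied registry if there is one, otherwise the static REGISTRY.
pub fn failpoint_hit(registry: Option<&Registry>, name: &str) -> bool {
    registry.unwrap_or(&REGISTRY).is_enabled(name)
}
```

With this shape, existing callers that pass no registry keep the global behaviour, while a test that passes its own registry is isolated from every other test.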

tikv / fail-rs

support using failpoints from unit tests without requiring serial test execution #79