Interop test runner: Use exit code of dialer as source of truth

thomaseizinger commented 1 year ago

Relevant discussion: https://github.com/libp2p/test-plans/pull/114#discussion_r1085989316

Basic idea:

On success, the listener loops with a timeout (1min?) and waits to be terminated by docker compose
On success, the dialer exits with 0
On failure the dialer exits with 1
On failure, the listener exits with 1

This should simplify the following code snippets because we can use the exit code of the dialer as the source of truth:

https://github.com/libp2p/test-plans/blob/5c19f0876ffddab67c84f244556e633ac5a85fee/multidim-interop/src/compose-runner.ts#L45-L66

MarcoPolo commented 1 year ago

We should still check the ping duration so that if an implementation silently fails it doesn't get ignored. Much harder to accidentally fake a ping duration.

thomaseizinger commented 1 year ago

What do you mean silently fails? The test binary decides whether or not it is successful. If we start putting this logic into the test runner, things will get very messy as we add more tests.

If we expect the ping duration to be within a certain range, I'd suggest that the test itself asserts that and exits with 1 or 0 respectively. This is consistent with how other test frameworks & runners work.

It might get a bit tricky once we add a delay to the network communication but even then I think it would be better to pass that information into the test and derive a correct assertion from there rather than asserting things within the test runner.

marten-seemann commented 1 year ago

What do you mean silently fails? The test binary decides whether or not it is successful.

It's relatively easy to mess up the exit code, depending on your language. It would be really bad if an implementation is relying on the interop tests passing, just to notice that this was during an incorrectly set exit code. Checking the RTT would be more robust.

If we expect the ping duration to be within a certain range, I'd suggest that the test itself asserts that and exits with 1 or 0 respectively. This is consistent with how other test frameworks & runners work.

That works (only!) for the ping RTT. For the handshake RTT test, the implementation would not how many RTTs to expect. In this case as well it would be more robust to do that check in the runner and not in the test itself. This is how the QUIC interop runner works by the way.

thomaseizinger commented 1 year ago

What do you mean silently fails? The test binary decides whether or not it is successful.

It's relatively easy to mess up the exit code, depending on your language. It would be really bad if an implementation is relying on the interop tests passing, just to notice that this was during an incorrectly set exit code. Checking the RTT would be more robust.

Really? A test with bad assertions is useless, I agree but I am not sure we should be making design decisions based on that. I'd rather try and push for making it super easy to write these tests, with minimal boilerplate such that it is easy to review and we can check these problems. Also, wouldn't it always need two implementations to be buggy? Assuming we ran all implementations in all combinations, it seems unlikely that one buggy one slips through because it is asserting the wrong thing AND has a bug.

If we expect the ping duration to be within a certain range, I'd suggest that the test itself asserts that and exits with 1 or 0 respectively. This is consistent with how other test frameworks & runners work.

That works (only!) for the ping RTT. For the handshake RTT test, the implementation would not how many RTTs to expect. In this case as well it would be more robust to do that check in the runner and not in the test itself. This is how the QUIC interop runner works by the way.

What we will have to work out is the scope of the tests and of the test runner. For a QUIC interop test runner, it makes perfect sense to know about RTT time because the entire scope of the test runner is about QUIC connections. Ping is one of many protocols in the libp2p space. I think in an ideal world, our test runner shouldn't know about individual protocols. Measuring and exposing handshake RTT makes sense to me because every test will be expected to do a handshake. I don't see every test to run the ping protocol.

MarcoPolo commented 1 year ago

Also, wouldn't it always need two implementations to be buggy?

No. Only the dialer needs to return a false positive (needs to return a bad exit code of 0).

This isn't hypothetical. Glen's original PR that introduced the js-libp2p tests had a similar subtle bug (https://github.com/libp2p/test-plans/pull/98/files#diff-0b6202f38d6093a4c51465493b638fc8e1acadac2384bd814968202bdd994b11). Can you find it?

This doesn't complicate the test at all. You simply have to print some data to stdout in json format. You can even manually build the JSON string for all I care. We shouldn't bike shed this. I'll write up what this test should do in a small doc tomorrow. Basically the ping protocol and print two results in json format ping RTT and total RTT from before connect to after ping.

What we will have to work out is the scope of the tests and of the test runner.

As little as possible. The only reason I want to put RTT in here is because it's a superset of the ping test. Any other kind of test, will be a separate thing.

I think in an ideal world, our test runner shouldn't know about individual protocols.

Ping is often the first protocol an implementation will support and test. It's easy to implement and easy to understand. It adds one round trip after connnection setup, so it makes calculating the handshake RTT easy.

Let's try not to bikeshed this please. It's really not a problem to read the stdout from the dialer and make sure the ping result is reasonable as well as checking the exit code. And if the exit code is non-zero, then that's a straightforward failure (the current code should be changed).

thomaseizinger commented 1 year ago

This doesn't complicate the test at all.

I am not too worried about the complexity of the test but of the test runner. Unless every test will run the ping protocol, how will the test runner know what to scan the output for? Unless I am missing something, it is the test runner's job to compute the matrix of tests binaries to be run and start them up but it is ignorant over which particular test it is running.

I am happy to stick to one exception for ping but what I am worried about is the following ending up in the test runner:

if (testname == "ping) {
    // scan output for ping duration
}

if (testname == "kademlia-put") {
    // scan output for something kademlia specific
}

// etc

thomaseizinger commented 1 year ago

The above assumes that we will extend these interop tests by building multiple different binaries and containers, each one with a particular test goal in mind and we always just feed a list of containers to the test runner to execute. Side note: That is also where this idea came from. As soon as we write another test, it can no longer just be versions.ts but the GitHub action needs to configure, which containers should be started.

Perhaps I am making a wrong assumption somewhere here and you are planning a different approach of extending these tests?

MarcoPolo commented 1 year ago

Maybe there's a misunderstanding?

I'm explicitly not trying to make this a generic runner for every test we want. The RTT counter is a special case because it's a superset of the ping test.

The kademlia interop test is different because we don't care what transport/muxer/secure channel that runs on. Therefore it should simply be a different test. It'll be easier to understand and iterate on, even if we duplicate a little bit of code.

thomaseizinger commented 1 year ago

Definitely a misunderstanding then, I thought we were going to use this harness for more tests!

There is probably still a way to do that but we can design that with the next test :)

libp2p / test-plans

Interop test runner: Use exit code of dialer as source of truth #115