strideynet commented 2 years ago

Our integration tests are occasionally taking longer than 25 minutes, which causes them to timeout in GCB. This is a detriment to developer productivity.

strideynet commented 2 years ago

Blockers to running tests in parallel

Port selection

If there are multiple Teleport instances running in parallel, we need to ensure that they all expose themselves on unique ports.

The cleanest way of us doing this is to configure them to listen on :0 and then extract the chosen port that the service is actually running on. TeleportProcess holds a registry of listeners, which makes this easier, and these can be extracted using the helpers e.g tp.AuthSSHAddr() which returns a *utils.NetAddr with the actual port that it is listening on. However, this does mean we need to be more careful with starting things that depend on one another in the right order, the way things currently are, all ports are known ahead of time so there may be some issues retrofitting this.

The other option is potentially to write a findFreePort() style function and use that when populating the port selection ahead of time. This isn't really a fit solution, because there's a period between that function being called, and the new listener starting, in which the port could be taken by another listener. This would make the tests extremely flaky.

Suite code (`s.log`) not concurrency safe

The suite binding code we have is not concurrency safe on the s.log field. If one test in the suite completes before the others, the value of s.log will become nil and cause panics.

I need to investigate if there is an ulterior motive here, but I see two options:

Set s.log in newKubeSuite, and share this one logger between all tests in the suite
Create a logger in bind and pass it into each test as a parameter separate from the suite.

func (s *KubeSuite) bind(test kubeIntegrationTest) func(t *testing.T) {
    return func(t *testing.T) {
        s.log = utils.NewLoggerForTests()
        os.RemoveAll(profile.FullProfilePath(""))
        t.Cleanup(func() { s.log = nil })
        test(t, s)
    }
}

func TestKube(t *testing.T) {
    suite := newKubeSuite(t)
    t.Run("Exec", suite.bind(testKubeExec))
    t.Run("Deny", suite.bind(testKubeDeny))
    t.Run("PortForward", suite.bind(testKubePortForward))
    t.Run("TransportProtocol", suite.bind(testKubeTransportProtocol))
    t.Run("TrustedClustersClientCert", suite.bind(testKubeTrustedClustersClientCert))
    t.Run("TrustedClustersSNI", suite.bind(testKubeTrustedClustersSNI))
    t.Run("Disconnect", suite.bind(testKubeDisconnect))
    t.Run("Join", suite.bind(testKubeJoin))
    t.Run("ConnectionLimit", suite.bind(testKubeConnectionLimit))
}

`t.Setenv` not compatible with `t.Parallel()`

t.Setenv cannot be used in parallel tests, however, the following tests use it:

testCustomReverseTunnel
testTwoClustersTunnel
testTwoClustersProxy
TestALPNSNIHTTPSProxy

For now, we can just ensure that these tests are not run in parallel, but we should eventually allow the configuration options to be provided programatically:

TELEPORT_TUNNEL_PUBLIC_ADDR
http_proxy
no_proxy

`lib.SetInsecureDevMode()` not concurrency safe

At least 21 tests make use of lib.SetInsecureDevMode(true). If a test completes, and it's defer to reset this is executed, then there is a window where the mode will be disabled and this could cause setup of another test to fail.

func TestALPNProxyAuthClientConnectWithUserIdentity(t *testing.T) {
    lib.SetInsecureDevMode(true)
    defer lib.SetInsecureDevMode(false)

We could potentially set this globally in the test suite in init() or TestMain assuming that no tests rely on InsecureDevMode being disabled.

`SetTestTimeouts` is not concurrency safe

8913

The SetTestTimeouts helper includes the following explicit warning:

// NOTE: This function modifies global values for timeouts, etc. If your tests
// call this function, they MUST NOT BE RUN IN PARALLEL, as they may stomp on
// other tests.

Some tests share the teleport home directory

t.Setenv(teleport.HomeEnvVar, t.TempDir())

Would usually fix this, but as noted previously, we can't use environment variables with parallel. We may need to allow this value to be configured programatically.

strideynet commented 2 years ago

The HSM integration tests look simpler, and more approachable, I would like to raise a PR first that changes the HSM integration tests to use the new :0 listener style as a proof of concept, even though this is unlikely to result in a significant speed up of the entire test suite as the HSM tests run in parallel to the main suite as they are in different packages.

zmb3 commented 2 years ago

Sounds like a good idea to me @strideynet. Thanks for the great summary!

tcsc commented 2 years ago

Thanks for having a look at this. It's been permanently on my list to things to fix. A few notes and/or things to watch out for:

SetTestTimeouts() is my fault, at least somewhat. This was always intended to be a stopgap measure to appease the race checker until we could do the correct thing, which is move those globals into the appropriate config structs and have the services use them those instead. In at least some cases, those config fields do already exist, the globals are just used as the default value.
Port selection is a trickier problem than you might imagine. There are a few subproblems:
1. As you mention, you need some way of extracting the concrete, bound port(s) once the service starts. Not particularly hard, just work that needs doing
2. Some tests require that you configure multiple services on the same port, which you obviously can't do by specifying :0 for everything. This is why the whole findFreePort() structure exists, even though the fact it actually works most of the time is more on luck than science. I did some experiments on a port allocation fixture that created a listener on :0 with SO_REUSEADDR (from memory, although it may have been SO_REUSEPORT) set, and then using the port that it actually bound to as the configured port for the Teleport test. In order to make this work we'd have to make Teleport (perhaps optionally) set the same flags to allow it to bind to the same ports when it starts up. Pretty much anything else I can think of results in a race condition on the available ports.
3. It's possible that these on the same port tests have been obsoleted by newer routing modes in Teleport. Worth checking out, because then most of these problems just go away.
The current situation exists, I think, because in the past we used a custom, per-test logger to hide or show the test output depending on whether the test passed or not. This was partially due to the way GoCheck handles output, and partially because parts of the code didn't always honour the logger it was given and just grabbed the standard logger at will instead. Since then,
1. we have moved tests to use the stdlib testing package rather than GoCheck,
2. @zmb3 (I think) has regularised a lot of the use of the configured logger, and
3. we have shifted to using a post-process to decide what test output to show, and so the whole notion of a specific logger for each test is now moot and can probably be ripped out wholesale.
I've never been 100% happy with the way that the test fixture works (the whole bind() thing). The main design influence was me trying to change as few things as possible during a colossal refactor. GoCheck imposed a very OO style of test fixture, (e.g. setup/teardown methods, each test a method on the suite, etc), and so we ended up with something that was logically very similar, even if slightly more idiomatic Go. Do not feel beholden to that design - it was only ever supposed to be a baby step towards something better.
SetInsecureDevMode() should go the same way as the test timeout parameters in SetTestTimeouts() - moved to config structs and the global flag deleted entirely.
Any test that touches a live ~/.tsh directory is, in my opinion, a bug that should be squashed rather than worked around.
Another, more radical approach to integration testing might be to shift from testing teleport as a library to running fully-fledged teleport processes from a compiled binary, which would get around a bunch of these issues (like fiddling environment vars, etc). Take a look at how the teleport/plugins integration tests work for inspiration. (Which is not to say that things like points 1 & 5 shouldn't be fixed as well, but it's an option).

tcsc commented 2 years ago

Apologies for the giant essay. Let's just say I have opinions on this :sweat_smile:.

strideynet commented 2 years ago

Hah, that's perfectly fine, and ideal to be honest. It's for the best that we got these thoughts about the integration tests off our chests (and formally written down somewhere) so we can start thinking about what we want to move towards.

I'll have a more detailed read tomorrow. I have a feeling I've somewhat stumbled into where the dragons lay with this, but I think formalising a strategy going forward for dealing with these tests will reap benefits for us all.

gravitational / teleport

Run integration tests in parallel #12421

Blockers to running tests in parallel

Port selection

Suite code (`s.log`) not concurrency safe

`t.Setenv` not compatible with `t.Parallel()`

`lib.SetInsecureDevMode()` not concurrency safe

`SetTestTimeouts` is not concurrency safe

8913

Some tests share the teleport home directory

gravitational / teleport

Run integration tests in parallel #12421

Blockers to running tests in parallel

Port selection

Suite code (s.log) not concurrency safe

t.Setenv not compatible with t.Parallel()

lib.SetInsecureDevMode() not concurrency safe

SetTestTimeouts is not concurrency safe

8913

Some tests share the teleport home directory

Suite code (`s.log`) not concurrency safe

`t.Setenv` not compatible with `t.Parallel()`

`lib.SetInsecureDevMode()` not concurrency safe

`SetTestTimeouts` is not concurrency safe