Open strideynet opened 2 years ago
If there are multiple Teleport instances running in parallel, we need to ensure that they all expose themselves on unique ports.
The cleanest way of us doing this is to configure them to listen on :0
and then extract the chosen port that the service is actually running on. TeleportProcess
holds a registry of listeners, which makes this easier, and these can be extracted using the helpers e.g tp.AuthSSHAddr()
which returns a *utils.NetAddr
with the actual port that it is listening on. However, this does mean we need to be more careful with starting things that depend on one another in the right order, the way things currently are, all ports are known ahead of time so there may be some issues retrofitting this.
The other option is potentially to write a findFreePort()
style function and use that when populating the port selection ahead of time. This isn't really a fit solution, because there's a period between that function being called, and the new listener starting, in which the port could be taken by another listener. This would make the tests extremely flaky.
s.log
) not concurrency safeThe suite binding code we have is not concurrency safe on the s.log
field. If one test in the suite completes before the others, the value of s.log
will become nil and cause panics.
I need to investigate if there is an ulterior motive here, but I see two options:
newKubeSuite
, and share this one logger between all tests in the suitebind
and pass it into each test as a parameter separate from the suite.func (s *KubeSuite) bind(test kubeIntegrationTest) func(t *testing.T) {
return func(t *testing.T) {
s.log = utils.NewLoggerForTests()
os.RemoveAll(profile.FullProfilePath(""))
t.Cleanup(func() { s.log = nil })
test(t, s)
}
}
func TestKube(t *testing.T) {
suite := newKubeSuite(t)
t.Run("Exec", suite.bind(testKubeExec))
t.Run("Deny", suite.bind(testKubeDeny))
t.Run("PortForward", suite.bind(testKubePortForward))
t.Run("TransportProtocol", suite.bind(testKubeTransportProtocol))
t.Run("TrustedClustersClientCert", suite.bind(testKubeTrustedClustersClientCert))
t.Run("TrustedClustersSNI", suite.bind(testKubeTrustedClustersSNI))
t.Run("Disconnect", suite.bind(testKubeDisconnect))
t.Run("Join", suite.bind(testKubeJoin))
t.Run("ConnectionLimit", suite.bind(testKubeConnectionLimit))
}
t.Setenv
not compatible with t.Parallel()
t.Setenv
cannot be used in parallel tests, however, the following tests use it:
testCustomReverseTunnel
testTwoClustersTunnel
testTwoClustersProxy
TestALPNSNIHTTPSProxy
For now, we can just ensure that these tests are not run in parallel, but we should eventually allow the configuration options to be provided programatically:
lib.SetInsecureDevMode()
not concurrency safeAt least 21 tests make use of lib.SetInsecureDevMode(true)
. If a test completes, and it's defer to reset this is executed, then there is a window where the mode will be disabled and this could cause setup of another test to fail.
func TestALPNProxyAuthClientConnectWithUserIdentity(t *testing.T) {
lib.SetInsecureDevMode(true)
defer lib.SetInsecureDevMode(false)
We could potentially set this globally in the test suite in init()
or TestMain
assuming that no tests rely on InsecureDevMode being disabled.
SetTestTimeouts
is not concurrency safeThe SetTestTimeouts
helper includes the following explicit warning:
// NOTE: This function modifies global values for timeouts, etc. If your tests
// call this function, they MUST NOT BE RUN IN PARALLEL, as they may stomp on
// other tests.
t.Setenv(teleport.HomeEnvVar, t.TempDir())
Would usually fix this, but as noted previously, we can't use environment variables with parallel. We may need to allow this value to be configured programatically.
The HSM integration tests look simpler, and more approachable, I would like to raise a PR first that changes the HSM integration tests to use the new :0
listener style as a proof of concept, even though this is unlikely to result in a significant speed up of the entire test suite as the HSM tests run in parallel to the main suite as they are in different packages.
Sounds like a good idea to me @strideynet. Thanks for the great summary!
Thanks for having a look at this. It's been permanently on my list to things to fix. A few notes and/or things to watch out for:
SetTestTimeouts() is my fault, at least somewhat. This was always intended to be a stopgap measure to appease the race checker until we could do the correct thing, which is move those globals into the appropriate config structs and have the services use them those instead. In at least some cases, those config fields do already exist, the globals are just used as the default value.
Port selection is a trickier problem than you might imagine. There are a few subproblems:
:0
for everything. This is why the whole findFreePort()
structure exists, even though the fact it actually works most of the time is more on luck than science. I did some experiments on a port allocation fixture that created a listener on :0
with SO_REUSEADDR
(from memory, although it may have been SO_REUSEPORT
) set, and then using the port that it actually bound to as the configured port for the Teleport test. In order to make this work we'd have to make Teleport (perhaps optionally) set the same flags to allow it to bind to the same ports when it starts up. Pretty much anything else I can think of results in a race condition on the available ports.The current situation exists, I think, because in the past we used a custom, per-test logger to hide or show the test output depending on whether the test passed or not. This was partially due to the way GoCheck handles output, and partially because parts of the code didn't always honour the logger it was given and just grabbed the standard logger at will instead. Since then,
testing
package rather than GoCheck,I've never been 100% happy with the way that the test fixture works (the whole bind()
thing). The main design influence was me trying to change as few things as possible during a colossal refactor. GoCheck imposed a very OO style of test fixture, (e.g. setup/teardown methods, each test a method on the suite, etc), and so we ended up with something that was logically very similar, even if slightly more idiomatic Go. Do not feel beholden to that design - it was only ever supposed to be a baby step towards something better.
SetInsecureDevMode()
should go the same way as the test timeout parameters in SetTestTimeouts()
- moved to config structs and the global flag deleted entirely.
Any test that touches a live ~/.tsh
directory is, in my opinion, a bug that should be squashed rather than worked around.
Another, more radical approach to integration testing might be to shift from testing teleport as a library to running fully-fledged teleport processes from a compiled binary, which would get around a bunch of these issues (like fiddling environment vars, etc). Take a look at how the teleport/plugins
integration tests work for inspiration. (Which is not to say that things like points 1 & 5 shouldn't be fixed as well, but it's an option).
Apologies for the giant essay. Let's just say I have opinions on this :sweat_smile:.
Hah, that's perfectly fine, and ideal to be honest. It's for the best that we got these thoughts about the integration tests off our chests (and formally written down somewhere) so we can start thinking about what we want to move towards.
I'll have a more detailed read tomorrow. I have a feeling I've somewhat stumbled into where the dragons lay with this, but I think formalising a strategy going forward for dealing with these tests will reap benefits for us all.
Our integration tests are occasionally taking longer than 25 minutes, which causes them to timeout in GCB. This is a detriment to developer productivity.