gravitational / teleport

The easiest, and most secure way to access and protect all of your infrastructure.
https://goteleport.com
GNU Affero General Public License v3.0

`TestIntegrations/Discovery` flakiness #26663

Open ravicious opened 1 year ago

ravicious commented 1 year ago

Failure

Link(s) to logs

Relevant snippet

    integration_test.go:3900: 
            Error Trace:    /__w/teleport/teleport/integration/integration_test.go:3900
                                        /__w/teleport/teleport/integration/integration_test.go:114
            Error:          Received unexpected error:
                            failed connecting to host 127.0.0.1:39505: failed to dial target host
                                unable to fetch access point, this proxy clusterPeer(TunnelConnection(name=2f33b91b-5b99-43e7-b35e-37c32bcfec48-cluster-remote, type=, cluster=cluster-remote, proxy=2f33b91b-5b99-43e7-b35e-37c32bcfec48)) has not been discovered yet, try again later
            Test:           TestIntegrations/Discovery
     integration_test.go:3664: 
            Error Trace:    /__w/teleport/teleport/integration/integration_test.go:3664
                                        /__w/teleport/teleport/integration/integration_test.go:109
            Error:          Received unexpected error:
                            timed out, only 0/2 events received. received: [], expected: [ProxyReverseTunnelReady ProxySSHReady]
            Test:           TestIntegrations/Discovery

zmb3 commented 1 year ago

This just blocked a v13 release: https://github.com/gravitational/teleport/actions/runs/5903184778/job/16012575863

    integration_test.go:4018: 
            Error Trace:    /__w/teleport/teleport/integration/integration_test.go:4018
                                        /__w/teleport/teleport/integration/integration_test.go:119
            Error:          Received unexpected error:
                            failed connecting to host 127.0.0.1:44391: failed to dial target host
                                unable to fetch access point, this proxy clusterPeer(TunnelConnection(name=19978ea4-1790-45e8-bb71-af0b41581184-cluster-remote, type=, cluster=cluster-remote, proxy=19978ea4-1790-45e8-bb71-af0b41581184)) has not been discovered yet, try again later
            Test:           TestIntegrations/Discovery

zmb3 commented 1 year ago

Another one on master: https://github.com/gravitational/teleport/actions/runs/6027048272/job/16351354131

    integration_test.go:4044: 
            Error Trace:    /__w/teleport/teleport/integration/integration_test.go:4044
                                        /__w/teleport/teleport/integration/integration_test.go:125
            Error:          Received unexpected error:
                            failed connecting to host 127.0.0.1:41815: failed to receive cluster details response
                                failed to dial target host
                                unable to fetch access point, this proxy clusterPeer(TunnelConnection(name=2f6a37a0-959d-474b-9a9b-325beb70a9fc-cluster-remote, type=, cluster=cluster-remote, proxy=2f6a37a0-959d-474b-9a9b-325beb70a9fc)) has not been discovered yet, try again later
            Test:           TestIntegrations/Discovery

zmb3 commented 1 year ago

v14: https://github.com/gravitational/teleport/actions/runs/6087350096/job/16515717921

GavinFrazar commented 1 year ago

Different failure on master:

https://github.com/gravitational/teleport/actions/runs/6227775190/job/16903131172?pr=31916

    integration_test.go:4048: 
            Error Trace:    /__w/teleport/teleport/integration/integration_test.go:4052
                                        /opt/go/src/runtime/asm_amd64.s:1650
            Error:          Expected value not to be nil.
    integration_test.go:4048: 
            Error Trace:    /__w/teleport/teleport/integration/integration_test.go:4048
                                        /__w/teleport/teleport/integration/integration_test.go:123
            Error:          Condition never satisfied
            Test:           TestIntegrations/Discovery

nklaassen commented 1 year ago

I got the same failure as Gavin on v14: https://github.com/gravitational/teleport/actions/runs/6422952356/job/17440476992

zmb3 commented 1 year ago

https://github.com/gravitational/teleport/actions/runs/6565574843/job/17834398083

    integration_test.go:4187: 
            Error Trace:    /__w/teleport/teleport/integration/integration_test.go:6166
                                        /opt/go/src/runtime/asm_amd64.s:1650
            Error:          Expected value not to be nil.
    integration_test.go:4187: 
            Error Trace:    /__w/teleport/teleport/integration/integration_test.go:6162
                                        /__w/teleport/teleport/integration/integration_test.go:4187
                                        /__w/teleport/teleport/integration/integration_test.go:123
            Error:          Condition never satisfied
            Test:           TestIntegrations/Discovery

rosstimothy commented 1 year ago

https://github.com/gravitational/teleport/actions/runs/6579308041/job/17874902654

    integration_test.go:4187: 
            Error Trace:    /__w/teleport/teleport/integration/integration_test.go:6166
                                        /opt/go/src/runtime/asm_amd64.s:1650
            Error:          Expected value not to be nil.
    integration_test.go:4187: 
            Error Trace:    /__w/teleport/teleport/integration/integration_test.go:6162
                                        /__w/teleport/teleport/integration/integration_test.go:4187
                                        /__w/teleport/teleport/integration/integration_test.go:123
            Error:          Condition never satisfied
            Test:           TestIntegrations/Discovery

zmb3 commented 1 year ago

https://github.com/gravitational/teleport/actions/runs/6615009795/job/17966244734

ptgott commented 1 year ago

https://github.com/gravitational/teleport/actions/runs/6829295469/job/18575152180#step:6:2694

rosstimothy commented 1 year ago

branch/v14:

    integration_test.go:4267: 
            Error Trace:    /__w/teleport/teleport/integration/integration_test.go:6248
                                        /opt/go/src/runtime/asm_amd64.s:1650
            Error:          Expected value not to be nil.
    integration_test.go:4267: 
            Error Trace:    /__w/teleport/teleport/integration/integration_test.go:6244
                                        /__w/teleport/teleport/integration/integration_test.go:4267
                                        /__w/teleport/teleport/integration/integration_test.go:126
            Error:          Condition never satisfied
            Test:           TestIntegrations/Discovery

ravicious commented 1 year ago

I ran into a new kind of error on branch/v14:

https://github.com/gravitational/teleport/actions/runs/6960808427/job/18941067310#step:6:2122

    integration_test.go:4267: 
            Error Trace:    /__w/teleport/teleport/integration/integration_test.go:6254
                                        /opt/go/src/runtime/asm_amd64.s:1650
            Error:          Received unexpected error:
                            connection error: desc = "transport: Error while dialing: failed to dial: cluster cluster-remote is offline"
    integration_test.go:4267: 
            Error Trace:    /__w/teleport/teleport/integration/integration_test.go:6244
                                        /__w/teleport/teleport/integration/integration_test.go:4267
                                        /__w/teleport/teleport/integration/integration_test.go:126
            Error:          Condition never satisfied
            Test:           TestIntegrations/Discovery

rosstimothy commented 12 months ago

Failure from a PR rebased onto the tip of master:

=== FAIL: integration TestIntegrations/Discovery (20.69s)
    integration_test.go:4144: 
            Error Trace:    /__w/teleport/teleport/integration/integration_test.go:4144
                                        /__w/teleport/teleport/integration/integration_test.go:124
            Error:          Received unexpected error:
                            failed connecting to host 127.0.0.1:45901: failed to receive cluster details response
                                failed to dial target host
                                direct dialing to nodes not found in inventory is not supported
            Test:           TestIntegrations/Discovery
    --- FAIL: TestIntegrations/Discovery (20.69s)

=== FAIL: integration TestIntegrations (719.38s)

https://github.com/gravitational/teleport/actions/runs/7020003123/job/19098756968?pr=34457

ravicious commented 12 months ago

Error on a branch based on branch/v14 that includes the backport of the recent fix (#34996).

https://github.com/gravitational/teleport/actions/runs/7031231504/job/19132398108#step:6:3561

    integration_test.go:4233: 
            Error Trace:    /__w/teleport/teleport/integration/integration_test.go:4233
                                        /__w/teleport/teleport/integration/integration_test.go:126
            Error:          Received unexpected error:
                            failed connecting to host 127.0.0.1:34309: failed to receive cluster details response
                                failed to dial target host
                                unable to fetch node watcher, this proxy clusterPeer(TunnelConnection(name=00000000-0000-0000-0000-000000000000-cluster-remote, type=, cluster=cluster-remote, proxy=00000000-0000-0000-0000-000000000000)) has not been discovered yet, try again later
            Test:           TestIntegrations/Discovery

nklaassen commented 11 months ago

Another in the merge queue to master: https://github.com/gravitational/teleport/actions/runs/7090182023/job/19296583245

=== RUN   TestIntegrations/Discovery
    integration_test.go:4172: 
            Error Trace:    /__w/teleport/teleport/integration/integration_test.go:6155
                                        /opt/go/src/runtime/asm_amd64.s:1650
            Error:          Expected value not to be nil.
    integration_test.go:4172: 
            Error Trace:    /__w/teleport/teleport/integration/integration_test.go:6151
                                        /__w/teleport/teleport/integration/integration_test.go:4172
                                        /__w/teleport/teleport/integration/integration_test.go:126
            Error:          Condition never satisfied
            Test:           TestIntegrations/Discovery
    --- FAIL: TestIntegrations/Discovery (50.00s)

zmb3 commented 11 months ago

https://github.com/gravitational/teleport/actions/runs/7093375371/job/19306644012

zmb3 commented 11 months ago

https://github.com/gravitational/teleport/actions/runs/7115810275/job/19372821232

ravicious commented 11 months ago

https://github.com/gravitational/teleport/actions/runs/7115810275/job/19372821232#step:6:1122

    integration_test.go:4146: 
            Error Trace:    /__w/teleport/teleport/integration/integration_test.go:4146
                                        /__w/teleport/teleport/integration/integration_test.go:126
            Error:          Received unexpected error:
                            failed connecting to host 127.0.0.1:34889: failed to receive cluster details response
                                failed to dial target host
                                direct dialing to nodes not found in inventory is not supported
            Test:           TestIntegrations/Discovery

ravicious commented 11 months ago

v13

https://github.com/gravitational/teleport/actions/runs/7127815036/job/19408522862#step:6:1909

     integration_test.go:4121: 
            Error Trace:    /__w/teleport/teleport/integration/integration_test.go:4121
                                        /__w/teleport/teleport/integration/integration_test.go:123
            Error:          Received unexpected error:
                            failed connecting to host 127.0.0.1:36773: failed to receive cluster details response
                                failed to dial target host
                                unable to fetch node watcher, this proxy clusterPeer(TunnelConnection(name=00000000-0000-0000-0000-000000000000-cluster-remote, type=, cluster=cluster-remote, proxy=00000000-0000-0000-0000-000000000000)) has not been discovered yet, try again later
            Test:           TestIntegrations/Discovery

zmb3 commented 10 months ago

This is an issue again. Fresh hits on master:

=== Failed
=== FAIL: integration TestIntegrations/Discovery (47.02s)
    trustedclusters.go:148: 
            Error Trace:    /__w/teleport/teleport/integration/helpers/trustedclusters.go:159
                                        /opt/go/src/runtime/asm_amd64.s:1650
            Error:          Received unexpected error:
                            unable to fetch client, this proxy clusterPeer(TunnelConnection(name=63cc50e2-09e3-4d28-81b2-31e0542e0311-cluster-remote, type=, cluster=cluster-remote, proxy=63cc50e2-09e3-4d28-81b2-31e0542e0311)) has not been discovered yet, try again later
            Messages:       cluster not yet available
    trustedclusters.go:148: 
            Error Trace:    /__w/teleport/teleport/integration/helpers/trustedclusters.go:148
                                        /__w/teleport/teleport/integration/integration_test.go:4172
                                        /__w/teleport/teleport/integration/integration_test.go:126
            Error:          Condition never satisfied
            Test:           TestIntegrations/Discovery
            Messages:       Active tunnel connections did not reach 1 in the expected time frame 30s
    --- FAIL: TestIntegrations/Discovery (47.02s)

=== FAIL: integration TestIntegrations (763.62s)

zmb3 commented 9 months ago

Another on master, which is preventing us from getting an important build fix out.

https://github.com/gravitational/teleport/actions/runs/7718393459/job/21039553375?pr=37568

fheinecke commented 9 months ago

This failed repeatedly for the v15.0.2 release, delaying it by a couple of hours.

v14 release as well: https://github.com/gravitational/teleport/actions/runs/7937403150/job/21674512929

espadolini commented 9 months ago

I'll prioritize fixing this next week.

zmb3 commented 9 months ago

Thanks a lot, @espadolini 🙏

espadolini commented 9 months ago

Extending the timeout in WaitForActiveTunnelConnections and logging some timestamps revealed that it can take longer than 30 seconds for tunnels to appear. How long it takes depends on how many attempts the remote cluster agent pool makes before the expected reverse tunnel server is available, and on how unlucky we get with the attempts made once it is available.
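As a rough illustration of that kind of instrumentation, here is a minimal Go sketch of a polling wait that reports how long the condition took to become true. `waitFor` and `countingCond` are hypothetical names made up for this example; this is not the actual WaitForActiveTunnelConnections helper.

```go
package main

import (
	"fmt"
	"time"
)

// waitFor polls cond every interval until it returns true or the timeout
// elapses. It returns the time spent waiting and whether cond succeeded.
// Hypothetical helper, for illustration only.
func waitFor(cond func() bool, timeout, interval time.Duration) (time.Duration, bool) {
	start := time.Now()
	for {
		if cond() {
			return time.Since(start), true
		}
		if time.Since(start) >= timeout {
			return time.Since(start), false
		}
		time.Sleep(interval)
	}
}

// countingCond returns a condition that becomes true on its n-th call,
// standing in for "the expected number of tunnels is active".
func countingCond(n int) func() bool {
	calls := 0
	return func() bool {
		calls++
		return calls >= n
	}
}

func main() {
	elapsed, ok := waitFor(countingCond(3), time.Second, 10*time.Millisecond)
	// Logging the elapsed time shows how close a run came to the timeout.
	fmt.Printf("condition met: %v after %v\n", ok, elapsed)
}
```

Recording the elapsed time on every run, not only on failures, is what makes it possible to tell a hard failure apart from a wait that was merely too short.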

This is intended behavior, since we want reconnection attempts to gradually back off after a small initial burst. However, it's very undesirable in tests.
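To see how a backoff of that shape can blow through a 30-second test timeout, here is a small Go sketch. The burst size, initial delay, and cap are made-up numbers chosen for illustration, not Teleport's actual agent pool settings, and `backoffSchedule`, `total`, and `exampleSchedule` are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// backoffSchedule returns n waits, where element i is the wait before
// attempt i+1. The first `burst` waits use the initial delay; after that
// the delay doubles on every attempt until it reaches max.
func backoffSchedule(n, burst int, initial, max time.Duration) []time.Duration {
	delays := make([]time.Duration, 0, n)
	delay := initial
	for i := 0; i < n; i++ {
		delays = append(delays, delay)
		if i+1 >= burst {
			delay *= 2
			if delay > max {
				delay = max
			}
		}
	}
	return delays
}

// total sums a list of waits.
func total(delays []time.Duration) time.Duration {
	var sum time.Duration
	for _, d := range delays {
		sum += d
	}
	return sum
}

// exampleSchedule uses assumed values: 8 attempts, a burst of 3,
// a 1s initial delay, and a 16s cap.
func exampleSchedule() []time.Duration {
	return backoffSchedule(8, 3, time.Second, 16*time.Second)
}

func main() {
	delays := exampleSchedule()
	for i := range delays {
		// Print when each attempt fires, measured from the start.
		fmt.Printf("attempt %d at %v\n", i+1, total(delays[:i+1]))
	}
}
```

With these assumed values the waits are 1s, 1s, 1s, 2s, 4s, 8s, 16s, and 16s. The sixth attempt fires at 17 seconds and the seventh not until 33 seconds, so a server that becomes available just after the sixth attempt is not reached within 30 seconds.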

With a very hacky change (3a51a830a0dc0cd3e6ad816066f8048b33be6876) I haven't seen a WaitForActiveTunnelConnections call take more than 0.5 seconds(!), but CPU usage during the test can spike quite a bit (I've seen agent lease IDs above 50).
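The trade-off can be sketched by counting how many connection attempts fit into a fixed window under two policies. Everything here is an assumption for illustration: `retryPolicy`, its fields, and both sets of numbers are hypothetical and do not reflect Teleport's code or the change in the commit above.

```go
package main

import (
	"fmt"
	"time"
)

// retryPolicy is a hypothetical pair of knobs for a doubling backoff.
type retryPolicy struct {
	First time.Duration // wait after the first attempt
	Max   time.Duration // cap on the wait between attempts
}

// next returns the wait that follows prev, doubling up to the cap.
func (p retryPolicy) next(prev time.Duration) time.Duration {
	if prev <= 0 {
		return p.First
	}
	if d := prev * 2; d < p.Max {
		return d
	}
	return p.Max
}

// attemptsWithin counts the attempts that start before window elapses,
// with the first attempt at time zero.
func attemptsWithin(p retryPolicy, window time.Duration) int {
	count := 0
	var elapsed, wait time.Duration
	for elapsed < window {
		count++
		wait = p.next(wait)
		elapsed += wait
	}
	return count
}

// window matches the 30s wait seen in the failing test output.
const window = 30 * time.Second

// slowPolicy and fastPolicy use assumed, illustrative values.
func slowPolicy() retryPolicy {
	return retryPolicy{First: time.Second, Max: 16 * time.Second}
}

func fastPolicy() retryPolicy {
	return retryPolicy{First: 100 * time.Millisecond, Max: 500 * time.Millisecond}
}

func main() {
	fmt.Println("slow policy attempts:", attemptsWithin(slowPolicy(), window))
	fmt.Println("fast policy attempts:", attemptsWithin(fastPolicy(), window))
}
```

With these assumed numbers the slow policy makes 5 attempts in 30 seconds and the fast one makes 62. The fast policy never waits more than half a second between attempts, but it does roughly twelve times the work, which is the kind of cost that would show up as higher CPU usage.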

I'm not entirely sure what the least bad way is to make the agent pools more "aggressive" in our tests, and I'm not even sure we truly want to in integration tests. Perhaps we should change Teleport itself to reconnect slightly more often even in actual builds, at least for the remote cluster agent pool?