gravitational / teleport

The easiest, and most secure way to access and protect all of your infrastructure.
https://goteleport.com
GNU Affero General Public License v3.0

`TestKube/EphemeralContainers` flakiness #40969

Closed r0mant closed 2 months ago

r0mant commented 4 months ago

Failed in the merge queue: https://github.com/gravitational/teleport/actions/runs/8854528809/job/24317760696.

    kube_integration_test.go:1539: 
            Error Trace:    /__w/teleport/teleport/integration/kube_integration_test.go:1539
                                        /__w/teleport/teleport/integration/kube_integration_test.go:168
            Error:          Received unexpected error:
                            pods "test-pod" is forbidden: User "ci" cannot create resource "pods/attach" in API group "" in the namespace "teletest"
            Test:           TestKube/EphemeralContainers

rosstimothy commented 3 months ago

https://github.com/gravitational/teleport/actions/runs/8929622159/job/24527872383

    kube_integration_test.go:1539: 
            Error Trace:    /__w/teleport/teleport/integration/kube_integration_test.go:1539
                                        /__w/teleport/teleport/integration/kube_integration_test.go:168
            Error:          Received unexpected error:
                            Internal error occurred: error executing command in container: unable to upgrade connection: container ephemeral-container not found in pod test-pod_teletest
            Test:           TestKube/EphemeralContainers

rosstimothy commented 3 months ago

Another hit:

https://github.com/gravitational/teleport/actions/runs/8975349618/job/24649749867

    kube_integration_test.go:1539: 
            Error Trace:    /__w/teleport/teleport/integration/kube_integration_test.go:1539
                                        /__w/teleport/teleport/integration/kube_integration_test.go:168
            Error:          Received unexpected error:
                            Internal error occurred: error executing command in container: unable to upgrade connection: container ephemeral-container not found in pod test-pod_teletest
            Test:           TestKube/EphemeralContainers

GavinFrazar commented 3 months ago

https://github.com/gravitational/teleport/actions/runs/8993318798/job/24704894991?pr=41129

            Error Trace:    /__w/teleport/teleport/integration/kube_integration_test.go:1539
                                        /__w/teleport/teleport/integration/kube_integration_test.go:168
            Error:          Received unexpected error:
                            Internal error occurred: error executing command in container: unable to upgrade connection: container ephemeral-container not found in pod test-pod_teletest
            Test:           TestKube/EphemeralContainers

GavinFrazar commented 3 months ago

https://github.com/gravitational/teleport/actions/runs/9012003224/job/24760488512?pr=41351

---
Connection closed: unable to upgrade connection: container ephemeral-container not found in pod test-pod_teletest
{"caller":"streamproto/proto.go:173","component":null,"error":"websocket: close sent","level":"warning","message":"Failed to read message from websocket","timestamp":"2024-05-09T03:44:41Z"}
    kube_integration_test.go:1539: 
            Error Trace:    /__w/teleport/teleport/integration/kube_integration_test.go:1539
                                        /__w/teleport/teleport/integration/kube_integration_test.go:168
            Error:          Received unexpected error:
                            Internal error occurred: error executing command in container: unable to upgrade connection: container ephemeral-container not found in pod test-pod_teletest
            Test:           TestKube/EphemeralContainers

rosstimothy commented 3 months ago

I've also hit a data race in this test:

https://github.com/gravitational/teleport/actions/runs/9082679539/job/24959638676

==================
WARNING: DATA RACE
Write at 0x00c008997f68 by goroutine 76390:
  bufio.(*Writer).Write()
      /opt/go/src/bufio/bufio.go:695 +0x385
  golang.org/x/net/http2.(*responseWriter).write()
      /go/pkg/mod/golang.org/x/net@v0.24.0/http2/server.go:2980 +0x20a
  golang.org/x/net/http2.(*responseWriter).Write()
      /go/pkg/mod/golang.org/x/net@v0.24.0/http2/server.go:2954 +0x53
  go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.(*respWriterWrapper).Write()
      /go/pkg/mod/go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.49.0/wrap.go:81 +0xdb
  go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.(*respWriterWrapper).Write-fm()
      <autogenerated>:1 +0x55
  github.com/felixge/httpsnoop.(*rw).Write()
      /go/pkg/mod/github.com/felixge/httpsnoop@v1.0.4/wrap_generated_gteq_1.8.go:380 +0x124
  go:(*struct { github.com/felixge/httpsnoop.Unwrapper; net/http.ResponseWriter; net/http.Flusher; net/http.CloseNotifier; net/http.Pusher }).Write()
      <autogenerated>:1 +0x74
  go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.(*respWriterWrapper).Write()
      /go/pkg/mod/go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.49.0/wrap.go:81 +0xdb
  go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.(*respWriterWrapper).Write-fm()
      <autogenerated>:1 +0x55
  github.com/felixge/httpsnoop.(*rw).Write()
      /go/pkg/mod/github.com/felixge/httpsnoop@v1.0.4/wrap_generated_gteq_1.8.go:380 +0x124
  go:(*struct { github.com/felixge/httpsnoop.Unwrapper; net/http.ResponseWriter; net/http.Flusher; net/http.CloseNotifier; net/http.Pusher }).Write()
      <autogenerated>:1 +0x74
  github.com/prometheus/client_golang/prometheus/promhttp.(*responseWriterDelegator).Write()
      /go/pkg/mod/github.com/prometheus/client_golang@v1.19.0/prometheus/promhttp/delegator.go:74 +0x85
  go:(*struct { *github.com/prometheus/client_golang/prometheus/promhttp.responseWriterDelegator; net/http.Pusher; net/http.Flusher; net/http.CloseNotifier }).Write()
      <autogenerated>:1 +0x64
  github.com/prometheus/client_golang/prometheus/promhttp.(*responseWriterDelegator).Write()
      /go/pkg/mod/github.com/prometheus/client_golang@v1.19.0/prometheus/promhttp/delegator.go:74 +0x85
  go:(*struct { *github.com/prometheus/client_golang/prometheus/promhttp.responseWriterDelegator; net/http.Pusher; net/http.Flusher; net/http.CloseNotifier }).Write()
      <autogenerated>:1 +0x64
  k8s.io/apimachinery/pkg/runtime/serializer/streaming.(*encoder).Encode()
      /go/pkg/mod/k8s.io/apimachinery@v0.29.0/pkg/runtime/serializer/streaming/streaming.go:133 +0x189
  k8s.io/client-go/rest/watch.(*Encoder).Encode()
      /go/pkg/mod/k8s.io/client-go@v0.29.0/rest/watch/encoder.go:52 +0x1ab
  github.com/gravitational/teleport/lib/kube/proxy/responsewriters.(*WatcherResponseWriter).watchDecoder.func1()
      /__w/teleport/teleport/lib/kube/proxy/responsewriters/watcher.go:252 +0xd9
  github.com/gravitational/teleport/lib/kube/proxy/responsewriters.(*WatcherResponseWriter).watchDecoder.func2()
      /__w/teleport/teleport/lib/kube/proxy/responsewriters/watcher.go:274 +0x121
  golang.org/x/sync/errgroup.(*Group).Go.func1()
      /go/pkg/mod/golang.org/x/sync@v0.7.0/errgroup/errgroup.go:78 +0x97

Previous read at 0x00c008997f68 by goroutine 76388:
  bufio.(*Writer).Buffered()
      /opt/go/src/bufio/bufio.go:670 +0x7a
  golang.org/x/net/http2.(*responseWriter).FlushError()
      /go/pkg/mod/golang.org/x/net@v0.24.0/http2/server.go:2823 +0x4e
  golang.org/x/net/http2.(*responseWriter).Flush()
      /go/pkg/mod/golang.org/x/net@v0.24.0/http2/server.go:2814 +0x26
  net/http.Flusher.Flush-fm()
      <autogenerated>:1 +0x42
  github.com/felixge/httpsnoop.(*rw).Flush()
      /go/pkg/mod/github.com/felixge/httpsnoop@v1.0.4/wrap_generated_gteq_1.8.go:388 +0x10f
  go:(*struct { github.com/felixge/httpsnoop.Unwrapper; net/http.ResponseWriter; net/http.Flusher; net/http.CloseNotifier; net/http.Pusher }).Flush()
      <autogenerated>:1 +0x4b
  net/http.Flusher.Flush-fm()
      <autogenerated>:1 +0x42
  github.com/felixge/httpsnoop.(*rw).Flush()
      /go/pkg/mod/github.com/felixge/httpsnoop@v1.0.4/wrap_generated_gteq_1.8.go:388 +0x10f
  go:(*struct { github.com/felixge/httpsnoop.Unwrapper; net/http.ResponseWriter; net/http.Flusher; net/http.CloseNotifier; net/http.Pusher }).Flush()
      <autogenerated>:1 +0x4b
  github.com/prometheus/client_golang/prometheus/promhttp.flusherDelegator.Flush()
      /go/pkg/mod/github.com/prometheus/client_golang@v1.19.0/prometheus/promhttp/delegator.go:98 +0x76
  go:(*struct { *github.com/prometheus/client_golang/prometheus/promhttp.responseWriterDelegator; net/http.Pusher; net/http.Flusher; net/http.CloseNotifier }).Flush()
      <autogenerated>:1 +0x4b
  github.com/prometheus/client_golang/prometheus/promhttp.flusherDelegator.Flush()
      /go/pkg/mod/github.com/prometheus/client_golang@v1.19.0/prometheus/promhttp/delegator.go:98 +0x76
  go:(*struct { *github.com/prometheus/client_golang/prometheus/promhttp.responseWriterDelegator; net/http.Pusher; net/http.Flusher; net/http.CloseNotifier }).Flush()
      <autogenerated>:1 +0x4b
  github.com/gravitational/teleport/lib/kube/proxy/responsewriters.(*WatcherResponseWriter).Flush()
      /__w/teleport/teleport/lib/kube/proxy/responsewriters/watcher.go:200 +0x52
  net/http.(*ResponseController).Flush()
      /opt/go/src/net/http/responsecontroller.go:54 +0xef
  net/http.(*ResponseController).Flush-fm()
      <autogenerated>:1 +0x33
  net/http/httputil.(*maxLatencyWriter).delayedFlush()
      /opt/go/src/net/http/httputil/reverseproxy.go:706 +0xd5
  net/http/httputil.(*maxLatencyWriter).delayedFlush-fm()
      <autogenerated>:1 +0x33

Goroutine 76390 (running) created at:
  golang.org/x/sync/errgroup.(*Group).Go()
      /go/pkg/mod/golang.org/x/sync@v0.7.0/errgroup/errgroup.go:75 +0x124
  github.com/gravitational/teleport/lib/kube/proxy/responsewriters.(*WatcherResponseWriter).watchDecoder()
      /__w/teleport/teleport/lib/kube/proxy/responsewriters/watcher.go:270 +0x9b0
  github.com/gravitational/teleport/lib/kube/proxy/responsewriters.(*WatcherResponseWriter).WriteHeader.func1()
      /__w/teleport/teleport/lib/kube/proxy/responsewriters/watcher.go:162 +0x184
  golang.org/x/sync/errgroup.(*Group).Go.func1()
      /go/pkg/mod/golang.org/x/sync@v0.7.0/errgroup/errgroup.go:78 +0x97

Goroutine 76388 (finished) created at:
  time.goFunc()
      /opt/go/src/time/sleep.go:176 +0x44
==================
==================
WARNING: DATA RACE
Read at 0x00c004eb0ed1 by goroutine 76390:
  golang.org/x/net/http2.(*responseWriterState).writeChunk()
      /go/pkg/mod/golang.org/x/net@v0.24.0/http2/server.go:2615 +0x27e
  golang.org/x/net/http2.chunkWriter.Write()
      /go/pkg/mod/golang.org/x/net@v0.24.0/http2/server.go:2564 +0x48
  bufio.(*Writer).Flush()
      /opt/go/src/bufio/bufio.go:642 +0xf0
  golang.org/x/net/http2.(*responseWriter).FlushError()
      /go/pkg/mod/golang.org/x/net@v0.24.0/http2/server.go:2824 +0x9e
  golang.org/x/net/http2.(*responseWriter).Flush()
      /go/pkg/mod/golang.org/x/net@v0.24.0/http2/server.go:2814 +0x26
  net/http.Flusher.Flush-fm()
      <autogenerated>:1 +0x42
  github.com/felixge/httpsnoop.(*rw).Flush()
      /go/pkg/mod/github.com/felixge/httpsnoop@v1.0.4/wrap_generated_gteq_1.8.go:388 +0x10f
  go:(*struct { github.com/felixge/httpsnoop.Unwrapper; net/http.ResponseWriter; net/http.Flusher; net/http.CloseNotifier; net/http.Pusher }).Flush()
      <autogenerated>:1 +0x4b
  net/http.Flusher.Flush-fm()
      <autogenerated>:1 +0x42
  github.com/felixge/httpsnoop.(*rw).Flush()
      /go/pkg/mod/github.com/felixge/httpsnoop@v1.0.4/wrap_generated_gteq_1.8.go:388 +0x10f
  go:(*struct { github.com/felixge/httpsnoop.Unwrapper; net/http.ResponseWriter; net/http.Flusher; net/http.CloseNotifier; net/http.Pusher }).Flush()
      <autogenerated>:1 +0x4b
  github.com/prometheus/client_golang/prometheus/promhttp.flusherDelegator.Flush()
      /go/pkg/mod/github.com/prometheus/client_golang@v1.19.0/prometheus/promhttp/delegator.go:98 +0x76
  go:(*struct { *github.com/prometheus/client_golang/prometheus/promhttp.responseWriterDelegator; net/http.Pusher; net/http.Flusher; net/http.CloseNotifier }).Flush()
      <autogenerated>:1 +0x4b
  github.com/prometheus/client_golang/prometheus/promhttp.flusherDelegator.Flush()
      /go/pkg/mod/github.com/prometheus/client_golang@v1.19.0/prometheus/promhttp/delegator.go:98 +0x76
  go:(*struct { *github.com/prometheus/client_golang/prometheus/promhttp.responseWriterDelegator; net/http.Pusher; net/http.Flusher; net/http.CloseNotifier }).Flush()
      <autogenerated>:1 +0x4b
  github.com/gravitational/teleport/lib/kube/proxy/responsewriters.(*WatcherResponseWriter).Flush()
      /__w/teleport/teleport/lib/kube/proxy/responsewriters/watcher.go:200 +0x112
  github.com/gravitational/teleport/lib/kube/proxy/responsewriters.(*WatcherResponseWriter).watchDecoder.func1()
      /__w/teleport/teleport/lib/kube/proxy/responsewriters/watcher.go:266 +0xe6
  github.com/gravitational/teleport/lib/kube/proxy/responsewriters.(*WatcherResponseWriter).watchDecoder.func2()
      /__w/teleport/teleport/lib/kube/proxy/responsewriters/watcher.go:274 +0x121
  golang.org/x/sync/errgroup.(*Group).Go.func1()
      /go/pkg/mod/golang.org/x/sync@v0.7.0/errgroup/errgroup.go:78 +0x97

Previous write at 0x00c004eb0ed1 by goroutine 76388:
  golang.org/x/net/http2.(*responseWriterState).writeChunk()
      /go/pkg/mod/golang.org/x/net@v0.24.0/http2/server.go:2616 +0x29d
  golang.org/x/net/http2.chunkWriter.Write()
      /go/pkg/mod/golang.org/x/net@v0.24.0/http2/server.go:2564 +0xba
  golang.org/x/net/http2.(*responseWriter).FlushError()
      /go/pkg/mod/golang.org/x/net@v0.24.0/http2/server.go:2830 +0xc0
  golang.org/x/net/http2.(*responseWriter).Flush()
      /go/pkg/mod/golang.org/x/net@v0.24.0/http2/server.go:2814 +0x26
  net/http.Flusher.Flush-fm()
      <autogenerated>:1 +0x42
  github.com/felixge/httpsnoop.(*rw).Flush()
      /go/pkg/mod/github.com/felixge/httpsnoop@v1.0.4/wrap_generated_gteq_1.8.go:388 +0x10f
  go:(*struct { github.com/felixge/httpsnoop.Unwrapper; net/http.ResponseWriter; net/http.Flusher; net/http.CloseNotifier; net/http.Pusher }).Flush()
      <autogenerated>:1 +0x4b
  net/http.Flusher.Flush-fm()
      <autogenerated>:1 +0x42
  github.com/felixge/httpsnoop.(*rw).Flush()
      /go/pkg/mod/github.com/felixge/httpsnoop@v1.0.4/wrap_generated_gteq_1.8.go:388 +0x10f
  go:(*struct { github.com/felixge/httpsnoop.Unwrapper; net/http.ResponseWriter; net/http.Flusher; net/http.CloseNotifier; net/http.Pusher }).Flush()
      <autogenerated>:1 +0x4b
  github.com/prometheus/client_golang/prometheus/promhttp.flusherDelegator.Flush()
      /go/pkg/mod/github.com/prometheus/client_golang@v1.19.0/prometheus/promhttp/delegator.go:98 +0x76
  go:(*struct { *github.com/prometheus/client_golang/prometheus/promhttp.responseWriterDelegator; net/http.Pusher; net/http.Flusher; net/http.CloseNotifier }).Flush()
      <autogenerated>:1 +0x4b
  github.com/prometheus/client_golang/prometheus/promhttp.flusherDelegator.Flush()
      /go/pkg/mod/github.com/prometheus/client_golang@v1.19.0/prometheus/promhttp/delegator.go:98 +0x76
  go:(*struct { *github.com/prometheus/client_golang/prometheus/promhttp.responseWriterDelegator; net/http.Pusher; net/http.Flusher; net/http.CloseNotifier }).Flush()
      <autogenerated>:1 +0x4b
  github.com/gravitational/teleport/lib/kube/proxy/responsewriters.(*WatcherResponseWriter).Flush()
      /__w/teleport/teleport/lib/kube/proxy/responsewriters/watcher.go:200 +0x52
  net/http.(*ResponseController).Flush()
      /opt/go/src/net/http/responsecontroller.go:54 +0xef
  net/http.(*ResponseController).Flush-fm()
      <autogenerated>:1 +0x33
  net/http/httputil.(*maxLatencyWriter).delayedFlush()
      /opt/go/src/net/http/httputil/reverseproxy.go:706 +0xd5
  net/http/httputil.(*maxLatencyWriter).delayedFlush-fm()
      <autogenerated>:1 +0x33

Goroutine 76390 (running) created at:
  golang.org/x/sync/errgroup.(*Group).Go()
      /go/pkg/mod/golang.org/x/sync@v0.7.0/errgroup/errgroup.go:75 +0x124
  github.com/gravitational/teleport/lib/kube/proxy/responsewriters.(*WatcherResponseWriter).watchDecoder()
      /__w/teleport/teleport/lib/kube/proxy/responsewriters/watcher.go:270 +0x9b0
  github.com/gravitational/teleport/lib/kube/proxy/responsewriters.(*WatcherResponseWriter).WriteHeader.func1()
      /__w/teleport/teleport/lib/kube/proxy/responsewriters/watcher.go:162 +0x184
  golang.org/x/sync/errgroup.(*Group).Go.func1()
      /go/pkg/mod/golang.org/x/sync@v0.7.0/errgroup/errgroup.go:78 +0x97

Goroutine 76388 (finished) created at:
  time.goFunc()
      /opt/go/src/time/sleep.go:176 +0x44
==================

capnspacehook commented 3 months ago

I found and fixed the cause of the data race above, but I've run the test locally thousands of times (with all CPU cores, with a single core, etc.) and still can't reproduce the `unable to upgrade connection` issue.

GavinFrazar commented 3 months ago

Hit today as well: https://github.com/gravitational/teleport/actions/runs/9100380912/job/25015101195

Creating ephemeral container ephemeral-container in pod teletest/test-pod
Pod teletest/test-pod successfully patched. Waiting for container to become ready.
Ephemeral container ephemeral-container is ready.
{"caller":"proxy/sess.go:643","component":"proxy:proxy:kube","error":"unable to upgrade connection: container ephemeral-container not found in pod test-pod_teletest","level":"warning","message":"Executor failed while streaming.","pid":"44273.16","session":"9b2b15fb-43da-497c-8990-7f1329f6c1f0","timestamp":"2024-05-15T17:57:41Z"}
Teleport > Closing session...

...*snip*...

{"caller":"streamproto/proto.go:173","component":null,"error":"websocket: close sent","level":"warning","message":"Failed to read message from websocket","timestamp":"2024-05-15T17:57:41Z"}
    kube_integration_test.go:1545: 
            Error Trace:    /__w/teleport/teleport/integration/kube_integration_test.go:1545
                                        /__w/teleport/teleport/integration/kube_integration_test.go:173
            Error:          Received unexpected error:
                            Internal error occurred: error executing command in container: unable to upgrade connection: container ephemeral-container not found in pod test-pod_teletest
            Test:           TestKube/EphemeralContainers

Different line number, so unfortunately it looks like the failure is still there after the data race fix. Are you testing on a Linux box or macOS? Could we consider running the test container on EKS instead of KinD? We already have an EKS cluster running for e2e tests in ./e2e/aws; it might be worth trying that if the flakiness just comes from the hacky kube cluster setup.

capnspacehook commented 3 months ago

I'm testing on Linux using KinD, which should be very similar to what the Action runners are doing. I can't say how unreliable KinD is, since I haven't been able to reproduce this test failure even when running the tests overnight. I messaged Tiago about this earlier today; hopefully he can help me figure out what's causing the failures.

r0mant commented 3 months ago

@capnspacehook This is still failing: https://github.com/gravitational/teleport/actions/runs/9201793316/job/25310558380. This is from a v14 backport.

rosstimothy commented 3 months ago

This is particularly bad on v14. I've hit this a number of times this week on backports.

https://github.com/gravitational/teleport/actions/runs/9229501324/job/25395832828

rosstimothy commented 3 months ago

https://github.com/gravitational/teleport/actions/runs/9230006492/job/25397361578?pr=41991

ravicious commented 2 months ago

v16: https://github.com/gravitational/teleport/actions/runs/9352399815/job/25740482576#step:8:1359

    kube_integration_test.go:1546: 
            Error Trace:    /__w/teleport/teleport/integration/kube_integration_test.go:1546
                                        /__w/teleport/teleport/integration/kube_integration_test.go:174
            Error:          Received unexpected error:
                            Internal error occurred: error executing command in container: unable to upgrade connection: container ephemeral-container not found in pod test-pod_teletest
            Test:           TestKube/EphemeralContainers

ravicious commented 2 months ago

v14: https://github.com/gravitational/teleport/actions/runs/9353591607/job/25744471253#step:8:780

zmb3 commented 2 months ago

https://github.com/gravitational/teleport/actions/runs/9369157057/job/25792950653?pr=42364

rosstimothy commented 2 months ago

Possibly fixed by https://github.com/gravitational/teleport/pull/42068

tigrato commented 2 months ago

https://github.com/gravitational/teleport/pull/42068 fixed the issue. Closing.