gravitational / teleport

The easiest, and most secure way to access and protect all of your infrastructure.
https://goteleport.com
GNU Affero General Public License v3.0

Teleport seems to kill long running websockets TCP connections #35198

Closed: alexandreMegel closed this issue 11 months ago

alexandreMegel commented 11 months ago

Hello,

I am facing an issue where Teleport seems to kill TCP connections after ~30 seconds. I have deployed Teleport in an AWS EKS cluster and I have a web application running in another AWS EKS cluster. The Teleport cluster is behind an AWS ALB with an idle timeout of 3600 seconds.

Expected behavior:

Teleport should forward long-running requests through the Teleport cluster proxy. I am using websockets.

Current behavior:

I get HTTP 502 errors when using my web application through Teleport; everything works fine when I am not using Teleport.

Teleport seems to kill the TCP connections either between the Teleport agent and the Teleport proxy, or between the Teleport proxy and my ALB.
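Not part of the original report, but one way to narrow down which hop drops an idle connection is to hold open a plain TCP connection with aggressive OS-level keepalives and see how long it survives through each endpoint. A minimal sketch (host and port are placeholders, and the TCP_KEEP* knobs are Linux-specific):

```python
import socket

def open_keepalive_connection(host: str, port: int) -> socket.socket:
    """Open a TCP connection with aggressive OS-level keepalives.

    Keepalive probes generate traffic on an otherwise idle connection,
    which can keep intermediaries such as load balancers from expiring it.
    """
    s = socket.create_connection((host, port), timeout=10)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Linux-specific knobs: start probing after 15s idle, then every 15s.
    if hasattr(socket, "TCP_KEEPIDLE"):
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 15)
    if hasattr(socket, "TCP_KEEPINTVL"):
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 15)
    return s
```

If such a connection survives past the ~45 seconds seen in the logs below, the drop is more likely an idle-timeout somewhere in the path than a network fault.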

Teleport version:

  1. Teleport cluster: Version 14.2.0 deployed using teleport-cluster helm chart
  2. Teleport agent: Version 14.2.0 deployed using teleport-kube-agent helm chart with app auto discovery enabled

Logs:

Teleport agent:

2023-11-30T09:54:19Z INFO [APP:SERVI] Round trip: POST http://**********.svc.cluster.local:8081/**********/socket/subscription, code: 200, duration: 400.14737ms tls:version: 304, tls:resume:true, tls:csuite:1301, tls:server:74656c65706f72742d65787465726e616c2d65752e6a756d702d736161732e636f6d.teleport.cluster.local reverseproxy/reverse_proxy.go:236
2023-11-30T09:54:56Z INFO [APP:SERVI] Round trip: POST http:///**********.svc.cluster.local:8081/**********/table_conservation, code: 200, duration: 36.839325093s tls:version: 304, tls:resume:true, tls:csuite:1301, tls:server:74656c65706f72742d65787465726e616c2d65752e6a756d702d736161732e636f6d.teleport.cluster.local reverseproxy/reverse_proxy.go:236

Everything seems OK from the Teleport agent's side.

AWS ALB:

app/k8s-teleport-ingresst-b0a2042339/06d18c67f3eb0542 ************* 10.235.1.146:3080 0.000 44.513 -1 502 - 2270 595 "POST https://********************:443/*********/table_conservation HTTP/2.0" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-west-3:778273502744:targetgroup/k8s-teleport-teleport-db898572d7/655ecf6aa420b99a "Root=1-655621fb-62759b0e4c8ce8cd4325056c" "teleport-*******************.com" "arn:aws:acm:eu-west-3:778273502744:certificate/ed853ecb-90cf-*******************" 2 2023-11-16T14:06:51.148000Z "forward" "-" "-" "10.235.1.146:3080" "-" "-" "-"

We can see that the ALB didn't get any response from the Teleport cluster and then returned an HTTP 502 error.
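The interesting fields in the entry above are the three processing times and the status codes: a target_processing_time of ~44.5s with a response_processing_time of -1 and elb_status_code 502 means the target stopped responding before the ALB received a full response. A small sketch extracting those fields, assuming the standard ALB access-log field order with the leading type and timestamp fields trimmed, as in the entry shown:

```python
def parse_alb_times(entry: str) -> dict:
    """Extract processing times and status codes from an ALB access-log
    entry shaped like: elb, client, target, request_processing_time,
    target_processing_time, response_processing_time, elb_status_code,
    target_status_code, ..."""
    fields = entry.split()
    return {
        "request_processing_time": float(fields[3]),
        "target_processing_time": float(fields[4]),
        "response_processing_time": float(fields[5]),
        "elb_status_code": fields[6],
        "target_status_code": fields[7],
    }
```

A response_processing_time of -1.0 together with a "-" target status code is the signature of the target (here, the Teleport proxy) closing the connection mid-request.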

Teleport configuration:

Teleport proxy configuration:

    auth_service:
      client_idle_timeout: 6h
      client_idle_timeout_message: Connection closed after 6hours without activity
      disconnect_expired_cert: false
      enabled: false
    proxy_service:
      enabled: true
      https_keypairs:
      - cert_file: /etc/teleport-tls/tls.crt
        key_file: /etc/teleport-tls/tls.key
      https_keypairs_reload_interval: 12h
      public_addr: teleport-************.com:443
      trust_x_forwarded_for: true
    ssh_service:
      enabled: false
    teleport:
      auth_server: teleport-auth.teleport-external.svc.cluster.local:3025
      join_params:
        method: kubernetes
        token_name: teleport-proxy
      log:
        format:
          extra_fields:
          - timestamp
          - level
          - component
          - caller
          output: text
        output: stderr
        severity: DEBUG
    version: v3

Teleport auth configuration:

    auth_service:
      # keep_alive_interval determines the interval at which Teleport will 
      # send keep-alive messages for client and reverse tunnel connections.
      # The default is set to 5 minutes (300 seconds) to stay lower than the
      # common load balancer timeout of 350 seconds.
      # keep_alive_count_max is the number of missed keep-alive messages before
      # the Teleport cluster tears down the connection to the client or service.
      keep_alive_interval: 30s
      keep_alive_count_max: 300

      authentication:
        local_auth: true
        second_factor: "on"
        type: local
        webauthn:
          rp_id: teleport-**************.com
      cluster_name: teleport-**************.com
      enabled: true
      proxy_listener_mode: multiplex
      session_recording: node-sync
    kubernetes_service:
      enabled: true
      kube_cluster_name: shared-services
      listen_addr: 0.0.0.0:3026
      public_addr: teleport-auth.teleport-external.svc.cluster.local:3026
    proxy_service:
      enabled: false
    ssh_service:
      enabled: false
    teleport:
      auth_server: 127.0.0.1:3025
      log:
        format:
          extra_fields:
          - timestamp
          - level
          - component
          - caller
          output: text
        output: stderr
        severity: DEBUG
      storage:
        audit_events_uri:
        - dynamodb://******************
        audit_sessions_uri: s3://*****************
        auto_scaling: false
        continuous_backups: true
        region: eu-west-3
        table_name: teleport-*************************-table
        type: dynamodb
    version: v3
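As a sanity check on the keep_alive values in the auth config above, and assuming (as the config comment describes) that the cluster tears down a connection only after keep_alive_count_max consecutive missed keep-alives, the teardown budget works out to far more than 45 seconds:

```python
# Values taken from the auth_service config above.
keep_alive_interval_s = 30
keep_alive_count_max = 300

# Worst case before Teleport itself tears down the connection:
teardown_budget_s = keep_alive_interval_s * keep_alive_count_max
print(teardown_budget_s)  # 9000 seconds, i.e. 2.5 hours
```

If the keep-alives traverse the path through the ALB, traffic every 30 seconds should also defeat the ALB's 3600-second idle timeout, which supports the suspicion that something other than these settings is closing the connection at ~45 seconds.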

Teleport agent configuration deployed in the AWS EKS cluster where the web application is running:

app_service:
  enabled: true
  resources:
  - labels:
      teleport.dev/kubernetes-cluster: *************
      teleport.dev/origin: discovery-kubernetes
auth_service:
  enabled: false
db_service:
  enabled: false
discovery_service:
  discovery_group: ************
  enabled: true
  kubernetes:
  - labels:
      teleport/discovery: enabled
    namespaces:
    - '*'
    types:
    - app
kubernetes_service:
  enabled: true
  kube_cluster_name: ***********
  labels:
    bot: allowed
    customer: *********
    kube_region: eu-west-3
proxy_service:
  enabled: false
ssh_service:
  enabled: false
teleport:
  join_params:
    method: token
    token_name: /etc/teleport-secrets/auth-token
  log:
    format:
      extra_fields:
      - timestamp
      - level
      - component
      - caller
      output: text
    output: stderr
    severity: INFO
  proxy_server: teleport-**************.com:443
version: v3

Thank you for your help.

gecube commented 11 months ago

Hi @alexandreMegel

Glad to see you here! I am not a member of the Teleport team, but I faced the same issue.

I installed the Teleport cluster as a Helm chart. I used the DO installation instructions, because I did not want to create the Amazon S3 bucket and other AWS-related resources. But I forgot to set the proper annotations on the Teleport service, so I repeatedly hit the same issue as you. Finally, I realised that I needed to set the proper annotations. Here is the snippet:

    proxy:
      highAvailability:
        replicaCount: 1
      annotations:
        serviceAccount:
          eks.amazonaws.com/role-arn: arn:aws:iam::308712144460:role/teleport-discovery
    annotations:
      service:
        service.beta.kubernetes.io/aws-load-balancer-backend-protocol: tcp
        service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: '60'
        service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: 'true'
        service.beta.kubernetes.io/aws-load-balancer-type: nlb
        service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing

Works like a charm for me! It is also worth mentioning that Teleport is deployed on top of an EKS cluster with the aws-load-balancer-controller installed. I hope this information will help you.

alexandreMegel commented 11 months ago

Hi @gecube, thank you for your answer.

I am currently not using an NLB but a Layer 7 ALB; all my k8s services are of type ClusterIP.

However, I have already set up my ingress with the right ALB annotations (I think):

metadata:
  annotations:
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:eu-west-3:778273502744:certificate/ed853ecb-90cf-*******
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/ssl-redirect: "443"
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/actions.ssl-redirect: '{"Type": "redirect", "RedirectConfig":
      { "Protocol": "HTTPS", "Port": "443", "StatusCode": "HTTP_301"}}'
    alb.ingress.kubernetes.io/backend-protocol: HTTPS
    alb.ingress.kubernetes.io/healthcheck-protocol: HTTPS
    alb.ingress.kubernetes.io/success-codes: 200-310
    alb.ingress.kubernetes.io/load-balancer-attributes: idle_timeout.timeout_seconds=3600
gecube commented 11 months ago

That's not the recommended setup. Please refer to https://github.com/gravitational/teleport/blob/3b8aba9779b82addd19afde960cc3e1782a5b670/examples/chart/teleport-cluster/templates/proxy/service.yaml#L23 — an NLB is what the Teleport developers recommend. No idea what's wrong with your ALB, sorry.

zmb3 commented 11 months ago

This was recently fixed and will be available in 14.2.1 in the next couple of days.

zmb3 commented 11 months ago

Fixed by #34843