livekit / livekit-helm

LiveKit Helm charts
https://docs.livekit.io
Apache License 2.0

Node won't be ready on EKS (status is always Pending) #22

Closed acgacgacgacg closed 2 years ago

acgacgacgacg commented 2 years ago

I basically followed the guide for deploying to Kubernetes, but the pod never becomes ready (its status stays Pending). Here is the result.

ubuntu@ip-172-31-7-14:~/livekit-k8ts$ kubectl get pod -n kube-system
NAME                                            READY   STATUS    RESTARTS   AGE
aws-load-balancer-controller-66cffc9868-66vgl   1/1     Running   0          162m
aws-load-balancer-controller-66cffc9868-zx2qb   1/1     Running   0          162m
coredns-9f6f89c76-6zbl8                         1/1     Running   0          3h11m
coredns-9f6f89c76-q98f7                         1/1     Running   0          3h11m
livekit-server-58588cc88c-9hbx7                 0/1     Pending   0          39m

I use an ALB as the load balancer and created a public certificate in ACM, so I skipped the Importing SSL Certificates step in the guide.
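One quick way to see why the pod stays Pending is to look at the scheduler events (pod name taken from the output above; the reasons listed in the comment are common possibilities, not confirmed causes):

kubectl describe pod -n kube-system livekit-server-58588cc88c-9hbx7
# Check the Events section at the end. Typical reasons are insufficient
# CPU/memory on the nodes, taints, or no schedulable EC2 nodes at all
# (e.g. a Fargate-only cluster); the chart also uses host networking,
# so two replicas cannot share one node.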

Here is the values.yaml file I used.

replicaCount: 1

livekit:
  # port: 7880
  log_level: info
  rtc:
    use_external_ip: true
    # default ports used
    port_range_start: 50000
    port_range_end: 60000
    tcp_port: 7801
  redis:
    # address: <redis_host:port>
    # db: 0
    # username:
    # password:
  # one or more API key/secret pairs
  # see https://docs.livekit.io/guides/getting-started/#generate-api-key-and-secret
  keys:
    myapikey: API6JLCdtsxYeCp
  turn:
    enabled: true
    # must match domain of your tls cert
    domain: livekit-turn.room.link
    # tls_port must be 443 if turn load balancer is disabled
    tls_port: 3478
    # udp_port should be 443 for best connectivity through firewalls
    udp_port: 443
    secretName: eCkvaOf5BQVfig62fnjK02foYNtRBflYCn68fKvwKSjP
    # valid values: disable, aws, gke, do
    # tls cert and domain are required, even when load balancer is disabled
    loadBalancerType: disable

loadBalancer:
  # valid values: disable, alb, aws, gke, gke-managed-cert, do
  # on AWS, we recommend using alb load balancer, which supports TLS termination
  # in order to use alb, aws-ingress-controller must be installed
  # https://docs.aws.amazon.com/eks/latest/userguide/alb-ingress.html
  # for gke-managed-cert type follow https://cloud.google.com/kubernetes-engine/docs/how-to/managed-certs
  # and set staticIpName to your reserved static IP
  # for DO follow https://www.digitalocean.com/community/tutorials/how-to-set-up-an-nginx-ingress-on-digitalocean-kubernetes-using-helm
  # steps 2 and 4 to setup your ingress controller
  type: alb
  # staticIpName: <nameofIpAddressCreated>
  # Uncomment and enter host names if TLS is desired.
  # TLS is not supported with `aws` load balancer
  tls:
    # - hosts:
    #   - livekit.myhost.com
    # with ALB, certificates need to reside in ACM for self-discovery
    # with GKE, specify one or more secrets to use for the certificate
    # with DO, use cert-manager and create a certificate for turn. Load balancer is automatic
    # see: https://cloud.google.com/kubernetes-engine/docs/how-to/ingress-multi-ssl#specifying_certificates_for_your_ingress
    #   secretName: <mysecret>

# when true (default), optimizes network stack for service
# increases UDP send and receive buffers
optimizeNetwork: true

# autoscaling requires resources to be defined
autoscaling:
  # set to true to enable autoscaling. when set, ignores replicaCount
  enabled: true
  minReplicas: 1
  maxReplicas: 5
  targetCPUUtilizationPercentage: 60

# if LiveKit should run only on specific nodes
# this can be used to isolate designated nodes
nodeSelector: {}
  # node.kubernetes.io/instance-type: c5.2xlarge

resources: {}
  # Due to port restrictions, you can run only one instance of LiveKit per physical
  # node. Because of that, we recommend giving it plenty of resources to work with
  # limits:
  #   cpu: 6000m
  #   memory: 2048Mi
  # requests:
  #   cpu: 4000m
  #   memory: 1024Mi

I don't know what I'm missing. Would you please help me solve this problem?
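One way to sanity-check the values before installing is to render the chart locally and dry-run the result (this assumes the chart repo and name from the LiveKit docs):

helm repo add livekit https://helm.livekit.io
helm repo update
# Render the manifests with the values above and validate them without applying:
helm template livekit-server livekit/livekit-server -f values.yaml \
  | kubectl apply --dry-run=client -f -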

acgacgacgacg commented 2 years ago

Update:

Fargate clusters are not supported. I managed to get the pod scheduled onto an EC2 node, but then a FailedMount warning occurred and the pod got stuck.

Events:
  Type     Reason       Age                    From               Message
  ----     ------       ----                   ----               -------
  Normal   Scheduled    12m                    default-scheduler  Successfully assigned kube-system/livekit-server-58588cc88c-lwgz5 to ip-192-168-7-203.ap-northeast-1.compute.internal
  Warning  FailedMount  3m20s (x2 over 7m50s)  kubelet            Unable to attach or mount volumes: unmounted volumes=[lkturncert], unattached volumes=[kube-api-access-9g2rf lkturncert]: timed out waiting for the condition
  Warning  FailedMount  112s (x13 over 12m)    kubelet            MountVolume.SetUp failed for volume "lkturncert" : secret "eCkvaOf5BQVfig62fnjK02foYNtRBflYCn68fKvwKSjP" not found
  Warning  FailedMount  66s (x3 over 10m)      kubelet            Unable to attach or mount volumes: unmounted volumes=[lkturncert], unattached volumes=[lkturncert kube-api-
acgacgacgacg commented 2 years ago

Update:

I ran kubectl create secret tls <name> --cert <cert-file> --key <key-file> --namespace <namespace> and that solved the volume error. I had misunderstood what secretName means: it has to name a Kubernetes TLS secret, not the LiveKit API secret.
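For reference, the full command looks roughly like this (the cert/key file names are placeholders; the secret name matches the one used later in this thread and must equal livekit.turn.secretName, in the same namespace as the release):

kubectl create secret tls secret-wildcard \
  --cert ./fullchain.pem --key ./privkey.pem \
  --namespace kube-system
# Confirm the secret exists where the pod expects it:
kubectl get secret secret-wildcard -n kube-system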

Now, another problem: I can't pass the connection test, even though I opened the specified ports in the load balancer's security group. Here is a screenshot of the connection test:

image
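To narrow down where it fails, a few quick reachability checks can help (the hostname, node IP, and ports are placeholders; use the rtc.tcp_port and turn.tls_port values from your values.yaml):

# Signalling endpoint behind the ALB; it should answer over HTTPS:
curl -i https://livekit.example.com/
# TURN/TLS and RTC-over-TCP ports, tested directly against a node's public IP:
nc -vz <node-public-ip> <turn-tls-port>
nc -vz <node-public-ip> <rtc-tcp-port>
# Note: the RTC ports (including the UDP range) must be open on the node
# security group, not only on the load balancer, because media traffic goes
# directly to the node.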
acgacgacgacg commented 2 years ago

I managed to make the WebSocket connection work, but failed to establish a WebRTC connection.

image

And the log of my livekit node looks like the following. It doesn't seem to be the firewall problem suggested in the warning.

2022-03-30T07:08:31.788Z        INFO    livekit rtc/room.go:225 new participant joined  {"room": "stark-tower", "roomID": "RM_EoTJgbvg3pmz", "pID": "PA_vtCxoWnG8FQY", "participant": "tony_stark", "protocol": 5, "options": {"AutoSubscribe":true}}
2022-03-30T07:08:31.788Z        INFO    livekit service/rtcservice.go:203       new client WS connected {"room": "stark-tower", "participant": "tony_stark", "connID": "tony_stark", "roomID": "RM_EoTJgbvg3pmz"}
2022-03-30T07:08:37.756Z        INFO    livekit service/rtcservice.go:184       server closing WS connection    {"room": "stark-tower", "participant": "tony_stark", "connID": "tony_stark"}
2022-03-30T07:08:38.739Z        INFO    livekit rtc/room.go:225 new participant joined  {"room": "stark-tower", "roomID": "RM_EoTJgbvg3pmz", "pID": "PA_ShPgqEotP4DM", "participant": "tony_stark", "protocol": 5, "options": {"AutoSubscribe":true}}
2022-03-30T07:08:38.739Z        INFO    livekit service/rtcservice.go:203       new client WS connected {"room": "stark-tower", "participant": "tony_stark", "connID": "tony_stark", "roomID": "RM_EoTJgbvg3pmz"}
2022-03-30T07:08:43.799Z        INFO    livekit service/rtcservice.go:184       server closing WS connection    {"room": "stark-tower", "participant": "tony_stark", "connID": "tony_stark"}
2022-03-30T07:09:13.153Z        INFO    livekit rtc/room.go:485 closing room    {"room": "stark-tower", "roomID": "RM_EoTJgbvg3pmz"}
2022-03-30T07:09:13.153Z        INFO    livekit logger/logger.go:26     deleting room state     {"room": "stark-tower"}
2022-03-30T07:09:13.153Z        INFO    livekit logger/logger.go:26     room closed
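Since the WebSocket (signalling) path works but WebRTC does not, it is worth confirming that the worker nodes are actually reachable from the internet; rtc.use_external_ip advertises the node's public-facing address, which only helps if inbound traffic can reach it:

kubectl get nodes -o wide
# The EXTERNAL-IP column should show a public IP for each node. If it is
# <none>, the nodes sit in private subnets behind a NAT gateway and inbound
# media traffic cannot reach them.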
acgacgacgacg commented 2 years ago

Still cannot deploy. Stuck at establishing the WebRTC connection.

Here is my values.yaml. Can anyone please help?

replicaCount: 1

livekit:
  port: 7880
  log_level: info
  rtc:
    use_external_ip: true
    # default ports used
    port_range_start: 50000
    port_range_end: 60000
    tcp_port: 7881
  redis:
    address: my-redis-master.kube-system.svc.cluster.local:6379
    db: 0
    username: ""
    password: "4F18ejfu6T"
  # one or more API key/secret pairs
  # see https://docs.livekit.io/guides/getting-started/#generate-api-key-and-secret
  keys:
    AP********: eCkva*****************************
  turn:
    enabled: true
    # must match domain of your tls cert
    domain: livekit-turn.release.room8.link
    # tls_port must be 443 if turn load balancer is disabled
    tls_port: 5349
    # udp_port should be 443 for best connectivity through firewalls
    udp_port: 443
    secretName: secret-wildcard
    # valid values: disable, aws, gke, do
    # tls cert and domain are required, even when load balancer is disabled
    loadBalancerType: disable

loadBalancer:
  # valid values: disable, alb, aws, gke, gke-managed-cert, do
  # on AWS, we recommend using alb load balancer, which supports TLS termination
  # in order to use alb, aws-ingress-controller must be installed
  # https://docs.aws.amazon.com/eks/latest/userguide/alb-ingress.html
  # for gke-managed-cert type follow https://cloud.google.com/kubernetes-engine/docs/how-to/managed-certs
  # and set staticIpName to your reserved static IP
  # for DO follow https://www.digitalocean.com/community/tutorials/how-to-set-up-an-nginx-ingress-on-digitalocean-kubernetes-using-helm
  # steps 2 and 4 to setup your ingress controller
  type: alb
  # staticIpName: <nameofIpAddressCreated>
  # Uncomment and enter host names if TLS is desired.
  # TLS is not supported with `aws` load balancer
  tls:
    - hosts:
        - livekit.release.room8.link
    # with ALB, certificates need to reside in ACM for self-discovery
    # with GKE, specify one or more secrets to use for the certificate
    # with DO, use cert-manager and create a certificate for turn. Load balancer is automatic
    # see: https://cloud.google.com/kubernetes-engine/docs/how-to/ingress-multi-ssl#specifying_certificates_for_your_ingress
    #   secretName: secret-wildcard

# when true (default), optimizes network stack for service
# increases UDP send and receive buffers
optimizeNetwork: true

# autoscaling requires resources to be defined
autoscaling:
  # set to true to enable autoscaling. when set, ignores replicaCount
  enabled: true
  minReplicas: 1
  maxReplicas: 5
  targetCPUUtilizationPercentage: 60

# if LiveKit should run only on specific nodes
# this can be used to isolate designated nodes
nodeSelector: {}
  # node.kubernetes.io/instance-type: c5.2xlarge

resources: {}
  # Due to port restrictions, you can run only one instance of LiveKit per physical
  # node. Because of that, we recommend giving it plenty of resources to work with
  # limits:
  #   cpu: 6000m
  #   memory: 2048Mi
  # requests:
  #   cpu: 4000m
  #   memory: 1024Mi
acgacgacgacg commented 2 years ago

And here is the pod list. I also checked the logs and describe output for the livekit pod; no errors are shown.

$ kubectl get pod -A
NAMESPACE     NAME                                            READY   STATUS    RESTARTS   AGE
kube-system   aws-load-balancer-controller-6dcb7c6975-lfmcf   1/1     Running   0          25m
kube-system   aws-load-balancer-controller-6dcb7c6975-r46s2   1/1     Running   0          25m
kube-system   aws-node-fc8k2                                  1/1     Running   0          52m
kube-system   aws-node-zmpnn                                  1/1     Running   0          52m
kube-system   coredns-76f4967988-hwhlv                        1/1     Running   0          60m
kube-system   coredns-76f4967988-tjqkz                        1/1     Running   0          60m
kube-system   kube-proxy-sf8dn                                1/1     Running   0          52m
kube-system   kube-proxy-znl8p                                1/1     Running   0          52m
kube-system   livekit-server-7ddbf7576d-txsp6                 1/1     Running   0          19m
kube-system   livekit-server-sysctl-j62gw                     1/1     Running   0          19m
kube-system   livekit-server-sysctl-k55xh                     1/1     Running   0          19m
kube-system   my-redis-master-0                               1/1     Running   0          23m
kube-system   my-redis-replicas-0                             1/1     Running   0          23m
kube-system   my-redis-replicas-1                             1/1     Running   0          22m
kube-system   my-redis-replicas-2                             1/1     Running   0          21m
acgacgacgacg commented 2 years ago

Finally I am able to deploy livekit to EKS. The problems were entirely due to my being new to infrastructure topics like network settings.

The WebRTC connection failure above was caused by the NAT gateway, which is created automatically if you use eksctl. Here is the command I used to create the cluster; the last flag, which disables NAT, is important. (I feel like a fool for not noticing that the documentation says the nodes need a direct internet connection.)

eksctl create cluster \
--name livekit-server \
--region ap-northeast-1 \
--nodegroup-name livekit-server \
--nodes 2 \
--nodes-min 1 \
--nodes-max 5 \
--with-oidc \
--ssh-access \
--ssh-public-key "aws_admin" \
--managed \
--node-type c5.large \
--node-volume-size=80 \
--vpc-nat-mode Disable
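
After the cluster comes up, it is worth verifying that the nodes really did get public IPs and that the node security group allows the media ports (the security group ID is a placeholder; the ports come from the values.yaml above):

kubectl get nodes -o wide          # EXTERNAL-IP should not be <none>

# Open the RTC/TURN ports on the node security group (placeholder group ID):
aws ec2 authorize-security-group-ingress --group-id <node-sg-id> \
  --protocol udp --port 50000-60000 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id <node-sg-id> \
  --protocol tcp --port 7881 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id <node-sg-id> \
  --protocol udp --port 443 --cidr 0.0.0.0/0     # turn udp_port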