tailscale / tailscale

The easiest, most secure way to use WireGuard and 2FA.
https://tailscale.com

Tailscale Operator on EKS - Unable to Write Machine Key to Store #13024

Open jezpac opened 1 month ago

jezpac commented 1 month ago

What is the issue?

On a managed EKS cluster, installing the Tailscale operator succeeds; however, the operator deployment goes into a CrashLoopBackOff because it fails to write the machine key to the operator Secret via the Kubernetes API.

Here is the output of the operator pod

 {"level":"debug","ts":"2024-08-04T03:53:36Z","logger":"tailscaled","msg":"Bringing router up..."}                             
 {"level":"debug","ts":"2024-08-04T03:53:36Z","logger":"tailscaled","msg":"[v1] warning: fakeRouter.Up: not implemented."}     
 {"level":"debug","ts":"2024-08-04T03:53:36Z","logger":"tailscaled","msg":"Clearing router settings..."}                       
 {"level":"debug","ts":"2024-08-04T03:53:36Z","logger":"tailscaled","msg":"[v1] warning: fakeRouter.Set: not implemented."}    
 {"level":"debug","ts":"2024-08-04T03:53:36Z","logger":"tailscaled","msg":"Starting network monitor..."}                       
 {"level":"debug","ts":"2024-08-04T03:53:36Z","logger":"tailscaled","msg":"Engine created."}                                   
 {"level":"debug","ts":"2024-08-04T03:53:36Z","logger":"tailscaled","msg":"wg: [v2] Routine: receive incoming receiveDERP - st 
 arted"}                                                                                                                       
 {"level":"debug","ts":"2024-08-04T03:53:36Z","logger":"tailscaled","msg":"wg: [v2] Routine: receive incoming mkReceiveFunc -  
 started"}                                                                                                                     
 {"level":"debug","ts":"2024-08-04T03:53:36Z","logger":"tailscaled","msg":"wg: [v2] Routine: receive incoming mkReceiveFunc -  
 started"}                                                                                                                     
 {"level":"debug","ts":"2024-08-04T03:53:36Z","logger":"tailscaled","msg":"pm: migrating \"_daemon\" profile to new format"}   
 {"level":"debug","ts":"2024-08-04T03:53:36Z","logger":"tailscaled","msg":"logpolicy: using system state directory \"/var/lib/ 
 tailscale\""}                                                                                                                 
 {"level":"debug","ts":"2024-08-04T03:53:36Z","logger":"tailscaled","msg":"[v1] netmap packet filter: (not ready yet)"}        
 2024/08/04 03:53:36 tsnet starting with hostname "tailscale-operator", varRoot "/root/.config/tsnet-operator"                 
 {"level":"debug","ts":"2024-08-04T03:53:36Z","logger":"tailscaled","msg":"Start"}                                             
 {"level":"debug","ts":"2024-08-04T03:53:36Z","logger":"tailscaled","msg":"[v\u0000JSON]1{\"Hostinfo\":{\"IPNVersion\":\"1.70. 
 0-td601f16e1\",\"BackendLogID\":\"9e9f158f52cf18890cdd99dd08d881e455e3d8b9a3b28dfc98900af719f807d9\",\"OS\":\"linux\",\"OSVer 
 sion\":\"5.10.219-208.866.amzn2.x86_64\",\"Container\":false,\"Env\":\"k8s\",\"Distro\":\"alpine\",\"DistroVersion\":\"3.18.7 
 \",\"App\":\"k8s-operator\",\"Desktop\":false,\"Package\":\"tsnet\",\"Hostname\":\"operator-f6cf896b9-h8ljk\",\"Machine\":\"x 
 86_64\",\"GoArch\":\"amd64\",\"GoArchVar\":\"v1\",\"GoVersion\":\"go1.22.5\",\"Cloud\":\"aws\",\"Userspace\":true,\"Userspace 
 Router\":true,\"AppConnector\":false}}"}                                                                                      
 {"level":"debug","ts":"2024-08-04T03:53:36Z","logger":"tailscaled","msg":"generating new machine key"}                        
 {"level":"debug","ts":"2024-08-04T03:53:41Z","logger":"tailscaled","msg":"error writing machine key to store: Post \"https:// 
 kubernetes.default.svc/api/v1/namespaces/tailscale/secrets\": context deadline exceeded"}                                     
 {"level":"fatal","ts":"2024-08-04T03:53:41Z","logger":"startup","msg":"starting tailscale server: tsnet: starting backend: in 
 itMachineKeyLocked: Post \"https://kubernetes.default.svc/api/v1/namespaces/tailscale/secrets\": context deadline exceeded"," 
 stacktrace":"main.initTSNet\n\ttailscale.com/cmd/k8s-operator/operator.go:156\nmain.main\n\ttailscale.com/cmd/k8s-operator/op 
 erator.go:94\nruntime.main\n\truntime/proc.go:271"}

Steps to reproduce

Are there any recent changes that introduced the issue?

No response

OS

Linux

OS version

No response

Tailscale version

Helm Chart Version 1.70.0

Other software

No response

Bug report

No response

irbekrm commented 1 month ago

Post \"https://kubernetes.default.svc/api/v1/namespaces/tailscale/secrets\": context deadline exceeded"," stacktrace":"main.initTSNet\n\ttailscale.com/cmd/k8s-operator/operator.go:156\nmain.main\n\ttailscale.com/cmd/k8s-operator/op erator.go:94\nruntime.main\n\truntime/proc.go:271"}

Hi, it looks like the operator is unable to reach the Kubernetes API server. You could verify that, for example, by exec-ing into the operator Pod and trying to reach the API server directly:

$ kubectl exec -it deploy/operator -n tailscale -- sh
# from the Pod:
$ apk add curl
$ curl -k https://kubernetes.default.svc/healthz

Perhaps you have a NetworkPolicy that prevents this, or maybe the API server cannot be reached at that DNS name in your cluster?
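
A quick way to rule both of those out might be something like the following (this assumes the default tailscale namespace, and that the alpine-based operator image has busybox nslookup available):

# list any NetworkPolicies that could affect the operator Pod
kubectl get networkpolicy -n tailscale

# from inside the operator Pod: check that the API server DNS name resolves
kubectl exec -it deploy/operator -n tailscale -- nslookup kubernetes.default.svc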

jezpac commented 1 month ago

Thanks for your speedy reply @irbekrm! It looks as though the pod can access the API server directly using curl.

/ # curl -k https://kubernetes.default.svc/healthz
ok
/ #

irbekrm commented 1 month ago

Thanks for confirming that @jezpac .

We have tested a recent-ish operator version on EKS, so I don't expect there to be an EKS-specific issue at this point.

Perhaps hitting that particular endpoint times out for some reason, or the timeout for curl is higher than for our client. Could you try to hit the same Secret creation endpoint from the operator and time it? i.e.:

time curl -k https://kubernetes.default.svc/api/v1/namespaces/tailscale/secrets -H "Authorization: Bearer $(cat /run/secrets/kubernetes.io/serviceaccount/token)" -X POST -d '{"metadata":{"name":"foo","namespace":"tailscale"},"kind":"Secret","apiVersion":"v1"}' -H "Content-Type: application/json"

(This will create an empty Secret named foo in the tailscale namespace using the operator's ServiceAccount token. I'm curious whether it succeeds and how long it takes.)

Also - did you pass any custom configuration options when installing the operator?

jezpac commented 1 month ago

Hi @irbekrm, thanks for taking the time to come back. The only custom options passed were the OAuth client secret and ID (the same as in the instructions).
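
For reference, the install was essentially the stock Helm instructions, roughly along these lines (values redacted; the exact flag and value names are from memory of the docs, so treat them as approximate):

helm repo add tailscale https://pkgs.tailscale.com/helmcharts
helm repo update
helm upgrade --install tailscale-operator tailscale/tailscale-operator \
  --namespace=tailscale --create-namespace \
  --set-string oauth.clientId="<redacted>" \
  --set-string oauth.clientSecret="<redacted>" \
  --wait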

Looks like I was able to create the Secret, but it took a little over 10 seconds:

/ # time curl -k https://kubernetes.default.svc/api/v1/namespaces/tailscale/secrets -H "Authorization: Bearer $(cat /run/secrets/kubernetes.io/serviceaccount/token)" -X POST -d '{"metadata":{"name":"foo","namespace":"tailscale"},"kind":"Secret","apiVersion":"v1"}' -H "Content-Type: application/json"
{
  "kind": "Secret",
  "apiVersion": "v1",
  "metadata": {
    "name": "foo",
    "namespace": "tailscale",
    "uid": "5080031b-8a2f-47c1-a41b-7155e6364d22",
    "resourceVersion": "13396996",
    "creationTimestamp": "2024-08-05T23:12:16Z",
    "managedFields": [
      {
        "manager": "curl",
        "operation": "Update",
        "apiVersion": "v1",
        "time": "2024-08-05T23:12:16Z",
        "fieldsType": "FieldsV1",
        "fieldsV1": {
          "f:type": {}
        }
      }
    ]
  },
  "type": "Opaque"
}
real    0m 10.05s
user    0m 0.00s
sys     0m 0.00s

Is there an option to increase the timeout for this operation when deploying the chart? 10 seconds seems like a while, so it might indicate the underlying issue; however, it's a vanilla EKS install, so I'm not quite sure how to address it.

Thanks again for your help.

jezpac commented 1 month ago

@irbekrm appreciating that 10 seconds is slow for this operation, might this be resolvable if the timeout were configurable (perhaps via an environment variable) in ipn/store/kubestore/store_kube.go? I'm not sure whether it's ReadState or WriteState that's failing in my case, but I see the timeout is hardcoded to 5 seconds.

https://github.com/tailscale/tailscale/blob/main/ipn/store/kubestore/store_kube.go#L55
https://github.com/tailscale/tailscale/blob/main/ipn/store/kubestore/store_kube.go#L85

I'll keep digging for reasons for slow API performance on EKS in the meantime.
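
In case it is useful for comparison, one low-impact way to time the read and write paths from a workstation with cluster access is something like this (the Secret name is just a placeholder; --dry-run=server exercises authentication and admission without persisting anything):

# read path, should be fast
time kubectl get secrets -n tailscale

# write path without persisting anything
time kubectl create secret generic timing-test -n tailscale --dry-run=server -o yaml >/dev/null

# an actual write, to include etcd in the measurement
time kubectl create secret generic timing-test -n tailscale
kubectl delete secret timing-test -n tailscale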

irbekrm commented 1 month ago

Thanks for digging into this!

I'm not sure whether it's ReadState or WriteState that's failing in my case, but I see the timeout is hardcoded to 5 seconds.

The operator will first try to create a Secret named operator and then read and write its state there. If it hasn't been able to create it, you could try to pre-create it (kubectl create secret generic operator -n tailscale) and see if the subsequent operations succeed.
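
It may also be worth double-checking that the operator's ServiceAccount is allowed to create and patch Secrets in that namespace, for example (the ServiceAccount name operator is assumed from the default install):

kubectl create secret generic operator -n tailscale

# check RBAC for the operator's ServiceAccount
kubectl auth can-i create secrets -n tailscale --as=system:serviceaccount:tailscale:operator
kubectl auth can-i patch secrets -n tailscale --as=system:serviceaccount:tailscale:operator

An RBAC problem would normally show up as a quick 403 rather than a timeout, so this mostly just rules it out.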

We might be able to increase the timeout for our kube state client or make it configurable. I'll first need to take a look at all the other places where we talk to the kube API server to see what the behaviour is there (mostly the c/r client).

I'll keep digging for reasons for slow API performance on EKS in the meantime.

Perhaps there is a webhook intercepting Secret creation calls? Unless all API calls take that long on that cluster.
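
If a webhook is involved, its timeoutSeconds is worth a look: admission webhooks default to a 10-second timeout, which would line up suspiciously well with the ~10s you measured if a webhook is slow or unreachable and its failurePolicy lets the request through anyway. Something like this shows which webhooks match Secrets and what their timeouts are:

kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations

# per configuration: matched resources, timeout and failure policy
kubectl get mutatingwebhookconfigurations -o custom-columns=NAME:.metadata.name,RESOURCES:.webhooks[*].rules[*].resources,TIMEOUT:.webhooks[*].timeoutSeconds,FAILURE:.webhooks[*].failurePolicy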

jezpac commented 1 month ago

Thanks @irbekrm. I checked for MutatingWebhookConfigurations in the cluster; there are the ones that come from AWS out of the box, plus their CloudWatch Observability hook, but none are scoped to Secrets. kubectl get secrets is pretty snappy (definitely less than 5 seconds), but any mutating method (PATCH, POST) is slow. It's a multi-AZ cluster, so maybe etcd has some speed issues there, but I don't see much evidence of that in the wild. I tried creating the generic Secret first, but sadly it hits the same error, this time timing out on the PATCH operation.

{
    "level": "fatal",
    "ts": "2024-08-10T00:59:57Z",
    "logger": "startup",
    "msg": "starting tailscale server: tsnet: starting backend: initMachineKeyLocked: error patching Secret operator with a /data field: Patch \"https://kubernetes.default.svc/api/v1/namespaces/tailscale/secrets/operator\": context deadline exceeded",
    "stacktrace": "main.initTSNet\n\ttailscale.com/cmd/k8s-operator/operator.go:156\nmain.main\n\ttailscale.com/cmd/k8s-operator/operator.go:94\nruntime.main\n\truntime/proc.go:271"
}

I'll keep digging but thanks for the help.
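
Another place that might show where the time goes is the API server's own metrics, if your credentials are allowed to read /metrics; the request and admission-webhook duration histograms break latency down by resource, verb and webhook name (metric and label names can vary slightly by Kubernetes version):

# request latency histograms for Secrets
kubectl get --raw /metrics | grep apiserver_request_duration_seconds | grep 'resource="secrets"'

# time spent in admission webhooks, per webhook name
kubectl get --raw /metrics | grep apiserver_admission_webhook_admission_duration_seconds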

jezpac commented 4 weeks ago

Hi @irbekrm, just an update: I haven't been able to find any obvious cause for the Kube API mutation calls being so slow in EKS. I'll leave this issue open for reference in case someone else has the same issue.

irbekrm commented 4 weeks ago

@jezpac thanks for testing and getting back to us.

We'll take a look at removing the hardcoded timeout so that it works in cases like yours.