jezpac opened 1 month ago
Post \"https://kubernetes.default.svc/api/v1/namespaces/tailscale/secrets\": context deadline exceeded", "stacktrace": "main.initTSNet\n\ttailscale.com/cmd/k8s-operator/operator.go:156\nmain.main\n\ttailscale.com/cmd/k8s-operator/operator.go:94\nruntime.main\n\truntime/proc.go:271"}
Hi, it looks like the operator is unable to reach the Kubernetes API server. You could verify that, for example, by exec-ing into the operator Pod and trying to reach the API server directly:
$ kubectl exec -it deploy/operator -n tailscale -- sh
# from the Pod:
$ apk add curl
$ curl -k https://kubernetes.default.svc/healthz
Perhaps you have a NetworkPolicy that prevents that or maybe the API server cannot be reached on the DNS name in your cluster?
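For example, listing NetworkPolicies would show whether any exist that could affect the operator's egress (assuming the default tailscale namespace):
$ kubectl get networkpolicy -n tailscale
$ kubectl get networkpolicy -A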
Thanks for your speedy reply @irbekrm! It looks as though the pod can access the API server directly using curl.
/ # curl -k https://kubernetes.default.svc/healthz
ok
/ #
Thanks for confirming that @jezpac .
We have tested a recent-ish operator version on EKS, so I don't expect there to be an EKS-specific issue at this point.
Perhaps hitting that particular endpoint times out for some reason, or the timeout for curl is higher than for our client. Could you try to hit the same Secret creation endpoint from the operator and time it? i.e.
time curl -k https://kubernetes.default.svc/api/v1/namespaces/tailscale/secrets -H "Authorization: Bearer $(cat /run/secrets/kubernetes.io/serviceaccount/token)" -X POST -d '{"metadata":{"name":"foo","namespace":"tailscale"},"kind":"Secret","apiVersion":"v1"}' -H "Content-Type: application/json"
(this will create an empty Secret named foo in the tailscale namespace using the operator's ServiceAccount token - curious whether that succeeds and how long it takes)
Also - did you pass any custom configuration options when installing the operator?
Hi @irbekrm, thanks for taking the time to come back. The only custom options passed were the OAuth client secret and ID (same as the instructions).
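i.e. effectively the documented install command (reconstructed here for reference, with values redacted):
helm upgrade --install tailscale-operator tailscale/tailscale-operator \
  --namespace=tailscale --create-namespace \
  --set-string oauth.clientId="<client id>" \
  --set-string oauth.clientSecret="<client secret>" \
  --wait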
Looks like I was able to add the secret but it took a little over 10 seconds:
/ # time curl -k https://kubernetes.default.svc/api/v1/namespaces/tailscale/secrets -H "Authorization: Bearer $(cat /run/secrets/kubernetes.io/serviceaccount/token)" -X POST -d '{"metadata":{"name":"foo","namespace":"tailscale"},"kind":"Secret","apiVersion":"v1"}' -H "Content-Type: application/json"
{
  "kind": "Secret",
  "apiVersion": "v1",
  "metadata": {
    "name": "foo",
    "namespace": "tailscale",
    "uid": "5080031b-8a2f-47c1-a41b-7155e6364d22",
    "resourceVersion": "13396996",
    "creationTimestamp": "2024-08-05T23:12:16Z",
    "managedFields": [
      {
        "manager": "curl",
        "operation": "Update",
        "apiVersion": "v1",
        "time": "2024-08-05T23:12:16Z",
        "fieldsType": "FieldsV1",
        "fieldsV1": {
          "f:type": {}
        }
      }
    ]
  },
  "type": "Opaque"
}
real	0m 10.05s
user	0m 0.00s
sys	0m 0.00s
Is there an option to increase the timeout deadline for this operation when deploying the chart? 10 seconds seems like a while, so it might indicate the issue; however, it's a vanilla EKS install so I'm not quite sure how to address it.
Thanks again for your help.
@irbekrm appreciating that 10 seconds is slow for this operation, might this be resolvable if the timeout were configurable (perhaps via an env var) in ipn/store/kubestore/store_kube.go? I'm not sure if it's the ReadState or the WriteState that is raising it in my circumstances, but I see the timeout is hardcoded to 5 seconds.
https://github.com/tailscale/tailscale/blob/main/ipn/store/kubestore/store_kube.go#L55 https://github.com/tailscale/tailscale/blob/main/ipn/store/kubestore/store_kube.go#L85
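Purely as a sketch of what I mean (the TS_KUBESTORE_TIMEOUT variable and the kubeAPITimeout helper are hypothetical names, not existing options):

// Hypothetical sketch: an optional env-var override for the kube client timeout.
// TS_KUBESTORE_TIMEOUT is an invented name for illustration; today the value is hardcoded.
package kubestore

import (
	"context"
	"os"
	"time"
)

// kubeAPITimeout returns the per-request timeout, with an optional env override.
func kubeAPITimeout() time.Duration {
	if v := os.Getenv("TS_KUBESTORE_TIMEOUT"); v != "" {
		if d, err := time.ParseDuration(v); err == nil && d > 0 {
			return d
		}
	}
	return 5 * time.Second // current hardcoded default
}

// withAPITimeout would wrap the existing context.WithTimeout(ctx, 5*time.Second) calls.
func withAPITimeout(ctx context.Context) (context.Context, context.CancelFunc) {
	return context.WithTimeout(ctx, kubeAPITimeout())
}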
I'll keep digging for reasons for slow API performance on EKS in the meantime.
Thanks for digging into this!
I'm not sure if it's the ReadState or the WriteState that is raising it in my circumstances, but I see the timeout is hardcoded to 5 seconds.
The operator will first try to create a Secret named operator and then read and write its state there. If it hasn't been able to create it, you could try to precreate it (kubectl create secret generic operator -n tailscale) and see if the following operations succeed.
We might be able to increase the timeout for our kube state client or make it modifiable. I'll need to first take a look at all the other places where we talk to the kube API server to see what the behaviour is there (mostly the c/r client).
I'll keep digging for reasons for slow API performance on EKS in the meantime.
Perhaps some webhook that's intercepting Secret creation calls? Unless all API calls take that long for that cluster.
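e.g. something along these lines would show which resources and operations each webhook intercepts (the custom-columns expression is just one way to surface the rules):
$ kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations
$ kubectl get mutatingwebhookconfigurations -o custom-columns='NAME:.metadata.name,OPERATIONS:.webhooks[*].rules[*].operations,RESOURCES:.webhooks[*].rules[*].resources'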
Thanks @irbekrm. I checked for MutatingWebhookConfigurations in the cluster and there are those that come from AWS out of the box, plus their CloudWatch Observability hook, but none have secret scope. It seems that kubectl get secrets is pretty snappy (definitely less than 5 seconds) but any mutating method (PATCH, POST) is slow. It's a multi-AZ cluster, so maybe etcd has some speed issues there, but I don't see much evidence of that in the wild.
I tried to create the generic secret first, but sadly it hits the same error, albeit timing out on the PATCH operation.
{
"level": "fatal",
"ts": "2024-08-10T00:59:57Z",
"logger": "startup",
"msg": "starting tailscale server: tsnet: starting backend: initMachineKeyLocked: error patching Secret operator with a /data field: Patch \"https://kubernetes.default.svc/api/v1/namespaces/tailscale/secrets/operator\": context deadline exceeded",
"stacktrace": "main.initTSNet\n\ttailscale.com/cmd/k8s-operator/operator.go:156\nmain.main\n\ttailscale.com/cmd/k8s-operator/operator.go:94\nruntime.main\n\truntime/proc.go:271"
}
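(For reference, the failing PATCH can be timed from the Pod the same way as the earlier POST; this is only a rough equivalent of the operator's call, not necessarily the exact patch type it sends, and the data key/value here are arbitrary.)
time curl -k -X PATCH "https://kubernetes.default.svc/api/v1/namespaces/tailscale/secrets/operator" \
  -H "Authorization: Bearer $(cat /run/secrets/kubernetes.io/serviceaccount/token)" \
  -H "Content-Type: application/strategic-merge-patch+json" \
  -d '{"data":{"foo":"YmFy"}}'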
I'll keep digging but thanks for the help.
Hi @irbekrm, just an update: I haven't been able to find any obvious cause for the Kube API calls being so slow for mutations on EKS. I'll leave this issue open for reference in case someone else hits the same issue.
@jezpac thanks for testing and getting back to us.
We'll take a look at removing the hardcoded timeout so it works in cases like yours.
What is the issue?
On a managed EKS cluster, installing the Tailscale operator succeeds; however, the operator deployment goes into a CrashLoopBackOff as it fails to write the machine key to the operator Secret via the kube API. Here is the output of the operator pod:
Steps to reproduce
kubectl logs deployment/operator -n tailscale
Are there any recent changes that introduced the issue?
No response
OS
Linux
OS version
No response
Tailscale version
Helm Chart Version 1.70.0
Other software
No response
Bug report
No response