telepresenceio / telepresence

Local development against a remote Kubernetes or OpenShift cluster
https://www.telepresence.io

telepresence intercept timed out [v2.6.6] #2654

Closed: iyezhou closed this issue 9 months ago

iyezhou commented 2 years ago

Hi, I followed this page: https://www.getambassador.io/resources/easy-debugging-of-spring-boot-microservices/

but got a timeout error:

# telepresence status
Root Daemon: Running
  Version   : v2.6.6 (api 3)
  DNS       :
    Local IP        : 192.168.170.21
    Remote IP       : <nil>
    Exclude suffixes: [.arpa .com .io .net .org .ru]
    Include suffixes: []
    Timeout         : 8s
  Also Proxy : (0 subnets)
  Never Proxy: (0 subnets)
User Daemon: Running
  Version         : v2.6.6 (api 3)
  Executable      : /usr/bin/telepresence
  Install ID      : a019880e-d361-42e2-9647-a534b6cd6073
  Ambassador Cloud:
    Status    : Logged out
    User ID   : 
    Account ID: 
    Email     : 
  Status            : Connected
  Kubernetes server : https://llgp.com:6443
  Kubernetes context: kubernetes-admin@kubernetes
  Intercepts        : 0 total

[root@TEST-AS-2009-V1066 ~]# telepresence list
dataprocessingservice: ready to intercept (traffic-agent not yet installed)
verylargedatastore   : ready to intercept (traffic-agent not yet installed)
verylargejavaservice : ready to intercept (traffic-agent not yet installed)

[root@TEST-AS-2009-V1066 ~]# curl http://dataprocessingservice.default.svc.cluster.local:3000/color
curl: (7) Failed connect to dataprocessingservice.default.svc.cluster.local:3000; Connection timed out

[root@TEST-AS-2009-V1066 ~]# 
[root@TEST-AS-2009-V1066 ~]# telepresence intercept dataprocessingservice --port 3000
telepresence: error: rpc error: code = DeadlineExceeded desc = request timed out while waiting for agent dataprocessingservice.default to arrive

See logs for details (2 errors found): "/root/.cache/telepresence/logs/daemon.log"

See logs for details (1 error found): "/root/.cache/telepresence/logs/connector.log"
If you think you have encountered a bug, please run `telepresence gather-logs` and attach the telepresence_logs.zip to your github issue or create a new one: https://github.com/telepresenceio/telepresence/issues/new?template=Bug_report.md .
[root@TEST-AS-2009-V1066 ~]# 
njayp commented 2 years ago

can you run telepresence gather-logs and post the zip folder here?
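For reference, a minimal way to do that (the zip lands in the current directory by default, per the CLI's own hint above; the unzip step is just a quick way to see what was collected):

telepresence gather-logs
unzip -l telepresence_logs.zip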

DanielLiu1123 commented 2 years ago

v2.6.8, same problem

# telepresence status
Root Daemon: Running
  Version   : v2.6.8 (api 3)
  DNS       :
    Remote IP       : 127.0.0.1
    Exclude suffixes: [.arpa .com .io .net .org .ru]
    Include suffixes: []
    Timeout         : 8s
  Also Proxy : (0 subnets)
  Never Proxy: (1 subnets)
    - 127.0.0.1/32
User Daemon: Running
  Version         : v2.6.8 (api 3)
  Executable      : /Users/macbook/Library/Application Support/telepresence/telepresence-pro
  Install ID      : bd001aca-95b9-4661-8b2d-cc8a85c2f529
  Ambassador Cloud:
    Status    : Logged in
    User ID   : 
    Account ID: 
    Email     : 
  Status            : Connected
  Kubernetes server : https://kubernetes.docker.internal:6443
  Kubernetes context: docker-desktop
  Intercepts        : 0 total

# telepresence list
config-watcher: ready to intercept (traffic-agent not yet installed)

# telepresence intercept config-watcher --port 8080
telepresence: error: rpc error: code = DeadlineExceeded desc = request timed out while waiting for agent config-watcher.default to arrive

See logs for details (1 error found): "/Users/macbook/Library/Logs/telepresence/connector.log"
If you think you have encountered a bug, please run `telepresence gather-logs` and attach the telepresence_logs.zip to your github issue or create a new one: https://github.com/telepresenceio/telepresence/issues/new?template=Bug_report.md .

telepresence_logs.zip

Thegsgs commented 2 years ago

I'm having the exact same problem.

MichaelDuda-Sendoso commented 2 years ago

I am evaluating Telepresence for our team and am also running into the same issue on both a local cluster (Docker for Desktop on macOS) and an AWS EKS cluster. To investigate further on the local cluster, I tried downgrading to version 2.5.8 (both client and TM), and with that I WAS able to start an intercept successfully. I also tried several older 2.6.* versions, though, and all of them failed with the timeout error. (Unfortunately, I don't have the permissions to downgrade the TM on our EKS cluster myself, so I won't be able to try that right away; trying to connect to the 2.6.8 TM with the 2.5.8 client resulted in an expected error.) Hope this helps.
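For anyone comparing versions, a quick sketch to check for client/traffic-manager skew (the ambassador namespace and traffic-manager deployment name are the usual defaults and may differ in your install):

telepresence version
kubectl -n ambassador get deploy traffic-manager \
  -o jsonpath='{.spec.template.spec.containers[0].image}'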

raphaelreyna commented 2 years ago

I can confirm I also came across this issue; upgrading to 2.6.8 fixed it for me.

singhals commented 2 years ago

FWIW I'm seeing the same issue and downgraded to 2.5.8 from 2.6.8.

With 2.6.8, there are two log entries that may be pertinent:

When I describe the intercepted pod, I see the traffic-agent sidecar failing to become ready because of this:

Readiness probe failed: stat: can't stat '/tmp/agent/ready': No such file or directory

It was hard to tell whether this was an issue with the upstream Docker image or not.
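A rough way to see which agent image and readiness probe the webhook actually injected (pod name and namespace below are placeholders):

kubectl -n <namespace> get pod <intercepted-pod> \
  -o jsonpath='{.spec.containers[?(@.name=="traffic-agent")].image}'
kubectl -n <namespace> get pod <intercepted-pod> \
  -o jsonpath='{.spec.containers[?(@.name=="traffic-agent")].readinessProbe}'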

Secondly, the traffic-manager pod is spamming its logs with the warning below, which also seems to be called out in https://github.com/telepresenceio/telepresence/issues/2669. This doesn't seem to be a blocker, though.

2022-06-28 08:19:12.4948 warning agent-configs : Issuing a systema request without ApiKey or InstallID may result in an error
omdeshpande commented 2 years ago

I'm still seeing this error with version 2.7.3. Has anyone found a solution?

hzhangse commented 2 years ago

I'm still seeing this error with version 2.7.3. Has anyone found a solution?

Me too. How can this problem be resolved?

japareja commented 2 years ago

Still the same problem here on a Windows machine with v2.8.5; it connects to the k8s cluster successfully.

zbdba commented 2 years ago

Still the same problem here on a Mac M1 with v2.9.2; it connects to the k8s cluster successfully, and the client and TM are the same version.

yacoubb commented 1 year ago

Same problem as @zbdba with telepresence 2.9.5

gurpalw commented 1 year ago

EKS v1.21

telepresence version
Enhanced Client: v2.9.5
Root Daemon    : v2.9.5
User Daemon    : v2.9.5
Traffic Manager: v2.9.5
 telepresence intercept echo-server --port 8000:80 --namespace dev
telepresence intercept: error: connector.CreateIntercept: request timed out while waiting for agent echo-server.dev to arrive

When I describe the intercepted pod, I can see the traffic-agent container is failing its healthcheck.

Readiness probe failed: stat: can't stat '/tmp/agent/ready': No such file or directory
thallgren commented 1 year ago

That failure is normal and usually fixes itself. What do you get from kubectl get events when this happens?
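Something along these lines should pull the relevant events (namespace taken from the report above; replace the pod name with your intercepted pod):

kubectl -n dev get events --sort-by=.lastTimestamp \
  --field-selector involvedObject.name=<intercepted-pod> | tail -n 20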

gurpalw commented 1 year ago

@thallgren

kubectl describe pod echo-server-1 -ndev | grep Events -A10                     
Events:
  Type     Reason     Age                    From     Message
  ----     ------     ----                   ----     -------
  Warning  Unhealthy  46m (x193 over 3h)     kubelet  Readiness probe failed: Get "http://100.69.10.43:8000/health": dial tcp 100.69.10.43:8000: connect: connection refused
  Warning  BackOff    11m (x501 over 170m)   kubelet  Back-off restarting failed container
  Warning  Unhealthy  6m7s (x96 over 3h)     kubelet  Liveness probe failed: Get "http://100.69.10.43:8000/health": dial tcp 100.69.10.43:8000: connect: connection refused
  Warning  Unhealthy  67s (x1081 over 3h1m)  kubelet  Readiness probe failed: stat: can't stat '/tmp/agent/ready': No such file or directory
thallgren commented 1 year ago

What is this?

  Warning  Unhealthy  46m (x193 over 3h)     kubelet  Readiness probe failed: Get "http://100.69.10.43:8000/health": dial tcp 100.69.10.43:8000: connect: connection refused
  Warning  BackOff    11m (x501 over 170m)   kubelet  Back-off restarting failed container
  Warning  Unhealthy  6m7s (x96 over 3h)     kubelet  Liveness probe failed: Get "http://100.69.10.43:8000/health": dial tcp 100.69.10.43:8000: connect: connection refused
gurpalw commented 1 year ago

@thallgren Before I start an intercept, my pod (with one container) passes its healthchecks and is running. See the manual test below, hitting the pod's service endpoints from another pod in the cluster.

root@testpod:/# curl echo-server.dev.svc.cluster.local/health
{"status":"ok"}
root@testpod:/# curl echo-server.dev.svc.cluster.local
echo-server is up!

After I run telepresence connect, I can reach the same endpoints from my MacBook with no issues.

~  curl echo-server.dev.svc.cluster.local/health
{"status":"ok"}

However, after I run telepresence intercept echo-server --port 8000:80 --namespace dev, the above healthchecks immediately start failing.


curl: (7) Failed to connect to echo-server.dev.svc.cluster.local port 80 after 1040 ms: Connection refused

and telepresence intercept fails:

telepresence intercept: error: connector.CreateIntercept: request timed out while waiting for agent echo-server.dev to arrive

In light of the above, I created a much simpler pod, without healthchecks in its container, and attempted to intercept its service. This time I get a slightly different error, but it's essentially the same thing.

telepresence intercept web --namespace dev
telepresence intercept: error: connector.CreateIntercept: request timed out while waiting for agent web.dev to arrive: Events that may be relevant:
AGE   TYPE      REASON      OBJECT                     MESSAGE
21s   Warning   Unhealthy   pod/web-78ffbccc6c-t7f8b   Readiness probe failed: stat: can't stat '/tmp/agent/ready': No such file or directory

21s   Warning   Unhealthy   pod/web-78ffbccc6c-t7f8b   Readiness probe failed: stat: can't stat '/tmp/agent/ready': No such file or directory

21s   Warning   Unhealthy   pod/web-78ffbccc6c-t7f8b   Readiness probe failed: stat: can't stat '/tmp/agent/ready': No such file or directory

Here are the traffic-manager logs:

traffic-manager-579cf968df-mbb9k traffic-manager 2022-12-19 18:34:38.3054 debug   httpd/conn=127.0.0.1:8081 : PrepareIntercept called : session_id="1eda010a-ad3b-4d61-b577-5c6fc5f3b916" (from cmd/traffic/cmd/manager/service.go:461)
traffic-manager-579cf968df-mbb9k traffic-manager 2022-12-19 18:34:38.3325 debug   agent-configs : MODIFIED telepresence-agents.dev (from cmd/traffic/cmd/manager/internal/mutator/watcher.go:549)
traffic-manager-579cf968df-mbb9k traffic-manager 2022-12-19 18:34:38.3326 debug   agent-configs : add web.dev (from cmd/traffic/cmd/manager/internal/mutator/watcher.go:266)
traffic-manager-579cf968df-mbb9k traffic-manager 2022-12-19 18:34:38.3411 info    agent-configs : Successfully rolled out web.dev (from cmd/traffic/cmd/manager/internal/mutator/watcher.go:93)
traffic-manager-579cf968df-mbb9k traffic-manager 2022-12-19 18:34:38.3645 debug   agent-injector/conn=100.69.103.43:443 : Received webhook request... (from cmd/traffic/cmd/manager/internal/mutator/service.go:78)
traffic-manager-579cf968df-mbb9k traffic-manager 2022-12-19 18:34:38.3656 debug   agent-injector/conn=100.69.103.43:443 : Handling admission request CREATE web-cb5b749df-.dev (from cmd/traffic/cmd/manager/internal/mutator/agent_injector.go:92)
traffic-manager-579cf968df-mbb9k traffic-manager 2022-12-19 18:34:38.3844 info    agent-injector/conn=100.69.103.43:443 : Injecting traffic-agent into pod web-cb5b749df-.dev (from cmd/traffic/cmd/manager/internal/mutator/agent_injector.go:173)
traffic-manager-579cf968df-mbb9k traffic-manager 2022-12-19 18:34:38.3850 info    agent-injector/conn=100.69.103.43:443 : Injecting 7 patches into pod web-cb5b749df-.dev (from cmd/traffic/cmd/manager/internal/mutator/agent_injector.go:190)
traffic-manager-579cf968df-mbb9k traffic-manager 2022-12-19 18:34:38.3867 debug   agent-injector/conn=100.69.103.43:443 : Webhook request handled successfully (from cmd/traffic/cmd/manager/internal/mutator/service.go:85)
traffic-manager-579cf968df-mbb9k traffic-manager 2022-12-19 18:34:43.0901 error   httpd/conn=127.0.0.1:8081 : Readiness probe failed: stat: can't stat '/tmp/agent/ready': No such file or directory
traffic-manager-579cf968df-mbb9k traffic-manager  : session_id="1eda010a-ad3b-4d61-b577-5c6fc5f3b916" (from cmd/traffic/cmd/manager/internal/state/intercept.go:375)
traffic-manager-579cf968df-mbb9k traffic-manager 2022-12-19 18:34:48.5142 error   httpd/conn=127.0.0.1:8081 : Readiness probe failed: stat: can't stat '/tmp/agent/ready': No such file or directory
traffic-manager-579cf968df-mbb9k traffic-manager  : session_id="1eda010a-ad3b-4d61-b577-5c6fc5f3b916" (from cmd/traffic/cmd/manager/internal/state/intercept.go:375)
traffic-manager-579cf968df-mbb9k traffic-manager 2022-12-19 18:34:58.0837 debug   agent-injector/conn=100.69.103.43:443 : Received webhook request... (from cmd/traffic/cmd/manager/internal/mutator/service.go:78)
traffic-manager-579cf968df-mbb9k traffic-manager 2022-12-19 18:34:58.0843 debug   agent-injector/conn=100.69.103.43:443 : Handling admission request CREATE cron-argo-security-group-manager-27857910-.tools (from cmd/traffic/cmd/manager/internal/mutator/agent_injector.go:92)
traffic-manager-579cf968df-mbb9k traffic-manager 2022-12-19 18:34:58.0844 debug   agent-injector/conn=100.69.103.43:443 : The cron-argo-security-group-manager-27857910-.tools pod has not enabled traffic-agent container injection through "telepresence-agents" configmap or through "telepresence.getambassador.io/inject-traffic-agent" annotation; skipping (from cmd/traffic/cmd/manager/internal/mutator/agent_injector.go:133)
traffic-manager-579cf968df-mbb9k traffic-manager 2022-12-19 18:34:58.0845 debug   agent-injector/conn=100.69.103.43:443 : Webhook request handled successfully (from cmd/traffic/cmd/manager/internal/mutator/service.go:85)
traffic-manager-579cf968df-mbb9k traffic-manager 2022-12-19 18:34:58.5375 error   httpd/conn=127.0.0.1:8081 : Readiness probe failed: stat: can't stat '/tmp/agent/ready': No such file or directory
traffic-manager-579cf968df-mbb9k traffic-manager  : session_id="1eda010a-ad3b-4d61-b577-5c6fc5f3b916" (from cmd/traffic/cmd/manager/internal/state/intercept.go:375)
traffic-manager-579cf968df-mbb9k traffic-manager 2022-12-19 18:35:00.1262 debug   agent-injector/conn=100.69.103.43:443 : Received webhook request... (from cmd/traffic/cmd/manager/internal/mutator/service.go:78)
traffic-manager-579cf968df-mbb9k traffic-manager 2022-12-19 18:35:00.1290 debug   agent-injector/conn=100.69.103.43:443 : Handling admission request CREATE cron-argo-security-group-manager-27857915-.tools (from cmd/traffic/cmd/manager/internal/mutator/agent_injector.go:92)
traffic-manager-579cf968df-mbb9k traffic-manager 2022-12-19 18:35:00.1291 debug   agent-injector/conn=100.69.103.43:443 : The cron-argo-security-group-manager-27857915-.tools pod has not enabled traffic-agent container injection through "telepresence-agents" configmap or through "telepresence.getambassador.io/inject-traffic-agent" annotation; skipping (from cmd/traffic/cmd/manager/internal/mutator/agent_injector.go:133)
traffic-manager-579cf968df-mbb9k traffic-manager 2022-12-19 18:35:00.1292 debug   agent-injector/conn=100.69.103.43:443 : Webhook request handled successfully (from cmd/traffic/cmd/manager/internal/mutator/service.go:85)
traffic-manager-579cf968df-mbb9k traffic-manager 2022-12-19 18:35:10.8603 debug   podWatcher calling updateSubnets with [100.69.0.0/17 172.18.0.0/24] (from cmd/traffic/cmd/manager/internal/cluster/podwatcher.go:92)
traffic-manager-579cf968df-mbb9k traffic-manager 2022-12-19 18:35:14.2924 debug   agent-injector/conn=100.69.103.43:443 : Received webhook request... (from cmd/traffic/cmd/manager/internal/mutator/service.go:78)
traffic-manager-579cf968df-mbb9k traffic-manager 2022-12-19 18:35:14.2936 debug   agent-injector/conn=100.69.103.43:443 : Handling admission request CREATE cron-argo-security-group-manager-27857915-.tools (from cmd/traffic/cmd/manager/internal/mutator/agent_injector.go:92)
traffic-manager-579cf968df-mbb9k traffic-manager 2022-12-19 18:35:14.2940 debug   agent-injector/conn=100.69.103.43:443 : The cron-argo-security-group-manager-27857915-.tools pod has not enabled traffic-agent container injection through "telepresence-agents" configmap or through "telepresence.getambassador.io/inject-traffic-agent" annotation; skipping (from cmd/traffic/cmd/manager/internal/mutator/agent_injector.go:133)
traffic-manager-579cf968df-mbb9k traffic-manager 2022-12-19 18:35:14.2941 debug   agent-injector/conn=100.69.103.43:443 : Webhook request handled successfully (from cmd/traffic/cmd/manager/internal/mutator/service.go:85)
traffic-manager-579cf968df-mbb9k traffic-manager 2022-12-19 18:35:23.0515 debug   agent-injector/conn=100.69.103.43:443 : Received webhook request... (from cmd/traffic/cmd/manager/internal/mutator/service.go:78)
traffic-manager-579cf968df-mbb9k traffic-manager 2022-12-19 18:35:23.0522 debug   agent-injector/conn=100.69.103.43:443 : Handling admission request CREATE cron-argo-security-group-manager-27857905-.tools (from cmd/traffic/cmd/manager/internal/mutator/agent_injector.go:92)
traffic-manager-579cf968df-mbb9k traffic-manager 2022-12-19 18:35:23.0523 debug   agent-injector/conn=100.69.103.43:443 : The cron-argo-security-group-manager-27857905-.tools pod has not enabled traffic-agent container injection through "telepresence-agents" configmap or through "telepresence.getambassador.io/inject-traffic-agent" annotation; skipping (from cmd/traffic/cmd/manager/internal/mutator/agent_injector.go:133)
traffic-manager-579cf968df-mbb9k traffic-manager 2022-12-19 18:35:23.0524 debug   agent-injector/conn=100.69.103.43:443 : Webhook request handled successfully (from cmd/traffic/cmd/manager/internal/mutator/service.go:85)
traffic-manager-579cf968df-mbb9k traffic-manager 2022-12-19 18:35:38.5020 debug   agent-injector/conn=100.69.103.43:443 : Received webhook request... (from cmd/traffic/cmd/manager/internal/mutator/service.go:78)
traffic-manager-579cf968df-mbb9k traffic-manager 2022-12-19 18:35:38.5032 debug   agent-injector/conn=100.69.103.43:443 : Handling admission request CREATE cron-argo-security-group-manager-27857915-.tools (from cmd/traffic/cmd/manager/internal/mutator/agent_injector.go:92)
traffic-manager-579cf968df-mbb9k traffic-manager 2022-12-19 18:35:38.5034 debug   agent-injector/conn=100.69.103.43:443 : The cron-argo-security-group-manager-27857915-.tools pod has not enabled traffic-agent container injection through "telepresence-agents" configmap or through "telepresence.getambassador.io/inject-traffic-agent" annotation; skipping (from cmd/traffic/cmd/manager/internal/mutator/agent_injector.go:133)
traffic-manager-579cf968df-mbb9k traffic-manager 2022-12-19 18:35:38.5035 debug   agent-injector/conn=100.69.103.43:443 : Webhook request handled successfully (from cmd/traffic/cmd/manager/internal/mutator/service.go:85)

After this, trying to run telepresence leave web results in no output and the pod keeps the injected container indefinitely.

Here's the result of curling my pod's exposed port (it should respond with the pod name), followed by trying the same on the new pod that gets created with the injected container after I attempted to intercept.

root@test:/# curl 100.69.102.162:8000
web-5fbfc5b9fd-2clgk
root@gurps:/# curl 100.69.90.18:8000
curl: (7) Failed to connect to 100.69.90.18 port 8000 after 0 ms: Connection refused

Events from the new pod:

Events:
  Type     Reason     Age               From               Message
  ----     ------     ----              ----               -------
  Normal   Scheduled  76s               default-scheduler  Successfully assigned dev/web-cb5b749df-58xqr to ip-172-18-0-152.eu-west-1.compute.internal
  Normal   Pulled     75s               kubelet            Container image "docker.io/datawire/ambassador-telepresence-agent:1.13.5" already present on machine
  Normal   Created    75s               kubelet            Created container tel-agent-init
  Normal   Started    75s               kubelet            Started container tel-agent-init
  Normal   Pulling    74s               kubelet            Pulling image "x.dkr.ecr.eu-west-1.amazonaws.com/httpserver:latest"
  Normal   Pulled     74s               kubelet            Successfully pulled image "xx.dkr.ecr.eu-west-1.amazonaws.com/httpserver:latest" in 112.250343ms
  Normal   Created    74s               kubelet            Created container web-server
  Normal   Started    73s               kubelet            Started container web-server
  Normal   Pulled     73s               kubelet            Container image "docker.io/datawire/ambassador-telepresence-agent:1.13.5" already present on machine
  Normal   Created    73s               kubelet            Created container traffic-agent
  Normal   Started    73s               kubelet            Started container traffic-agent
  Warning  Unhealthy  6s (x8 over 71s)  kubelet            Readiness probe failed: stat: can't stat '/tmp/agent/ready': No such file or directory
thallgren commented 1 year ago

Does the log from the traffic-agent show anything? E.g. kubectl -n dev logs web-78ffbccc6c-t7f8b traffic-agent.

I see that a tel-agent-init container is used. That's often unnecessary and will cause problems in some setups. Try using symbolic targetPorts in your services if possible. You can read more about that here.

gurpalw commented 1 year ago

@thallgren thanks for your reply.

For further clarity, here is my YAML:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
  namespace: dev
spec:
  replicas: 1
  selector:
    matchLabels:
      server: web-server
  template:
    metadata:
      labels:
        server: web-server
    spec:
      containers:
      - name: web-server
        image: xxxx
        imagePullPolicy: Always
        ports:
        - name: http
          containerPort: 8000
---
apiVersion: v1
kind: Service
metadata:
  name: web
  namespace: dev
spec:
  selector:
    server: web-server
  ports:
  - port: 8000
    targetPort: http
  type: ClusterIP

telepresence intercept web --port 8000 --namespace=dev

telepresence intercept: error: connector.CreateIntercept: request timed out while waiting for agent web.dev to arrive: Events that may be relevant:
AGE   TYPE      REASON      OBJECT                     MESSAGE
28s   Warning   Unhealthy   pod/web-68d75fbcdf-dh8cj   Readiness probe failed: stat: can't stat '/tmp/agent/ready': No such file or directory

28s   Warning   Unhealthy   pod/web-68d75fbcdf-dh8cj   Readiness probe failed: stat: can't stat '/tmp/agent/ready': No such file or directory

28s   Warning   Unhealthy   pod/web-68d75fbcdf-dh8cj   Readiness probe failed: stat: can't stat '/tmp/agent/ready': No such file or directory

traffic-manager logs

traffic-manager-579cf968df-mbb9k traffic-manager 2022-12-20 11:09:30.4315 info    agent-configs : Successfully rolled out web.dev
traffic-manager-579cf968df-mbb9k traffic-manager 2022-12-20 11:09:30.4693 info    agent-injector/conn=100.69.103.43:443 : Injecting traffic-agent into pod web-68d75fbcdf-.dev
traffic-manager-579cf968df-mbb9k traffic-manager 2022-12-20 11:09:30.4694 info    agent-injector/conn=100.69.103.43:443 : Injecting 7 patches into pod web-68d75fbcdf-.dev
traffic-manager-579cf968df-mbb9k traffic-manager 2022-12-20 11:09:32.2441 error   httpd/conn=127.0.0.1:8081 : Readiness probe failed: stat: can't stat '/tmp/agent/ready': No such file or directory
traffic-manager-579cf968df-mbb9k traffic-manager  : session_id="1xxxxx"
traffic-manager-579cf968df-mbb9k traffic-manager 2022-12-20 11:09:40.5740 error   httpd/conn=127.0.0.1:8081 : Readiness probe failed: stat: can't stat '/tmp/agent/ready': No such file or directory
traffic-manager-579cf968df-mbb9k traffic-manager  : session_id="1xxxxx"
traffic-manager-579cf968df-mbb9k traffic-manager 2022-12-20 11:09:50.5823 error   httpd/conn=127.0.0.1:8081 : Readiness probe failed: stat: can't stat '/tmp/agent/ready': No such file or directory
traffic-manager-579cf968df-mbb9k traffic-manager  : session_id="1xxxxx"

traffic-agent logs

stern web-68d75fbcdf-dh8cj
+ web-68d75fbcdf-dh8cj › web-server
+ web-68d75fbcdf-dh8cj › traffic-agent
web-68d75fbcdf-dh8cj traffic-agent 2022-12-20 11:09:31.5328 info    Enhanced Traffic Agent v1.13.5
web-68d75fbcdf-dh8cj traffic-agent 2022-12-20 11:09:31.5331 info    Adding exported mounts for container web-server
web-68d75fbcdf-dh8cj traffic-agent 2022-12-20 11:09:31.5332 info    new agent secrets mount path: /var/run/secrets/kubernetes.io
web-68d75fbcdf-dh8cj traffic-agent 2022-12-20 11:09:31.5333 info    LoadConfig: processing environment...
web-68d75fbcdf-dh8cj traffic-agent 2022-12-20 11:09:31.5333 info    LoadConfig: processing annotations...
web-68d75fbcdf-dh8cj traffic-agent 2022-12-20 11:09:31.5333 info    LoadConfig: done
web-68d75fbcdf-dh8cj traffic-agent 2022-12-20 11:09:31.5366 info    LoadConfig: asking traffic-manager for license information...
thallgren commented 1 year ago

@gurpalw here are two things to try:

  1. Ensure that the traffic-agent has debug logging enabled right from the start. This can be done using:
    $ telepresence helm install --upgrade --set agent.logLevel=debug
  2. It's weird that you get a tel-agent-init container when you're using a symbolic targetPort. You should only get that when the service is headless (it isn't) or when the targetPort is numeric (it isn't). Could it be that you have another service that also maps to this deployment? You can check this by viewing the content of the telepresence-agents configmap in the same namespace as the deployment (see the sketch after this list for a quick way to list all services and their selectors):
    $ kubectl describe configmap telepresence-agents
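A one-liner that may help with point 2: list every Service in the namespace together with its selector and targetPorts, to spot a second Service selecting the same Deployment (the dev namespace is taken from the example above):

kubectl -n dev get svc \
  -o custom-columns='NAME:.metadata.name,SELECTOR:.spec.selector,TARGETPORTS:.spec.ports[*].targetPort'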
gurpalw commented 1 year ago

@thallgren here are the traffic-agent logs with debug enabled:

telepresence intercept web --port 8000 --namespace=dev
telepresence intercept: error: connector.CreateIntercept: request timed out while waiting for agent web.dev to arrive: Events that may be relevant:
AGE   TYPE      REASON      OBJECT                    MESSAGE
13s   Warning   Unhealthy   pod/web-65656f565-6qn2f   Readiness probe failed: stat: can't stat '/tmp/agent/ready': No such file or directory

13s   Warning   Unhealthy   pod/web-65656f565-6qn2f   Readiness probe failed: stat: can't stat '/tmp/agent/ready': No such file or directory
+ web-65656f565-6qn2f › traffic-agent
web-65656f565-6qn2f traffic-agent 2022-12-20 12:53:26.0056 info    Enhanced Traffic Agent v1.13.5
web-65656f565-6qn2f traffic-agent 2022-12-20 12:53:26.0060 debug   Logging at this level "debug"
web-65656f565-6qn2f traffic-agent 2022-12-20 12:53:26.0060 info    Adding exported mounts for container web-server
web-65656f565-6qn2f traffic-agent 2022-12-20 12:53:26.0060 debug   checking agent secrets mount path: /var/run/secrets/kubernetes.io
web-65656f565-6qn2f traffic-agent 2022-12-20 12:53:26.0061 debug   checking appmounts directory: /var/run/secrets/kubernetes.io
web-65656f565-6qn2f traffic-agent 2022-12-20 12:53:26.0061 debug   create appmounts directory: /tel_app_exports/web-server/var/run/secrets/kubernetes.io
web-65656f565-6qn2f traffic-agent 2022-12-20 12:53:26.0061 debug   create appmounts symlink: /var/run/secrets/kubernetes.io /tel_app_exports/web-server/var/run/secrets/kubernetes.io
web-65656f565-6qn2f traffic-agent 2022-12-20 12:53:26.0061 info    new agent secrets mount path: /var/run/secrets/kubernetes.io
web-65656f565-6qn2f traffic-agent 2022-12-20 12:53:26.0061 info    LoadConfig: processing environment...
web-65656f565-6qn2f traffic-agent 2022-12-20 12:53:26.0062 info    LoadConfig: processing annotations...
web-65656f565-6qn2f traffic-agent 2022-12-20 12:53:26.0062 info    LoadConfig: done
web-65656f565-6qn2f traffic-agent 2022-12-20 12:53:26.0096 info    LoadConfig: asking traffic-manager for license information...

The logs stop there.

Here is the output of the configmap:

kubectl describe configmap telepresence-agents -ndev
Name:         telepresence-agents
Namespace:    dev
Labels:       app.kubernetes.io/created-by=traffic-manager
              app.kubernetes.io/name=telepresence-agents
              app.kubernetes.io/version=2.9.5
Annotations:  <none>

Data
====
web:
----
EnvoyAdminPort: 19000
EnvoyLogLevel: warning
EnvoyServerPort: 18000
agentImage: docker.io/datawire/ambassador-telepresence-agent:1.13.5
agentName: web
containers:
- Mounts: null
  envPrefix: A_
  intercepts:
  - agentPort: 9900
    containerPort: 8000
    containerPortName: http
    protocol: TCP
    serviceName: web
    servicePort: 8000
    serviceUID: 4023d7fc-4dcb-466f-b0bf-e26523ddd964
  mountPoint: /tel_app_mounts/web-server
  name: web-server
logLevel: debug
managerHost: traffic-manager.ambassador
managerPort: 8081
namespace: dev
tracingPort: 15766
workloadKind: Deployment
workloadName: web

BinaryData
====

Events:  <none>
thallgren commented 1 year ago

@gurpalw is that the configmap you're using when you get a tel-agent-init container injected?

thallgren commented 1 year ago

The log proves that the traffic-agent is unable to contact the traffic-manager at traffic-manager.ambassador:8081. The connection attempt is hanging indefinitely. I know this because any error from that attempt would be logged, and a success would result in the next log message which is "LoadConfig: asking traffic-manager for Ambassador Cloud configuration...". It never shows up.

gurpalw commented 1 year ago

Here is the traffic-manager service description:

Name:              traffic-manager
Namespace:         ambassador
Labels:            app=traffic-manager
                   app.kubernetes.io/created-by=Helm
                   app.kubernetes.io/managed-by=Helm
                   app.kubernetes.io/version=2.9.5
                   helm.sh/chart=telepresence-2.9.5
                   telepresence=manager
Annotations:       meta.helm.sh/release-name: traffic-manager
                   meta.helm.sh/release-namespace: ambassador
Selector:          app=traffic-manager,telepresence=manager
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                None
IPs:               None
Port:              api  8081/TCP
TargetPort:        api/TCP
Endpoints:         100.69.107.175:8081
Port:              grpc-trace  15766/TCP
TargetPort:        grpc-trace/TCP
Endpoints:         100.69.107.175:15766
Session Affinity:  None
Events:            <none>

I have launched a test Ubuntu pod in the target namespace (dev), and I can reach services via DNS in other namespaces:

root@gurps:/# curl loki-distributed-querier.logging.svc.cluster.local:3100
404 page not found

But I cannot reach the traffic-manager service when curling its DNS name:

curl traffic-manager.ambassador.svc.cluster.local:8081
curl: (6) Could not resolve host: traffic-manager.ambassador.svc.cluster.local

However, I can reach the endpoint listed in the service description above when curling the IP directly:

root@gurps:/# curl 100.69.107.175:8081
Hello World from: /

Here I dig another service to prove DNS is working in the cluster:

root@gurps:/# dig loki-distributed-querier.logging.svc.cluster.local

; <<>> DiG 9.18.1-1ubuntu1.2-Ubuntu <<>> loki-distributed-querier.logging.svc.cluster.local
;; global options: +cmd
;; Got answer:
;; WARNING: .local is reserved for Multicast DNS
;; You are currently testing what happens when an mDNS query is leaked to DNS
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 16581
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: 20e9c56df32eca8f (echoed)
;; QUESTION SECTION:
;loki-distributed-querier.logging.svc.cluster.local. IN A

;; ANSWER SECTION:
loki-distributed-querier.logging.svc.cluster.local. 5 IN A 10.100.16.31

;; Query time: 0 msec
;; SERVER: 10.100.0.10#53(10.100.0.10) (UDP)
;; WHEN: Tue Dec 20 14:42:42 UTC 2022
;; MSG SIZE  rcvd: 157

And here I do the same thing with the traffic-manager; you can see that the response has no answer section:

root@gurps:/# dig traffic-manager.ambassador.svc.cluster.local

; <<>> DiG 9.18.1-1ubuntu1.2-Ubuntu <<>> traffic-manager.ambassador.svc.cluster.local
;; global options: +cmd
;; Got answer:
;; WARNING: .local is reserved for Multicast DNS
;; You are currently testing what happens when an mDNS query is leaked to DNS
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 8464
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: c85dfbb0a6cb029e (echoed)
;; QUESTION SECTION:
;traffic-manager.ambassador.svc.cluster.local. IN A

;; AUTHORITY SECTION:
cluster.local.      5   IN  SOA ns.dns.cluster.local. hostmaster.cluster.local. 1671541000 7200 1800 86400 5

;; Query time: 0 msec
;; SERVER: 10.100.0.10#53(10.100.0.10) (UDP)
;; WHEN: Tue Dec 20 14:42:28 UTC 2022
;; MSG SIZE  rcvd: 178
thallgren commented 1 year ago

@gurpalw the answer section is provided by the DNS service, not by the actual service that you dig for. So the question to ask is: why doesn't the cluster's DNS service serve up the traffic-manager's IP? The traffic-manager itself plays no part in this.
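If it helps anyone who lands here later, a quick sanity check of the cluster DNS side (the labels assume the standard CoreDNS deployment in kube-system; adjust for your distro):

kubectl -n kube-system get pods -l k8s-app=kube-dns
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=50
# the DNS record is built from this Endpoints object, so it should exist and be populated
kubectl -n ambassador get endpoints traffic-manager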

gurpalw commented 1 year ago

@thallgren

Problem solved: my CoreDNS was broken! Fixed. Thanks @thallgren, and sorry for taking up so much of your time!

mwilniewiec commented 1 year ago

That's a lot of reading, but it gave me nothing :) I'm trying to set up v2.11.1 and get the same error; I didn't try older versions.

telepresence intercept dev-device-api --port 8080:8080 --env-file tele.env results in: connector.CreateIntercept: request timed out while waiting for agent dev-device-api.dev to arrive

Logs with log level trace: telepresence_logs.zip

mruell commented 1 year ago

Same here

This is what I see in the traffic-agent:

2023-03-10 16:08:13.1345 info    LoadConfig: processing environment...
2023-03-10 16:08:13.1345 info    LoadConfig: done
2023-03-10 16:08:13.1375 info    LoadConfig: asking traffic-manager for license information...
2023-03-10 16:08:15.5402 error   error(s) getting license: license not found
2023-03-10 16:08:15.5402 info    LoadConfig: asking traffic-manager for Ambassador Cloud configuration...
2023-03-10 16:08:15.5480 info    trying to connect to ambassador cloud at app.getambassador.io:443
2023-03-10 16:08:15.5520 info    connected to ambassador cloud at app.getambassador.io:443
2023-03-10 16:08:15.5529 info    ftp-server : FTP server listening on :40463
2023-03-10 16:08:15.5532 info    envoy-server : Management server listening on 18000
2023-03-10 16:08:15.5532 info    envoy : started command ["envoy" "-l" "warning" "-c" "bootstrap-ads.pb" "--base-id" "1"] : dexec.pid="24"
2023-03-10 16:08:15.5536 info    envoy :  : dexec.pid="24" dexec.stream="stdin" dexec.err="EOF"
2023-03-10 16:08:15.6045 info    envoy :  : dexec.pid="24" dexec.stream="stdout+stderr" dexec.data="[2023-03-10 16:08:15.604][24][critical][backtrace] [./source/server/backtrace.h:104] Caught Segmentation fault, suspect faulting address 0x7ffe22b37fe8
2023-03-10 16:08:15.6045 info    envoy :  : dexec.pid="24" dexec.stream="stdout+stderr" dexec.data="[2023-03-10 16:08:15.604][24][critical][backtrace] [./source/server/backtrace.h:91] Backtrace (use tools/stack_decode.py to get line numbers):\n"
2023-03-10 16:08:15.6045 info    envoy :  : dexec.pid="24" dexec.stream="stdout+stderr" dexec.data="[2023-03-10 16:08:15.604][24][critical][backtrace] [./source/server/backtrace.h:92] Envoy version: c5b7729c6ca304d1d8961790e8379358b6f1e233/1.23.4/Clea
2023-03-10 16:08:15.6046 info    envoy :  : dexec.pid="24" dexec.stream="stdout+stderr" dexec.data="[2023-03-10 16:08:15.604][24][critical][backtrace] [./source/server/backtrace.h:98] #0: [0x7f92eafb5acd]\n"
2023-03-10 16:08:17.4586 info    envoy : finished with error: signal: segmentation fault (core dumped) : dexec.pid="24"
2023-03-10 16:08:17.4586 error   envoy : goroutine "/envoy" exited with error: signal: segmentation fault (core dumped)
mruell commented 1 year ago

Hi @mwilniewiec

While we wait for a fix, I worked around the issue for now by downgrading the traffic-agent image. To do this, just configure the traffic manager with these values:

File: traffic-manager.yaml

agent:
  image:
    name: ambassador-telepresence-agent
    tag: 1.13.9

Apply via telepresence helm upgrade -f traffic-manager.yaml
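A rough way to confirm the injected sidecar actually picked up the downgraded image after re-intercepting (the dev namespace and server=web-server label come from gurpalw's example earlier in the thread; substitute your own):

kubectl -n dev get pods -l server=web-server \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[?(@.name=="traffic-agent")].image}{"\n"}{end}'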

mruell commented 1 year ago

Opened an issue for that: https://github.com/telepresenceio/telepresence/issues/3069

sambonbonne commented 1 year ago

IDK if this will help, but upgrading to Telepresence 2.16.1 (Kubernetes-side) seems to have helped for me. After the upgrade, I ran telepresence quit && telepresence connect before running my intercept, and it worked fine.
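Spelled out, the sequence is roughly this (a sketch only; the helm subcommand is the same one used earlier in this thread, and the workload name and port are placeholders):

telepresence helm upgrade        # upgrade the cluster-side traffic-manager to match the client
telepresence quit
telepresence connect
telepresence intercept <workload> --port 8080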

RobertFischer commented 1 year ago

I'm on 2.16.1 and seeing this issue. I created an intercept to a shell (i.e. passed -- /bin/bash) and then rebooted. Now I get this error when attempting to repeat the intercept. Quitting and connecting doesn't change anything. Logs attached: telepresence_logs.zip

vcellu commented 9 months ago

I'm seeing similar issues. Not much information provided as to why it is crashing.

telepresence_logs.zip

thallgren commented 9 months ago

I'm closing this issue because it's all over the place, including a "problem solved, my coredns was broken". @vcellu, if you experience a crash, I suggest you open a new ticket.

Some hints for you before you do that: