pixie-io / pixie

Instant Kubernetes-Native Application Observability
https://px.dev
Apache License 2.0

Cannot get valid information when running "px run px/namespaces" #369

Open zongjiangU opened 2 years ago

zongjiangU commented 2 years ago

Describe the bug
I get no data when running any script except the bpftrace ones. The vizier and agents appear to be healthy. Self-Hosted Pixie was installed with "px deploy --dev_cloud_namespace plc --deploy_olm=false".
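For reference, the CLI was pointed at the self-hosted cloud through PL_CLOUD_ADDR (this is what produces the "IN TESTING MODE" banner in the output below). A sketch of the setup, assuming the address was set as an environment variable:

# Point the CLI at the dev cloud (matches the banner in the CLI output below).
export PL_CLOUD_ADDR=dev.withpixie.dev
# Deploy Vizier against the self-hosted cloud in namespace plc, skipping the OLM install.
px deploy --dev_cloud_namespace plc --deploy_olm=false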

Screenshots

[root@vm183185 pixie_cli]# px run px/namespaces
Pixie CLI
*******************************
* IN TESTING MODE
*    PL_CLOUD_ADDR=dev.withpixie.dev
*******************************
Table ID: Namespaces
  NAMESPACE  POD COUNT  SERVICE COUNT  
Table ID: Process Stats Overview by Namespace
  NAMESPACE  AVG VSIZE  AVG RSS  ACTUAL DISK READ THROUGHPUT  ACTUAL DISK WRITE THROUGHPUT  TOTAL DISK READ THROUGHPUT  TOTAL DISK WRITE THROUGHPUT  

==>  Live UI: https://work.dev.withpixie.dev:443/live/clusters/kubernetes-admin@kubernetes?script=px%2Fnamespaces&start_time=-5m
[root@vm183185 pixie_cli]# px run px/agent_status
Pixie CLI
*******************************
* IN TESTING MODE
*    PL_CLOUD_ADDR=dev.withpixie.dev
*******************************
Table ID: output
  AGENT ID                              ASID  HOSTNAME                 IP ADDRESS          AGENT STATE          CREATE TIME                              LAST HEARTBEAT NS  
  172d7fa8-f824-4cb1-89eb-dcdbfd952756  3     vm183184                                     AGENT_STATE_HEALTHY  2021-12-01 01:53:11.10538375 +0000 GMT   2390995972         
  865624ac-f170-4e79-a748-37d8cc02ee2c  1     kelvin-745846bfc6-bglsx  10.168.16.39:59300  AGENT_STATE_HEALTHY  2021-12-01 01:52:27.294379787 +0000 GMT  1623646045         
  a58193bf-4f55-4853-9132-f0bec7519821  4     vm183186                                     AGENT_STATE_HEALTHY  2021-12-01 01:53:15.766434811 +0000 GMT  2588635798         
  bb23c1fe-c274-46a2-ad16-abfe088155b3  2     vm183185                                     AGENT_STATE_HEALTHY  2021-12-01 01:53:04.856278598 +0000 GMT  3641795561         

==>  Live UI: https://work.dev.withpixie.dev:443/live/clusters/kubernetes-admin@kubernetes?script=px%2Fagent_status
[root@vm183185 pixie_cli]# px get viziers
Pixie CLI
*******************************
* IN TESTING MODE
*    PL_CLOUD_ADDR=dev.withpixie.dev
*******************************
Table ID: viziers
  CLUSTERNAME                  ID                                    K8S VERSION  VIZIER VERSION  LAST HEARTBEAT  PASSTHROUGH  STATUS      STATUS MESSAGE  
  kubernetes-admin@kubernetes  8bd86533-078f-4932-a2f1-8c071c57946a  v1.22.4      0.9.11          1 second ago    true         CS_HEALTHY 
[root@vm183185 pixie_cli]# kubectl logs vizier-pem-27qt7 -n pl
I1201 07:27:18.627548 84708 exec.cc:91] Queries in flight: 0
I1201 07:27:18.627647  7701 exec.cc:51] Executing query: id=e707aeca-26a1-4dfe-bd2d-147c8dc8fe52
I1201 07:27:18.645769  7701 exec.cc:63] Completed query: id=e707aeca-26a1-4dfe-bd2d-147c8dc8fe52
W1201 07:27:23.586112 84708 state_manager.cc:262] Failed to read PID info for pod=6a668241-b2aa-46aa-8626-07a7c6cd0f54, cid= [msg=Failed to open file /sys/fs/cgroup/pids/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6a668241_b2aa_46aa_8626_07a7c6cd0f54.slice/docker-.scope/cgroup.procs]
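Note the empty container ID in that path ("docker-.scope"). A quick sanity check of what the cgroup layout actually looks like on the node (a sketch run on the host; the slice names are taken from the warning above, and the layout differs by container runtime and cgroup version):

# List the per-pod slices the PEM is trying to read.
ls /sys/fs/cgroup/pids/kubepods.slice/kubepods-burstable.slice/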

Logs: pixie_logs_20211201072955.zip


Additional context
While deploying Self-Hosted Pixie, I found the health check got stuck here. It looks like the cloudAddr (api-service.plc.svc.cluster.local:51200) is not reachable from the deploy step, so I commented out that code, and the deploy then succeeded. However, the same address is reachable normally from other pods:

sh-4.2$ curl -v https://api-service.plc.svc.cluster.local:51200/healthz -k
* About to connect() to api-service.plc.svc.cluster.local port 51200 (#0)
*   Trying 10.100.228.181...
* Connected to api-service.plc.svc.cluster.local (10.100.228.181) port 51200 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* skipping SSL peer certificate verification
* SSL connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
* Server certificate:
*   subject: CN=pixie.local,O=Pixie
*   start date: Dec 01 01:24:44 2021 GMT
*   expire date: Dec 01 01:24:44 2022 GMT
*   common name: pixie.local
*   issuer: CN=pixie.local,O=Pixie
> GET /healthz HTTP/1.1
> User-Agent: curl/7.29.0
> Host: api-service.plc.svc.cluster.local:51200
> Accept: */*
> 
< HTTP/1.1 200 OK
< Date: Thu, 02 Dec 2021 02:17:43 GMT
< Content-Length: 35
< Content-Type: text/plain; charset=utf-8
< 
OK
[+]ping OK
healthz check passed
* Connection #0 to host api-service.plc.svc.cluster.local left intact
sh-4.2$ exit

I am not sure whether that change caused the current problem. I redeployed Pixie with an unmodified Pixie CLI; the health check failed again, and running scripts still returns nothing.

zongjiangU commented 2 years ago

I updated Vizier to 0.9.16, but Pixie still does not work. I see the messages below in pl__vizier-cloud-connector.log:

time="2021-12-13T09:51:54Z" level=info msg="health check returned unhealthy" error="missing vizier healthcheck response"
time="2021-12-13T09:51:54Z" level=info msg="HTTP Request" req_method=GET req_path=/statusz resp_code=503 resp_size=17 resp_status="Service Unavailable" resp_time="135.996µs"
time="2021-12-13T09:51:55Z" level=info msg="failed vizier health check, will restart healthcheck" error="rpc error: code = Unavailable desc = last connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.106.176.80:50300: connect: connection refused\""time="2021-12-13T09:51:55Z" level=info msg="[core] Subchannel Connectivity change to CONNECTING" system=systemtime="2021-12-13T09:51:55Z" level=info msg="[core] Subchannel picks a new address \"vizier-query-broker-svc.pl.svc:50300\" to connect" system=systemtime="2021-12-13T09:51:55Z" level=info msg="[balancer] base.baseBalancer: handle SubConn state change: 0xc0004fab50, CONNECTING" system=systemtime="2021-12-13T09:51:55Z" level=info msg="[core] Subchannel Connectivity change to READY" system=system
time="2021-12-13T09:51:55Z" level=info msg="[balancer] base.baseBalancer: handle SubConn state change: 0xc0004fab50, READY" system=systemtime="2021-12-13T09:51:55Z" level=info msg="[roundrobin] roundrobinPicker: newPicker called with info: {map[0xc0004fab50:{{vizier-query-broker-svc.pl.svc:50300  <nil> 0 <nil>}}]}" system=systemtime="2021-12-13T09:51:55Z" level=info msg="[core] Channel Connectivity change to READY" system=system
time="2021-12-13T09:52:14Z" level=info msg="health check returned unhealthy" error="vizier healthcheck is failing: "
time="2021-12-13T09:52:14Z" level=info msg="HTTP Request" req_method=GET req_path=/statusz resp_code=503 resp_size=17 resp_status="Service Unavailable" resp_time="89.953µs"

I checked, and vizier-query-broker looks good:

[root@vm183185 pixie]# kubectl get svc -n pl -o wide
NAME                         TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                        AGE   SELECTOR
kelvin-service               ClusterIP   None             <none>        59300/TCP                                      21m   app=pl-monitoring,component=vizier,name=kelvin
pl-etcd                      ClusterIP   None             <none>        2379/TCP,2380/TCP                              21m   app=pl-monitoring,etcd_cluster=pl-etcd
pl-etcd-client               ClusterIP   10.111.176.179   <none>        2379/TCP                                       21m   app=pl-monitoring,etcd_cluster=pl-etcd
pl-nats                      ClusterIP   10.98.184.18     <none>        4222/TCP                                       21m   app=pl-monitoring,name=pl-nats
pl-nats-mgmt                 ClusterIP   None             <none>        6222/TCP,8222/TCP,7777/TCP,7422/TCP,7522/TCP   21m   app=pl-monitoring,name=pl-nats
vizier-certmgr-svc           ClusterIP   10.110.103.154   <none>        50900/TCP                                      21m   app=pl-monitoring,component=vizier,name=vizier-certmgr
vizier-cloud-connector-svc   ClusterIP   10.104.84.224    <none>        50800/TCP                                      21m   app=pl-monitoring,component=vizier,name=vizier-cloud-connector,vizier-bootstrap=true
vizier-metadata-svc          ClusterIP   10.106.103.42    <none>        50400/TCP                                      20m   app=pl-monitoring,component=vizier,name=vizier-metadata
vizier-proxy-service         NodePort    10.110.116.154   <none>        443:31721/TCP                                  20m   app=pl-monitoring,component=vizier,name=vizier-proxy
vizier-query-broker-svc      ClusterIP   10.106.176.80    <none>        50300/TCP,50305/TCP                            20m   app=pl-monitoring,component=vizier,name=vizier-query-broker
[root@vm183185 pixie]# kubectl get pod -n pl -o wide
NAME                                      READY   STATUS    RESTARTS      AGE   IP               NODE       NOMINATED NODE   READINESS GATES
kelvin-5b66bcdd9f-fkmns                   1/1     Running   0             30m   10.168.104.239   vm183186   <none>           <none>
pl-etcd-0                                 1/1     Running   0             30m   10.168.16.15     vm183184   <none>           <none>
pl-etcd-1                                 1/1     Running   3 (29m ago)   30m   10.168.175.39    vm183185   <none>           <none>
pl-etcd-2                                 1/1     Running   1 (30m ago)   30m   10.168.175.28    vm183185   <none>           <none>
pl-nats-0                                 1/1     Running   0             30m   10.168.104.212   vm183186   <none>           <none>
vizier-certmgr-558bb8675-6lxhn            1/1     Running   0             30m   10.168.104.216   vm183186   <none>           <none>
vizier-cloud-connector-6f67899c57-j75r5   1/1     Running   0             30m   10.168.16.49     vm183184   <none>           <none>
vizier-metadata-8b5847f56-5vbmk           1/1     Running   0             30m   10.168.16.22     vm183184   <none>           <none>
vizier-pem-4d6pr                          1/1     Running   0             30m   10.215.183.186   vm183186   <none>           <none>
vizier-pem-h4htn                          1/1     Running   0             30m   10.215.183.185   vm183185   <none>           <none>
vizier-pem-z5ntf                          1/1     Running   0             30m   10.215.183.184   vm183184   <none>           <none>
vizier-proxy-6f79d9897f-tdtnv             1/1     Running   0             30m   10.168.175.53    vm183185   <none>           <none>
vizier-query-broker-6f97bf454b-g8ng2      1/1     Running   0             30m   10.168.16.20     vm183184   <none>           <none>
[root@vm183185 pixie]# 
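Since the cloud connector reported "connection refused" dialing 10.106.176.80:50300, which is vizier-query-broker-svc above, it is also worth confirming plain TCP reachability from inside the cluster. A sketch using the same in-pod curl approach as the earlier healthz check (curl's telnet:// scheme just opens a raw TCP connection and reports whether the connect succeeded):

# From a shell inside any pod in the cluster:
curl -v telnet://vizier-query-broker-svc.pl.svc:50300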
Running the script still returns nothing (the first attempt below fails only because the script name is mistyped):
[root@vm183186 mnt]# px run px/namepaces
Pixie CLI
*******************************
* IN TESTING MODE
*    PL_CLOUD_ADDR=dev.withpixie.dev
*******************************
Failed to get script error=Script not found
[root@vm183186 mnt]# px run px/namespaces
Pixie CLI
*******************************
* IN TESTING MODE
*    PL_CLOUD_ADDR=dev.withpixie.dev
*******************************
Table ID: Process Stats Overview by Namespace
  NAMESPACE  AVG VSIZE  AVG RSS  ACTUAL DISK READ THROUGHPUT  ACTUAL DISK WRITE THROUGHPUT  TOTAL DISK READ THROUGHPUT  TOTAL DISK WRITE THROUGHPUT  
Table ID: Namespaces
  NAMESPACE  POD COUNT  SERVICE COUNT  

==>  Live UI: https://work.dev.withpixie.dev:443/live/clusters/kubernetes-admin@kubernetes?script=px%2Fnamespaces&start_time=-5m
[root@vm183186 mnt]# 

pixie_logs_20211213103207.zip

How should I troubleshoot this problem?

oazizi000 commented 2 years ago

There are two suspicious items in the logs:

1) gRPC-related errors. A GOAWAY with ENHANCE_YOUR_CALM / "too_many_pings" generally means the client is sending keepalive pings more often than the server's keepalive enforcement policy allows:

E1213 10:11:27.587925964   44355 chttp2_transport.cc:1080]   Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
E1213 10:11:27.588690 44355 grpc_sink_node.cc:202] GRPCSinkNode 254 in query 184df8a9-4710-407f-8498-a9c39e2645a8: Error calling Finish on stream, message: Socket closed
I1213 10:11:27.588757 44355 grpc_sink_node.cc:215] Closing GRPCSinkNode 252 in query 184df8a9-4710-407f-8498-a9c39e2645a8 before receiving EOS
E1213 10:11:27.588794 44355 grpc_sink_node.cc:202] GRPCSinkNode 252 in query 184df8a9-4710-407f-8498-a9c39e2645a8: Error calling Finish on stream, message: Too many pings
E1213 10:11:27.589136 44355 exec.cc:59] Query 184df8a9-4710-407f-8498-a9c39e2645a8 failed, reason: Internal : GRPCSinkNode 254 encountered error sending stream to address 10.168.104.239:59300,

Adding @aimichelle for that part.

2) Cgroup path resolver errors:

E1213 09:52:37.214429 32544 cgroup_metadata_reader.cc:44] Failed to create path resolver. Falling back to legacy path resolver. [error = Not Found : Could not find self as a template.]
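The "Could not find self as a template" error suggests the resolver failed to locate the PEM's own cgroup entry when inferring the path template. One way to see what the PEM actually reports for its cgroups (a sketch: the pod name is taken from the listing above, it assumes the PEM image ships cat, and the output format differs between cgroup v1 and v2):

# Print the cgroup memberships of the PEM's main process.
kubectl exec -n pl vizier-pem-h4htn -- cat /proc/1/cgroup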