zongjiangU opened 2 years ago
Vizier updated to 0.9.16, and Pixie still does not work. I got the messages below in pl__vizier-cloud-connector.log:
time="2021-12-13T09:51:54Z" level=info msg="health check returned unhealthy" error="missing vizier healthcheck response"
time="2021-12-13T09:51:54Z" level=info msg="HTTP Request" req_method=GET req_path=/statusz resp_code=503 resp_size=17 resp_status="Service Unavailable" resp_time="135.996µs"
time="2021-12-13T09:51:55Z" level=info msg="failed vizier health check, will restart healthcheck" error="rpc error: code = Unavailable desc = last connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.106.176.80:50300: connect: connection refused\""time="2021-12-13T09:51:55Z" level=info msg="[core] Subchannel Connectivity change to CONNECTING" system=systemtime="2021-12-13T09:51:55Z" level=info msg="[core] Subchannel picks a new address \"vizier-query-broker-svc.pl.svc:50300\" to connect" system=systemtime="2021-12-13T09:51:55Z" level=info msg="[balancer] base.baseBalancer: handle SubConn state change: 0xc0004fab50, CONNECTING" system=systemtime="2021-12-13T09:51:55Z" level=info msg="[core] Subchannel Connectivity change to READY" system=system
time="2021-12-13T09:51:55Z" level=info msg="[balancer] base.baseBalancer: handle SubConn state change: 0xc0004fab50, READY" system=systemtime="2021-12-13T09:51:55Z" level=info msg="[roundrobin] roundrobinPicker: newPicker called with info: {map[0xc0004fab50:{{vizier-query-broker-svc.pl.svc:50300 <nil> 0 <nil>}}]}" system=systemtime="2021-12-13T09:51:55Z" level=info msg="[core] Channel Connectivity change to READY" system=system
time="2021-12-13T09:52:14Z" level=info msg="health check returned unhealthy" error="vizier healthcheck is failing: "
time="2021-12-13T09:52:14Z" level=info msg="HTTP Request" req_method=GET req_path=/statusz resp_code=503 resp_size=17 resp_status="Service Unavailable" resp_time="89.953µs"
I checked and vizier-query-broker looks healthy:
[root@vm183185 pixie]# kubectl get svc -n pl -o wide
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
kelvin-service ClusterIP None <none> 59300/TCP 21m app=pl-monitoring,component=vizier,name=kelvin
pl-etcd ClusterIP None <none> 2379/TCP,2380/TCP 21m app=pl-monitoring,etcd_cluster=pl-etcd
pl-etcd-client ClusterIP 10.111.176.179 <none> 2379/TCP 21m app=pl-monitoring,etcd_cluster=pl-etcd
pl-nats ClusterIP 10.98.184.18 <none> 4222/TCP 21m app=pl-monitoring,name=pl-nats
pl-nats-mgmt ClusterIP None <none> 6222/TCP,8222/TCP,7777/TCP,7422/TCP,7522/TCP 21m app=pl-monitoring,name=pl-nats
vizier-certmgr-svc ClusterIP 10.110.103.154 <none> 50900/TCP 21m app=pl-monitoring,component=vizier,name=vizier-certmgr
vizier-cloud-connector-svc ClusterIP 10.104.84.224 <none> 50800/TCP 21m app=pl-monitoring,component=vizier,name=vizier-cloud-connector,vizier-bootstrap=true
vizier-metadata-svc ClusterIP 10.106.103.42 <none> 50400/TCP 20m app=pl-monitoring,component=vizier,name=vizier-metadata
vizier-proxy-service NodePort 10.110.116.154 <none> 443:31721/TCP 20m app=pl-monitoring,component=vizier,name=vizier-proxy
vizier-query-broker-svc ClusterIP 10.106.176.80 <none> 50300/TCP,50305/TCP 20m app=pl-monitoring,component=vizier,name=vizier-query-broker
[root@vm183185 pixie]# kubectl get pod -n pl -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kelvin-5b66bcdd9f-fkmns 1/1 Running 0 30m 10.168.104.239 vm183186 <none> <none>
pl-etcd-0 1/1 Running 0 30m 10.168.16.15 vm183184 <none> <none>
pl-etcd-1 1/1 Running 3 (29m ago) 30m 10.168.175.39 vm183185 <none> <none>
pl-etcd-2 1/1 Running 1 (30m ago) 30m 10.168.175.28 vm183185 <none> <none>
pl-nats-0 1/1 Running 0 30m 10.168.104.212 vm183186 <none> <none>
vizier-certmgr-558bb8675-6lxhn 1/1 Running 0 30m 10.168.104.216 vm183186 <none> <none>
vizier-cloud-connector-6f67899c57-j75r5 1/1 Running 0 30m 10.168.16.49 vm183184 <none> <none>
vizier-metadata-8b5847f56-5vbmk 1/1 Running 0 30m 10.168.16.22 vm183184 <none> <none>
vizier-pem-4d6pr 1/1 Running 0 30m 10.215.183.186 vm183186 <none> <none>
vizier-pem-h4htn 1/1 Running 0 30m 10.215.183.185 vm183185 <none> <none>
vizier-pem-z5ntf 1/1 Running 0 30m 10.215.183.184 vm183184 <none> <none>
vizier-proxy-6f79d9897f-tdtnv 1/1 Running 0 30m 10.168.175.53 vm183185 <none> <none>
vizier-query-broker-6f97bf454b-g8ng2 1/1 Running 0 30m 10.168.16.20 vm183184 <none> <none>
[root@vm183185 pixie]#
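Since the health check targets the query broker, its own logs are the next place to look; the pods carry the name=vizier-query-broker label shown in the service selector above:

# Tail the query broker logs via its label selector
kubectl logs -n pl -l name=vizier-query-broker --tail=100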
Running the script returns nothing. (The first attempt below misspells the script name, hence the "Script not found" error; the second attempt uses the correct name.)
[root@vm183186 mnt]# px run px/namepaces
Pixie CLI
*******************************
* IN TESTING MODE
* PL_CLOUD_ADDR=dev.withpixie.dev
*******************************
Failed to get script error=Script not found
[root@vm183186 mnt]# px run px/namespaces
Pixie CLI
*******************************
* IN TESTING MODE
* PL_CLOUD_ADDR=dev.withpixie.dev
*******************************
Table ID: Process Stats Overview by Namespace
NAMESPACE AVG VSIZE AVG RSS ACTUAL DISK READ THROUGHPUT ACTUAL DISK WRITE THROUGHPUT TOTAL DISK READ THROUGHPUT TOTAL DISK WRITE THROUGHPUT
Table ID: Namespaces
NAMESPACE POD COUNT SERVICE COUNT
==> Live UI: https://work.dev.withpixie.dev:443/live/clusters/kubernetes-admin@kubernetes?script=px%2Fnamespaces&start_time=-5m
[root@vm183186 mnt]#
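The tables come back with headers but no rows. The CLI can also report Vizier and agent status directly; a quick sanity check (a sketch, assuming these px subcommands are available in this CLI version):

# List connected Viziers and their reported status
px get viziers
# List the PEM agents and their state
px get pems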
How should I troubleshoot this problem?
There are two suspicious items in the logs: 1) gRPC-related errors:
E1213 10:11:27.587925964 44355 chttp2_transport.cc:1080] Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"
E1213 10:11:27.588690 44355 grpc_sink_node.cc:202] GRPCSinkNode 254 in query 184df8a9-4710-407f-8498-a9c39e2645a8: Error calling Finish on stream, message: Socket closed
I1213 10:11:27.588757 44355 grpc_sink_node.cc:215] Closing GRPCSinkNode 252 in query 184df8a9-4710-407f-8498-a9c39e2645a8 before receiving EOS
E1213 10:11:27.588794 44355 grpc_sink_node.cc:202] GRPCSinkNode 252 in query 184df8a9-4710-407f-8498-a9c39e2645a8: Error calling Finish on stream, message: Too many pings
E1213 10:11:27.589136 44355 exec.cc:59] Query 184df8a9-4710-407f-8498-a9c39e2645a8 failed, reason: Internal : GRPCSinkNode 254 encountered error sending stream to address 10.168.104.239:59300,
Adding @aimichelle for that part.
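ENHANCE_YOUR_CALM / too_many_pings is gRPC's keepalive enforcement: a peer closes the connection when the other side pings more often than its keepalive policy allows. The target address 10.168.104.239:59300 matches the Kelvin pod in the listing above, so the failing stream is PEM -> Kelvin; checking whether Kelvin is the side sending the GOAWAY may help (a sketch; the name=kelvin label is taken from the kelvin-service selector above):

# Search the Kelvin logs for keepalive/GOAWAY-related messages
kubectl logs -n pl -l name=kelvin --tail=2000 | grep -iE "goaway|too_many_pings|keepalive"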
2) Cgroup path resolver.
E1213 09:52:37.214429 32544 cgroup_metadata_reader.cc:44] Failed to create path resolver. Falling back to legacy path resolver. [error = Not Found : Could not find self as a template.]
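This error suggests the PEM could not locate its own cgroup entry to use as a template for resolving other pods' cgroup paths. Inspecting what the PEM actually sees can narrow it down (a sketch; substitute any PEM pod name from the listing above):

# Show the cgroup hierarchy a PEM pod is placed in
kubectl exec -n pl vizier-pem-4d6pr -- cat /proc/self/cgroup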
Describe the bug
I get nothing when running scripts, except bpftrace scripts. The Vizier and the agents seem healthy. Self-hosted Pixie was installed with "px deploy --dev_cloud_namespace plc --deploy_olm=false".
Logs
pixie_logs_20211201072955.zip
Additional context
I found the health check got stuck here while deploying self-hosted Pixie. It looks like the cloudAddr (api-service.plc.svc.cluster.local:51200) is not accessible, so I commented out that code and the deploy then succeeded. But other pods can access this domain name normally.
I am not sure whether that change caused the current problem. I redeployed Pixie with an unmodified Pixie CLI; the health check failed and running scripts still returned nothing.
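To rule out a DNS or network-policy issue scoped to the pl namespace, the cloud address can be resolved and probed from a pod in that namespace specifically (a sketch, with the same busybox caveat as above):

# Resolve and probe the dev cloud API address from the pl namespace
kubectl run dns-test -it --rm --restart=Never -n pl --image=busybox -- nslookup api-service.plc.svc.cluster.local
kubectl run nc-test2 -it --rm --restart=Never -n pl --image=busybox -- nc -zv api-service.plc.svc.cluster.local 51200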