pixie-io / pixie

Instant Kubernetes-Native Application Observability
https://px.dev
Apache License 2.0

Please add support for debian 12.0 #1627

Open ztdawang opened 1 year ago

ztdawang commented 1 year ago

vizier-pem reports errors in the following environment: Debian 12.0 + k8s 1.26 + Cilium 1.13.4 + cgroup v2

W20230720 02:33:23.604429 521629 state_manager.cc:277] Failed to read PID info for pod=f3c1efa9-5f0c-4d7c-b79d-f335e2d4ce7b, cid= [msg=Failed to open file /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podf3c1efa9_5f0c_4d7c_b79d_f335e2d4ce7b.slice/cri-containerd-.scope/cgroup.procs]

W20230720 02:33:23.605147 521629 state_manager.cc:277] Failed to read PID info for pod=80f4c2f9-2479-4e6d-95a5-bbcf3a026001, cid=d5115e549992716ada3ae413ae3311be5ad285feb6762ca33f294dce3aa22e5d [msg=Failed to open file /sys/fs/cgroup/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod80f4c2f9_2479_4e6d_95a5_bbcf3a026001.slice/cri-containerd-d5115e549992716ada3ae413ae3311be5ad285feb6762ca33f294dce3aa22e5d.scope/cgroup.procs]
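(For reference, the paths in these warnings can be checked directly on the affected node. A minimal sketch, assuming the containerd + systemd cgroup v2 layout shown in the messages above:)

# List the per-container cgroup.procs files that vizier-pem reads to map containers to PIDs (run on the node).
$ find /sys/fs/cgroup/kubepods.slice -name cgroup.procs | head
# Spot-check one of the paths from the warnings; a missing path for a short-lived pod is expected,
# but a missing or empty file for a long-running pod would point to a real problem.
$ cat /sys/fs/cgroup/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod80f4c2f9_2479_4e6d_95a5_bbcf3a026001.slice/cri-containerd-d5115e549992716ada3ae413ae3311be5ad285feb6762ca33f294dce3aa22e5d.scope/cgroup.procs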

kubectl -n pl logs -f vizier-metadata-dc587cd79-8c6h9

time="2023-07-20T04:34:49Z" level=info msg="[transport] transport: loopyWriter.run returning. connection error: desc = \"transport is closing\"" system=system


A second question: does Pixie support Cilium DSR mode, i.e. tunnel=disabled?

ddelnano commented 1 year ago

Hi @ztdawang, can you confirm that the output above is the full log? If not, please provide the entire thing.

Can you explain what behavior you are expecting to see? My understanding is that this error message in isolation is not a problem (short-lived processes/pods do happen and can cause it).

does pixie support cilium dsr mode, i.e. tunnel=disabled??

I haven't used Cilium before, but from my brief research I believe Pixie should work with DSR mode. For many of Pixie's visualizations (PxL scripts), we identify the source by its pod name. That association may not work, but the protocol tracing should still see all requests and responses to and from a given pod.
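One hedged way to sanity-check this (assuming the px CLI is installed and pointed at the cluster) is to run one of the bundled protocol-tracing scripts and confirm that requests and responses show up even if pod-name attribution is off:

# Stream HTTP spans traced by Pixie; other bundled scripts (e.g. px/mysql_data) work the same way.
$ px run px/http_data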

ztdawang commented 1 year ago

[screenshot]

The symptom is that none of Pixie's scripts return any observability data.

ddelnano commented 1 year ago

@ztdawang what protocol traffic are you expecting to see? I'm not sure I see anything that points to a Debian-specific issue. The more details you can provide about what is running on the cluster and what protocol data should be visible, the better we can judge where to take a deeper look.

ztdawang commented 1 year ago

I can't get any data in the web UI. I deployed Pixie in self-hosted mode.

ddelnano commented 1 year ago

Are you able to explain what processes you have running on the machine that Pixie is compatible with? Knowing what data is missing (pgsql, mysql, http), whether it's encrypted or not, and the language the workload is written in would help guide where we look.
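If the cluster has nothing running yet, it may also help to deploy a small plaintext HTTP workload and generate a few requests so there is known traffic to trace (a sketch using generic kubectl commands; the names and images here are just examples):

# Deploy a simple HTTP server and a one-shot client that sends it traffic.
$ kubectl create deployment echo --image=nginx
$ kubectl expose deployment echo --port=80
$ kubectl run curl-client --image=curlimages/curl --restart=Never -- sh -c 'for i in $(seq 1 50); do curl -s http://echo >/dev/null; sleep 1; done'
# The requests should then show up in the HTTP tables (e.g. via px/http_data or the Live UI).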

ztdawang commented 1 year ago

My environment is a new k8s cluster with no applications deployed. All pods under the plc and olm namespaces run normally, but in the pl namespace, vizier-pem (the agent) reports the cgroup2-related error shown in the screenshot above. There are two other pods that restart every few tens of minutes because they fail their health checks. I can't remember their names because I'm not in front of the computer right now; I only remember that they are the two pods listed above vizier-pem.

ddelnano commented 1 year ago

At the moment, I don't believe the logs shared above (cgroup2) indicate a problem. However, the other crashing pods are worth looking into further. Would you be able to provide the output of kubectl -n pl get pods and kubectl describe for the crashing pods?
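Concretely, something like this (pod names are placeholders):

$ kubectl -n pl get pods
$ kubectl -n pl describe pod <crashing-pod>
# Logs from the previous (crashed) container instance are usually the most useful part.
$ kubectl -n pl logs <crashing-pod> --previous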

ztdawang commented 1 year ago

[screenshot]

The biggest problem may be that data is not being transferred properly between the agent and the cloud.

$ kubectl -n pl logs -f vizier-cloud-connector-5987cf847-k85rh

time="2023-07-21T21:09:59Z" level=info msg="[core] Channel authority set to \"vzconn-service.plc.svc.cluster.local:51600\"" system=system time="2023-07-21T21:09:59Z" level=info msg="[core] ccResolverWrapper: sending update to cc: {[{vzconn-service.plc.svc.cluster.local:51600 0 }] }" system=system time="2023-07-21T21:09:59Z" level=info msg="[core] ClientConn switching balancer to \"pick_first\"" system=system time="2023-07-21T21:09:59Z" level=info msg="[core] Channel switches to new LB policy \"pick_first\"" system=system time="2023-07-21T21:09:59Z" level=info msg="[core] Subchannel Connectivity change to CONNECTING" system=system time="2023-07-21T21:09:59Z" level=info msg="[core] Subchannel picks a new address \"vzconn-service.plc.svc.cluster.local:51600\" to connect" system=system time="2023-07-21T21:09:59Z" level=info msg="[core] pickfirstBalancer: UpdateSubConnState: 0xc0004cef60, {CONNECTING }" system=system time="2023-07-21T21:09:59Z" level=info msg="[core] Channel Connectivity change to CONNECTING" system=system time="2023-07-21T21:09:59Z" level=info msg="[core] Subchannel Connectivity change to READY" system=system time="2023-07-21T21:09:59Z" level=info msg="[core] pickfirstBalancer: UpdateSubConnState: 0xc0004cef60, {READY }" system=system time="2023-07-21T21:09:59Z" level=info msg="[core] Channel Connectivity change to READY" system=system time="2023-07-21T21:09:59Z" level=info msg="Successfully connected to Pixie Cloud via VZConn" time="2023-07-21T21:09:59Z" level=info msg="Connecting to NATS..." time="2023-07-21T21:09:59Z" level=info msg="Successfully connected to NATS" time="2023-07-21T21:09:59Z" level=info msg="Starting NATS bridge."

$ kubectl -n pl describe po vizier-cloud-connector-5987cf847-k85rh

    Port:           50800/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Sat, 22 Jul 2023 05:09:56 +0800
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Sat, 22 Jul 2023 04:47:25 +0800
      Finished:     Sat, 22 Jul 2023 05:09:55 +0800
    Ready:          True
    Restart Count:  43

$ kubectl -n pl logs -f vizier-metadata-dc587cd79-r66bc --previous

time="2023-07-21T21:29:55Z" level=info msg="[transport] transport: loopyWriter.run returning. connection error: desc = \"transport is closing\"" system=system
E0721 21:32:26.969770 1 leaderelection.go:367] Failed to update lock: resource name may not be empty
I0721 21:32:26.969817 1 leaderelection.go:283] failed to renew lease pl/metadata-election: timed out waiting for the condition
time="2023-07-21T21:32:33Z" level=warning msg="Leadership lost. This can occur when the K8s API has heavy resource utilization or high network latency and fails to respond within 1875ms. This usually resolves by itself after some time. Terminating to retry..."

$ kubectl -n pl describe po vizier-metadata-dc587cd79-r66bc

    Port:
    Host Port:
    State:          Running
      Started:      Sat, 22 Jul 2023 05:32:34 +0800
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Sat, 22 Jul 2023 05:09:55 +0800
      Finished:     Sat, 22 Jul 2023 05:32:33 +0800
    Ready:          True
    Restart Count:  43
    Limits:
      cpu:     1
      memory:  8G
    Requests:
      cpu:     1
      memory:  8G
    Liveness:   http-get https://:50400/healthz delay=120s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get https://:50400/healthz delay=30s timeout=1s period=10s #success=1 #failure=5
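The metadata log above points at the leader-election lease not being renewed in time, which usually means the K8s API server is slow to respond. A couple of hedged checks (assuming the lock is the Lease object pl/metadata-election named in the log):

# Inspect the lease and its renew timestamps.
$ kubectl -n pl get lease metadata-election -o yaml
# Rough check of API server health/latency.
$ kubectl get --raw='/readyz?verbose'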

ztdawang commented 1 year ago

I redeployed the Pixie agents, but the result is still the same: the list of agents shown in the web UI is empty, and there is no data in the Live View.

[screenshots]
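To narrow down whether the PEMs ever register with the metadata service, it may help to look at their logs directly (a sketch; the label selector is a guess, so check kubectl -n pl get pods --show-labels for the real labels):

$ kubectl -n pl get pods --show-labels
# PEM (agent) logs; adjust the selector to whatever the pods actually carry.
$ kubectl -n pl logs -l name=vizier-pem --tail=100
# Cloud connector logs, to confirm the vizier is still registering with Pixie Cloud.
$ kubectl -n pl logs deploy/vizier-cloud-connector --tail=100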

ztdawang commented 1 year ago

[screenshot]

I installed another open-source eBPF-based observability tool and everything works fine. This shows that my k8s cluster itself is fine.

neilkuan commented 11 months ago

[screenshot]

I installed another open-source eBPF-based observability tool and everything works fine. This shows that my k8s cluster itself is fine.

Hello @ztdawang, could you share which tool you used? I really want to try it too. Thank you :)