Visibility logs with `Result: Unknown error` for network operations

DelusionalOptimist commented 1 year ago

Bug Report

General Information

Environment description (GKE, VM-Kubeadm, vagrant-dev-env, minikube, microk8s, ...) - any
Kernel version (run uname -a) - 5.15.0-46-generic
Orchestration system version in use (e.g. kubectl version, ...) - 1.23.17
Link to relevant artifacts (policies, deployments scripts, ...) - NA
Target containers/pods - NA

To Reproduce

Deploy kubearmor
Deploy multiubuntu (or any other server of your choice).
Try to see Network operation logs using karmor logs --logFilter=all --operation=Network.
Exec into one of the ubuntu containers and try to curl localhost:8000 which would create a TCP_ACCEPT syscall by the running python server
You'll find a log with Result: Unknown error in logs even though the request passes.

Expected behavior The log should have Result: Passed instead.

Logs

{
  "Timestamp": 1678430968,
  "UpdatedTime": "2023-03-10T06:49:28.840099Z",
  "ClusterName": "default",
  "HostName": "kubearmor-dev-next",
  "NamespaceName": "multiubuntu",
  "PodName": "ubuntu-1-deployment-5bd4dff469-fwn2v",
  "Labels": "container=ubuntu-1,group=group-1",
  "ContainerID": "1584076e2121453ceae9de662c834d73c6bd9a2357432d77bf9527d441b00c13",
  "ContainerName": "ubuntu-1-container",
  "ContainerImage": "docker.io/kubearmor/ubuntu-w-utils:0.1@sha256:b4693b003ed1fbf7f5ef2c8b9b3f96fd853c30e1b39549cf98bd772fbd99e260",
  "ParentProcessName": "/bin/bash",
  "ProcessName": "/usr/bin/python2.7",
  "HostPPID": 8858,
  "HostPID": 8962,
  "PPID": 1,
  "PID": 7,
  "Type": "ContainerLog",
  "Source": "/usr/bin/python2.7",
  "Operation": "Network",
  "Resource": "remoteip=127.0.0.1 port=8000 protocol=TCP",
  "Data": "kprobe=tcp_accept domain=AF_INET",
  "Result": "Unknown error"
}

Amishakumari544 commented 1 year ago

Can I work on this?

DelusionalOptimist commented 1 year ago

Hey @Amishakumari544, sure. Can you start out by trying to reproduce this on your system and analyzing what might be happening?

Amishakumari544 commented 1 year ago

yes sure, I will first try to see what happens in my system.

xiao-jay commented 1 year ago

i found all "Result":"Unknown error" ,the data is "Data": "kprobe=tcp_accept,maybe they are related.

== Log / 2023-03-16 12:44:07.599978 ==
ClusterName: default
HostName: yyj-test-node1
NamespaceName: cert-manager
PodName: cert-manager-webhook-8bc4cf7d8-r659d
Labels: app.kubernetes.io/instance=cert-manager,app.kubernetes.io/name=webhook,app.kubernetes.io/version=v1.8.0,app=webhook,app.kubernetes.io/component=webhook
ContainerName: cert-manager
ContainerID: 77d19de9a3e6cb351699208ef231ba5f9f7ad56893e2149230ae4b1f08b90c77
ContainerImage: sealos.hub:5000/jetstack/cert-manager-webhook:v1.8.0@sha256:fd798a5a773e5a69880caa60f050d3ffd237c4c3ec3289aacaf59f938c8d7668
Type: ContainerLog
Source: /app/cmd/webhook/webhook
Resource: remoteip=10.35.128.65 port=6080 protocol=TCPv6
Operation: Network
Data: kprobe=tcp_accept domain=AF_INET6
Result: Unknown error
HostPID: 4.1131e+06
HostPPID: 4.105252e+06
PID: 1
PPID: 4.105252e+06
ParentProcessName: /usr/bin/containerd-shim-runc-v2
ProcessName: /app/cmd/webhook/webhook
UID: 1000

xiao-jay commented 1 year ago

May be know errorno can solve the problem ? @DelusionalOptimist

DelusionalOptimist commented 1 year ago

Thank you for looking into this @xiao-jay. :rocket: Yes, getting the error code in logs would definitely help us debug this. However if you notice, even though the requests pass successfully, we're still getting a negative error code. This shouldn't happen as passed commands have a 0 error code. So the problem here might be related to the data captured for tcp_accept syscall in system monitor.

xiao-jay commented 1 year ago

@DelusionalOptimist If I want to build the newest code,and run karmor logs --logFilter=all --operation=Networkto get the newest result. Please tell me how to do.

DelusionalOptimist commented 1 year ago

@xiao-jay The best way to test your changes to kubearmor is:

Have any pre-existing kubernetes cluster (k3s, kubeadm)
Run kubectl proxy &
Build using make run in the KubeArmor directory.

This should run kubearmor locally for you. Also, we generally prefer developing with kubearmor in vagrant env which takes care of setting all the dependencies. See - https://github.com/kubearmor/KubeArmor/blob/main/contribution/development_guide.md

Once you have that, you should be able to reproduce this issue.

xiao-jay commented 1 year ago

@DelusionalOptimist kubectl get pod

``` root@yyj-test-master1:/home/ubuntu# kubectl get pod -A NAMESPACE NAME READY STATUS RESTARTS AGE accuknox-agents discovery-engine-784b54897b-lkrkn 1/1 Running 1 (3d9h ago) 7d11h calico-system calico-kube-controllers-6b44b54755-j4nvg 1/1 Running 10 (3d5h ago) 4d1h calico-system calico-node-7ww2h 1/1 Running 6 (3d5h ago) 4d1h calico-system calico-node-nn86s 1/1 Running 9 (3d5h ago) 4d1h calico-system calico-typha-5d687fb4d7-c4v88 1/1 Running 5 (3d5h ago) 4d1h cert-manager cert-manager-cainjector-ccb9bc698-r92xr 1/1 Running 7 (3d5h ago) 7d11h cert-manager cert-manager-cc5954c5f-lvk29 1/1 Running 7 (3d5h ago) 7d11h cert-manager cert-manager-webhook-8bc4cf7d8-r659d 1/1 Running 4 (3d5h ago) 7d11h default nginx-deployment-58487c6fb6-h8kxq 1/1 Running 1 (3d9h ago) 7d11h default nginx-deployment-58487c6fb6-ndrtd 1/1 Running 1 (3d9h ago) 7d11h kube-system coredns-6d4b75cb6d-5pjc7 1/1 Running 1 (3d9h ago) 4d1h kube-system coredns-6d4b75cb6d-dzcg2 1/1 Running 3 (3d5h ago) 4d1h kube-system etcd-yyj-test-master1 1/1 Running 2 (3d9h ago) 4d1h kube-system kube-apiserver-yyj-test-master1 1/1 Running 6 (2d23h ago) 4d1h kube-system kube-controller-manager-yyj-test-master1 1/1 Running 30 (3d9h ago) 4d1h kube-system kube-proxy-57rz8 1/1 Running 1 (3d9h ago) 4d1h kube-system kube-proxy-t8kwb 1/1 Running 1 (3d9h ago) 4d1h kube-system kube-scheduler-yyj-test-master1 1/1 Running 32 (3d9h ago) 4d1h kube-system kube-sealos-lvscare-yyj-test-node1 1/1 Running 10 (3d9h ago) 4d1h kube-system kubearmor-annotation-manager-96d6f45b6-29tlf 2/2 Running 0 12m kube-system kubearmor-host-policy-manager-684c8f47-wm8dc 2/2 Running 0 11m kube-system kubearmor-policy-manager-5b95f8b79b-jcczz 2/2 Running 0 11m kube-system kubearmor-relay-64c6fff875-vq6jt 1/1 Running 0 16m tigera-operator tigera-operator-59975d9967-9wk48 1/1 Running 333 (2d1h ago) 7d11h wordpress-mysql mysql-6c6fcdccf-9bz5r 1/1 Running 1 (3d9h ago) 6d22h wordpress-mysql wordpress-84dbf54bb8-fbn4w 1/1 Running 217 (2d1h ago) 6d22h ```

I get many error from calico,the error code is very big

root@yyj-test-master1:/home/ubuntu/kubearmor/KubeArmor/KubeArmor/monitor# karmor logs --logFilter=all --operation=Network|grep Unknown  -C 10
local port to be used for port forwarding kubearmor-relay-64c6fff875-vq6jt: 32825 
Created a gRPC client (localhost:32825)
Checked the liveness of the gRPC server
Started to watch alerts
Started to watch logs
PodName: calico-node-7ww2h
Labels: k8s-app=calico-node
ContainerName: calico-node
ContainerID: 893549bf092886a768822fadb3391c759db757826231301bc47dd655445e4dcb
ContainerImage: sealos.hub:5000/calico/node:v3.22.1@sha256:1f8ed83e5264b4206cce7e1def11bca0b3ea7d5f4eb9b0ca0dbfc8cb968ca57e
Type: ContainerLog
Source: /usr/bin/calico-node
Resource: remoteip=127.0.0.1 port=9099 protocol=TCP
Operation: Network
Data: kprobe=tcp_accept domain=AF_INET
Result: Unknown (-129904102617408)
HostPID: 114089
HostPPID: 114074
PID: 87
PPID: 72
ParentProcessName: /usr/local/bin/runsv
ProcessName: /usr/bin/calico-node
== Log / 2023-03-21 13:11:10.249191 ==
ClusterName: default
HostName: yyj-test-master1
Type: HostLog
--
PodName: calico-typha-5d687fb4d7-c4v88
Labels: k8s-app=calico-typha
ContainerName: calico-typha
ContainerID: 36f6878dfe152b239a5f0b8608dad00b89fa8bb0915d4c1bf260b93bb44003ba
ContainerImage: sealos.hub:5000/calico/typha:v3.22.1@sha256:e36532ff56568fc324fa93f3e0a0005f899564ab72903c035f8f4d4e9378b6b9
Type: ContainerLog
Source: /code/calico-typha
Resource: remoteip=127.0.0.1 port=9098 protocol=TCP
Operation: Network
Data: kprobe=tcp_accept domain=AF_INET
Result: Unknown (-129904102630848)
HostPID: 113684
HostPPID: 113671
PID: 7
PPID: 1
ParentProcessName: /sbin/tini
ProcessName: /code/calico-typha
UID: 999
== Log / 2023-03-21 13:11:10.253316 ==
ClusterName: default
HostName: yyj-test-master1
--
PodName: calico-typha-5d687fb4d7-c4v88
Labels: k8s-app=calico-typha
ContainerName: calico-typha
ContainerID: 36f6878dfe152b239a5f0b8608dad00b89fa8bb0915d4c1bf260b93bb44003ba
ContainerImage: sealos.hub:5000/calico/typha:v3.22.1@sha256:e36532ff56568fc324fa93f3e0a0005f899564ab72903c035f8f4d4e9378b6b9
Type: ContainerLog
Source: /code/calico-typha
Resource: remoteip=127.0.0.1 port=9098 protocol=TCP
Operation: Network
Data: kprobe=tcp_accept domain=AF_INET
Result: Unknown (-129904435649792)
HostPID: 113684
HostPPID: 113671
PID: 7
PPID: 1
ParentProcessName: /sbin/tini
ProcessName: /code/calico-typha
UID: 999
== Log / 2023-03-21 13:11:10.417936 ==
ClusterName: default
HostName: yyj-test-master1
--
PodName: calico-node-7ww2h
Labels: k8s-app=calico-node
ContainerName: calico-node
ContainerID: 893549bf092886a768822fadb3391c759db757826231301bc47dd655445e4dcb
ContainerImage: sealos.hub:5000/calico/node:v3.22.1@sha256:1f8ed83e5264b4206cce7e1def11bca0b3ea7d5f4eb9b0ca0dbfc8cb968ca57e
Type: ContainerLog
Source: /usr/bin/calico-node
Resource: remoteip=127.0.0.1 port=9099 protocol=TCP
Operation: Network
Data: kprobe=tcp_accept domain=AF_INET
Result: Unknown (-129902512805632)
HostPID: 114089
HostPPID: 114074
PID: 87
PPID: 72
ParentProcessName: /usr/local/bin/runsv
ProcessName: /usr/bin/calico-node
== Log / 2023-03-21 13:11:20.257554 ==
ClusterName: default
HostName: yyj-test-master1
Type: HostLog

DelusionalOptimist commented 1 year ago

@xiao-jay right, which is the problem we're trying to solve. You're getting these logs from Calico as it is making tcp_accept syscalls. The syscall pass successfully so these logs should be having Result: Passed. However the logs are having this big error code instead. You can also tryout by sending a curl request to a server. (try nginx running in your setup) When the server accepts your requests it'll create a tcp_accept syscall and kubearmor will capture it, and show it in the logs.

xiao-jay commented 1 year ago

@xiao-jay right, which is the problem we're trying to solve. You're getting these logs from Calico as it is making tcp_accept syscalls. The syscall pass successfully so these logs should be having Result: Passed. However the logs are having this big error code instead. You can also tryout by sending a curl request to a server. (try nginx running in your setup) When the server accepts your requests it'll create a tcp_accept syscall and kubearmor will capture it, and show it in the logs.

Thank you for your suggestion，i will try it now.

xiao-jay commented 1 year ago

I try exec curl localhost:8000 at multiubuntu pod inside,but get tcp_connect and passed result.For this very big retva number I found some regularity. If you convert them to binary, they have 48 bits, the first 16 bits are same.

-11101100010010110110111000010110011101000000000
-11101100010010110100110101000010101110100000000
-11101100010010101001001000100011100001011000000

I am not familar with BPF and the c code.So it's hard for me to solve the problem single. I found the code about accopt, found context.retval = PT_REGS_RC(ctx);,retval is set by PT_REGS_RC(ctx) func ,but I not found the PT_REGS_RC func info, so could you please give me some help for the c code and BPF. @DelusionalOptimist code in system_monitor.c

``` SEC("kretprobe/__x64_sys_inet_csk_accept") int kretprobe__inet_csk_accept(struct pt_regs *ctx) { if (skip_syscall()) return 0; struct sock *newsk = (struct sock *)PT_REGS_RC(ctx); if (newsk == NULL) return 0; // Code from https://github.com/iovisor/bcc/blob/master/tools/tcpaccept.py with adaptations u16 protocol = 1; int gso_max_segs_offset = offsetof(struct sock, sk_gso_max_segs); int sk_lingertime_offset = offsetof(struct sock, sk_lingertime); if (sk_lingertime_offset - gso_max_segs_offset == 2) #if LINUX_VERSION_CODE >= KERNEL_VERSION(5, 6, 0) protocol = READ_KERN(newsk->sk_protocol); #else protocol = newsk->sk_protocol; #endif else if (sk_lingertime_offset - gso_max_segs_offset == 4) // 4.10+ with little endian #if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__ protocol = READ_KERN(*(u8 *)((u64)&newsk->sk_gso_max_segs - 3)); else // pre-4.10 with little endian protocol = READ_KERN(*(u8 *)((u64)&newsk->sk_wmem_queued - 3)); #elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__ // 4.10+ with big endian protocol = READ_KERN(*(u8 *)((u64)&newsk->sk_gso_max_segs - 1)); else // pre-4.10 with big endian protocol = READ_KERN(*(u8 *)((u64)&newsk->sk_wmem_queued - 1)); #else #error "Fix your compiler's __BYTE_ORDER__?!" #endif if (protocol != IPPROTO_TCP) return 0; struct sock_common conn = READ_KERN(newsk->__sk_common); struct sockaddr_in sockv4; struct sockaddr_in6 sockv6; sys_context_t context = {}; args_t args = {}; u64 types = ARG_TYPE0(STR_T) | ARG_TYPE1(SOCKADDR_T); init_context(&context); context.argnum = get_arg_num(types); context.retval = PT_REGS_RC(ctx); if (context.retval >= 0 && drop_syscall(_NETWORK_PROBE)) { return 0; } if (get_connection_info(&conn, &sockv4, &sockv6, &context, &args, _TCP_ACCEPT) != 0) { return 0; } args.args[0] = (unsigned long)conn.skc_prot->name; set_buffer_offset(DATA_BUF_TYPE, sizeof(sys_context_t)); bufs_t *bufs_p = get_buffer(DATA_BUF_TYPE); if (bufs_p == NULL) return 0; save_context_to_buffer(bufs_p, (void *)&context); save_args_to_buffer(types, &args); events_perf_submit(ctx); return 0; } ```

DelusionalOptimist commented 1 year ago

cc @daemon1024

DelusionalOptimist commented 1 year ago

@xiao-jay I'm not really sure about the above. In the meantime, you should try looking into https://github.com/kubearmor/KubeArmor/pull/1087. It seems to be working fine before that.

xiao-jay commented 1 year ago

Thank you for your suggestion,I will try to find out the reason.

DelusionalOptimist commented 1 year ago

Hey @xiao-jay were you able to find anything?

stefin9898 commented 1 year ago

/assign

stefin9898 commented 1 year ago

https://github.com/kubearmor/KubeArmor/blob/4a0d0af2a5d647bf54e7069c4a2ce81e3bacdc09/KubeArmor/BPF/system_monitor.c#L1873 Here the function is instrumented to be a kprobe but using PT_REGS_RC(ctx) which is only supported in kretprobe. I suspect this could be the reason for the uninitialized retvalue in the monitored events. CC: @DelusionalOptimist

daemon1024 commented 1 year ago

@stefin9898 Makes sense, we should be extracting RC from kretprobe only and send to perf buffer from kretprobe.

stefin9898 commented 1 year ago

@daemon1024 I tried marking the function as a kretprobe still I see the retval as something uninitialized. Edit: At this stage https://github.com/kubearmor/KubeArmor/blob/4a0d0af2a5d647bf54e7069c4a2ce81e3bacdc09/KubeArmor/BPF/system_monitor.c#L1838 would it be safe to assign the retvalue to 0 manually or would that be a NO ?

xiao-jay commented 1 year ago

https://github.com/kubearmor/KubeArmor/blob/4a0d0af2a5d647bf54e7069c4a2ce81e3bacdc09/KubeArmor/BPF/system_monitor.c#L1873

Here the function is instrumented to be a kprobe but using PT_REGS_RC(ctx) which is only supported in kretprobe. I suspect this could be the reason for the uninitialized retvalue in the monitored events. 这里函数被检测为 kprobe 但使用 PT_REGS_RC(ctx) 仅在 kretprobe 中受支持。我怀疑这可能是监视事件中未初始化重值的原因。

CC: @DelusionalOptimist 抄送： @DelusionalOptimist

@stefin9898 this error only appear in tcp_accept,your code url use in tcp_connect,may be the right code is (https://github.com/kubearmor/KubeArmor/blob/4a0d0af2a5d647bf54e7069c4a2ce81e3bacdc09/KubeArmor/BPF/system_monitor.c#L1946

because this use _TCP_ACCEPT https://github.com/kubearmor/KubeArmor/blob/4a0d0af2a5d647bf54e7069c4a2ce81e3bacdc09/KubeArmor/BPF/system_monitor.c#L1953

xiao-jay commented 1 year ago

@xiao-jay I'm not really sure about the above. In the meantime, you should try looking into #1087. It seems to be working fine before that.

I know why after this pr will appear error

xiao-jay commented 1 year ago

Hey @xiao-jay were you able to find anything?

I know nothing about ebpf,Please ask @stefin9898 to help solve this problem.I hope my previous comments can help you.

daemon1024 commented 1 year ago

@xiao-jay Thanks a lot for pointing to the right direction. I checked the implementation

https://github.com/torvalds/linux/blob/865fdb08197e657c59e74a35fa32362b12397f58/net/ipv4/inet_connection_sock.c#LL657C11-L657C11

It doesn't return error code directly, We might need to extract the error code in a different way from the function. That's why we are seeing ambigous values as Return Codes. @xiao-jay This helps a lot, Thank you again.

kubearmor / KubeArmor

Visibility logs with `Result: Unknown error` for network operations #1148

Bug Report