netobserv / netobserv-ebpf-agent

Network Observability eBPF Agent

Pod CrashLoopBackOff when deploying on OCP 4.10 on top of IBM ROKS #34

Closed: andresmareca-ibm closed this issue 2 years ago

andresmareca-ibm commented 2 years ago

I'm deploying this agent on an OpenShift cluster managed by IBM Cloud. During the deploy phase I hit an issue with some of the pods.

First I deployed the network-observability-operator following the instructions in its README file.

After that I tried to deploy what is in netobserv-ebpf-agent. Here it failed for the first time due to missing serviceAccount permissions. After adding the following to the serviceAccount, I was able to start the actual pods:

---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: netobserv-clusterrole
rules:
  - apiGroups: [""] # "" indicates the core API group
    resources:
      - nodes
      - nodes/proxy
      - services
      - endpoints
      - pods
    verbs: ["get", "watch", "list"]
  - apiGroups:
      - security.openshift.io
    resourceNames:
      - hostmount-anyuid
      - privileged
    resources:
      - securitycontextconstraints
    verbs:
      - use
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: netobserv-rolebinding
subjects:
  - kind: ServiceAccount
    name: netobserv-account
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: netobserv-clusterrole

This is where it fails a second time. The output of the pods is:

Starting flowlogs-pipeline:
=====
Build Version: -c418a01
Build Date: 2022-06-20 07:55
Using configuration:
{
"PipeLine": "[{\"name\":\"ingest\"},{\"follows\":\"ingest\",\"name\":\"decode\"},{\"follows\":\"decode\",\"name\":\"enrich\"},{\"follows\":\"enrich\",\"name\":\"loki\"}]",
"Parameters": "[{\"ingest\":{\"grpc\":{\"port\":9999},\"type\":\"grpc\"},\"name\":\"ingest\"},{\"decode\":{\"type\":\"protobuf\"},\"name\":\"decode\"},{\"name\":\"enrich\",\"transform\":{\"network\":{\"rules\":[{\"input\":\"SrcAddr\",\"output\":\"SrcK8S\",\"type\":\"add_kubernetes\"},{\"input\":\"DstAddr\",\"output\":\"DstK8S\",\"type\":\"add_kubernetes\"}]},\"type\":\"network\"}},{\"name\":\"loki\",\"write\":{\"loki\":{\"labels\":[\"SrcK8S_Namespace\",\"SrcK8S_OwnerName\",\"DstK8S_Namespace\",\"DstK8S_OwnerName\",\"FlowDirection\"],\"staticLabels\":{\"app\":\"netobserv-flowcollector\"},\"timestampLabel\":\"TimeFlowEndMs\",\"timestampScale\":\"1ms\",\"type\":\"loki\",\"url\":\"http://loki:3100/\"},\"type\":\"loki\"}}]",
"Health": {
"Port": "8080"
}
}
time=2022-06-20T11:14:34Z level=debug msg=config.Opt.PipeLine = [{"name":"ingest"},{"follows":"ingest","name":"decode"},{"follows":"decode","name":"enrich"},{"follows":"enrich","name":"loki"}]
time=2022-06-20T11:14:34Z level=debug msg=stages = [{ingest } {decode ingest} {enrich decode} {loki enrich}]
time=2022-06-20T11:14:34Z level=debug msg=params = [{ingest 0xc0001ebe00 <nil> <nil> <nil> <nil>} {decode <nil> <nil> <nil> <nil> <nil>} {enrich <nil> 0xc0001ebe30 <nil> <nil> <nil>} {loki <nil> <nil> <nil> <nil> 0xc00004f800}]
time=2022-06-20T11:14:34Z level=debug msg=entering SetupElegantExit
time=2022-06-20T11:14:34Z level=debug msg=registered exit signal channel
time=2022-06-20T11:14:34Z level=debug msg=exiting SetupElegantExit
time=2022-06-20T11:14:34Z level=debug msg=entering NewPipeline
time=2022-06-20T11:14:34Z level=debug msg=stages = [{ingest } {decode ingest} {enrich decode} {loki enrich}]
time=2022-06-20T11:14:34Z level=debug msg=configParams = [{ingest 0xc0001ebe00 <nil> <nil> <nil> <nil>} {decode <nil> <nil> <nil> <nil> <nil>} {enrich <nil> 0xc0001ebe30 <nil> <nil> <nil>} {loki <nil> <nil> <nil> <nil> 0xc00004f800}]
time=2022-06-20T11:14:34Z level=debug msg=stage = ingest
time=2022-06-20T11:14:34Z level=debug msg=findStageType: stage = ingest
time=2022-06-20T11:14:34Z level=debug msg=pipeline = [0xc0002a42a0]
time=2022-06-20T11:14:34Z level=debug msg=stage = decode
time=2022-06-20T11:14:34Z level=debug msg=findStageType: stage = decode
time=2022-06-20T11:14:34Z level=fatal msg=failed to initialize pipeline invalid stage type: unknown

Any ideas on how to fix it?

eranra commented 2 years ago

@jotak can you take a look? Is this connected to the latest PR that removed the decode stage?

eranra commented 2 years ago

@andresmareca-ibm thanks for looking into this ... I think this might be more connected to the FLP and NOO repos, but having the issue here is also ok.

jotak commented 2 years ago

Hi @andresmareca-ibm, what are the image versions of the operator and of flowlogs-pipeline? Are they both main? This is likely due to a version mismatch: as @eranra mentioned, there was a recent breaking change that has a corresponding update on the operator side. I guess you don't have the operator patch? If you built and deployed the operator from source, maybe you weren't up to date? Or maybe you still have an old image in your cluster, in which case you may want to double-check that the operator's image pull policy is set to Always.
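
For reference, this is roughly what that looks like in the operator Deployment. Only the fields relevant to the pull policy are shown, and the container name and image tag are illustrative placeholders, not necessarily the exact ones the operator uses:

# Hedged sketch: make sure the cluster re-pulls the operator image instead of
# reusing a stale one. Only the relevant fields of the Deployment are shown;
# the container name and image tag below are illustrative placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: netobserv-controller-manager   # name as seen elsewhere in this thread
spec:
  template:
    spec:
      containers:
        - name: manager                # illustrative container name
          image: quay.io/netobserv/network-observability-operator:main  # illustrative tag
          imagePullPolicy: Always      # always re-pull the image on pod (re)start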

Another option is to use released versions rather than main (but of course you won't get the very latest updates).

jotak commented 2 years ago

By the way,

After that I tried to deploy what is in netobserv-ebpf-agent. Here it failed for the first time due to missing serviceAccount permissions. After adding the following to the serviceAccount, I was able to start the actual pods

@mariomac , any idea about that?

mariomac commented 2 years ago

@andresmareca-ibm could I see your FlowCollector deployment file?

Also, if possible, can I see the netobserv-controller-manager pod logs?

mariomac commented 2 years ago

I just deployed the main version and it worked. Are you using any other version?

[screenshot of the running pods]
mariomac commented 2 years ago

I've been digging into the default permissions that we grant to the eBPF agent. The netobserv-manager-role grants the security context constraints for the host network, but there isn't any RoleBinding that assigns the netobserv agent the permission to use security context constraints. For some reason, in our installations it's granted by default, but not in yours.

In order to reproduce the issue and verify that we provide a patch that will actually work, what version of OpenShift are you using?
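
For illustration, the missing grant would look roughly like the sketch below. Apart from netobserv-manager-role, which is mentioned above, all names are illustrative placeholders rather than the operator's actual ones:

# Hedged sketch: bind the agent's ServiceAccount to a role allowed to `use` the SCC.
# Except for netobserv-manager-role, all names below are illustrative placeholders.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: netobserv-ebpf-agent-scc
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: netobserv-manager-role        # the role mentioned above
subjects:
  - kind: ServiceAccount
    name: netobserv-ebpf-agent        # illustrative ServiceAccount name
    namespace: network-observability  # illustrative namespace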

andresmareca-ibm commented 2 years ago

I just deployed the main version and it worked. Are you using any other version?

I pulled both repos yesterday at 7:30 UTC and ran the make commands on the main branch.

andresmareca-ibm commented 2 years ago

I've been digging into the default permissions that we grant to the eBPF agent. The netobserv-manager-role grants the security context constraints for the host network, but there isn't any RoleBinding that assigns the netobserv agent the permission to use security context constraints. For some reason, in our installations it's granted by default, but not in yours.

In order to reproduce the issue and verify that we provide a patch that will actually work, what version of OpenShift are you using?

I deployed the cluster using the IBM Cloud procedure. The version is 4.10.16_1521.

andresmareca-ibm commented 2 years ago

I'm going to create a new cluster and reapply the scripts in the following order:

  1. Clone https://github.com/netobserv/network-observability-operator and run make ocp-deploy
  2. Clone https://github.com/netobserv/netobserv-ebpf-agent and run make ocp-deploy

At the end I should have the same pods as you have in the picture above, right?

mariomac commented 2 years ago

@andresmareca-ibm you don't need to clone the netobserv-ebpf-agent repo, as the network-observability-operator directly refers to the latest image published on quay.

In the NOO repo, you should do:

make deploy ocp-deploy

Then you can deploy the example flowcollector:

oc apply -f config/samples/flows_v1alpha1_flowcollector.yaml

If you want the eBPF agent to be deployed, you should set the agent: ebpf property in the descriptor
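
For reference, a minimal sketch of such a descriptor. The API group and version are inferred from the sample file name, so double-check against the actual config/samples/flows_v1alpha1_flowcollector.yaml:

# Hedged sketch of the sample FlowCollector with the eBPF agent selected.
# Only the fields discussed in this thread are shown.
apiVersion: flows.netobserv.io/v1alpha1   # inferred from the sample file name
kind: FlowCollector
metadata:
  name: cluster                           # illustrative name
spec:
  agent: ebpf                             # deploy the eBPF agent instead of the default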

eranra commented 2 years ago

@mariomac BTW: I use make ocp-run, which does all of that including deployment of a sample workload etc. ... then I just change the CR to use eBPF ... I think this ends up with the same thing.

eranra commented 2 years ago

FYI: @ctrath ^^^

andresmareca-ibm commented 2 years ago

I'm getting this error during the eBPF container creation: Error: unknown capability "CAP_BPF" to add

mariomac commented 2 years ago

@andresmareca-ibm this means that the kernel does not support this capability. Out of curiosity, which Linux distribution and kernel version are you using?

Anyway, you can work around this issue by adding the privileged: true property to the ebpf: section of the FlowCollector: https://github.com/netobserv/network-observability-operator/blob/main/config/samples/flows_v1alpha1_flowcollector.yaml#L21
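
As a sketch, that workaround would end up looking like this in the FlowCollector spec (same caveats as above about exact field names):

# Hedged sketch: run the eBPF agent as privileged when CAP_BPF cannot be granted.
spec:
  agent: ebpf
  ebpf:
    privileged: true   # workaround for environments where CAP_BPF is not available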

andresmareca-ibm commented 2 years ago

@andresmareca-ibm this means that the kernel does not support this capability. Out of curiosity, which Linux distribution and kernel version are you using?

Anyway, you can work around this issue by adding the privileged: true property to the ebpf: section of the FlowCollector: https://github.com/netobserv/network-observability-operator/blob/main/config/samples/flows_v1alpha1_flowcollector.yaml#L21

We have already solved the error where the pods do not start: we had to create a service account with the necessary permissions, as shown below.

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: netobserv-ebpf-agent-test
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: netobserv-clusterrole
rules:
  - apiGroups: [""] # "" indicates the core API group
    resources:
      - nodes
      - nodes/proxy
      - services
      - endpoints
      - pods
    verbs: ["get", "watch", "list"]
  - apiGroups:
      - security.openshift.io
    resourceNames:
      - hostmount-anyuid
      - privileged
      - cgroup
    resources:
      - securitycontextconstraints
    verbs:
      - use
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: netobserv-rolebinding
subjects:
  - kind: ServiceAccount
    name: netobserv-ebpf-agent-test
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: netobserv-clusterrole

I'm closing this issue, but I will open another one because a different error has come up. Thank you very much!!