wazuh / wazuh-kubernetes

https://wazuh.com/
GNU General Public License v2.0

`add_kubernetes_metadata` causing frequent pod restarts in Wazuh manager #596

Closed davidcr01 closed 6 months ago

davidcr01 commented 6 months ago

Description

Related: https://github.com/wazuh/wazuh-docker/issues/1210

While testing the issue referenced above in Kubernetes (k8s), we experienced problems with the `add_kubernetes_metadata` processor, specifically affecting the Wazuh manager pods.

Upon investigation, it appears that the `add_kubernetes_metadata` processor triggers errors during deployment, leading to frequent pod restarts.

The error message we are encountering is as follows:

```
2024-02-22T15:00:09.770Z ERROR [kubernetes] kubernetes/util.go:117 kubernetes: Querying for pod failed with error: pods "wazuh-manager-master-0" is forbidden: User "system:serviceaccount:wazuh:default" cannot get resource "pods" in API group "" in the namespace "wazuh" {"libbeat.processor": "add_kubernetes_metadata"}
```
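The error is an RBAC denial: the pod's default service account cannot read the Pods API, which `add_kubernetes_metadata` queries. As a hypothetical workaround (not necessarily the fix adopted in this issue), a Role and RoleBinding granting that access in the `wazuh` namespace might look like:

```yaml
# Hypothetical sketch: grant the default service account in the "wazuh"
# namespace read access to pods, which add_kubernetes_metadata queries.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: filebeat-pod-reader
  namespace: wazuh
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: filebeat-pod-reader
  namespace: wazuh
subjects:
  - kind: ServiceAccount
    name: default
    namespace: wazuh
roleRef:
  kind: Role
  name: filebeat-pod-reader
  apiGroup: rbac.authorization.k8s.io
```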

It seems that the `add_kubernetes_metadata` processor is being enabled within Filebeat. We are unsure whether it was intentionally enabled by someone on the team or whether it is a default setting. Either way, its activation is causing significant disruption to our Kubernetes deployment, particularly the Wazuh manager pods.
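For reference, when the processor is enabled it typically appears in `filebeat.yml` roughly as below (a sketch of the stock Filebeat syntax, not necessarily the exact configuration shipped in the image); removing or commenting out the entry disables the Kubernetes metadata lookup:

```yaml
# Sketch of how add_kubernetes_metadata is enabled in filebeat.yml
# (standard Filebeat syntax; the config shipped in the image may differ).
processors:
  - add_kubernetes_metadata:
      in_cluster: true
```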

vcerenu commented 6 months ago

Several tests were carried out to investigate the problem seen when starting the Wazuh manager image in Kubernetes:

When inspecting the running pod, we found that the filebeat.yml file contained the default configuration shipped with the application rather than the one we add in the Dockerfile. Analyzing the error behind the restarts revealed an incompatibility between the `add_kubernetes_metadata` processor and Amazon Linux 2023, which caused the restarts. We kept investigating why the filebeat.yml file added in the Dockerfile was not taking effect, but the tests were also affected by a separate modulesd error that made the pod consume all the cluster resources, forcing us to redeploy from scratch every time it happened (https://github.com/wazuh/wazuh/issues/22141).

As part of the tests, the v4.8.0-beta1 tag branch was taken, which still had the Dockerfile based on the Ubuntu Jammy image; all the changes up to beta2 were applied on top and new images were generated. Deploying these images in Kubernetes gave satisfactory results. We then tried changing the base image of v4.8.0-beta2 to amazonlinux:2: the deployment showed the same problems with the filebeat.yml file, but the pod did not restart, since AL2 does not have this problem with the processor. After this, we attempted to mount the filebeat.yml file directly into the /etc/filebeat directory, but doing so produced errors very similar to those obtained when mounting the ossec.conf file directly into its corresponding directory.

For ossec.conf we already have a solution in Docker: the file is mounted in a temporary directory, and at container start it is copied into its final directory, so we avoid permission problems. I took the docker branch of the v4.8.0 tag and created a function that does the same for the Filebeat directory, so that a file mounted at container start is copied into /etc/filebeat. This function worked correctly, and by mounting a ConfigMap in Kubernetes that pointed to the temporary directory, the `add_kubernetes_metadata` error messages stopped, since the filebeat.yml file was now being applied correctly. However, Filebeat then failed to start, and we found that the wazuh-template.yml file could not be found either, which prevented the Wazuh template from being pushed. While that file was missing, the files inside the directory were exactly the ones left by the base Filebeat installation, which gave me the clue that a rollback, or something similar, was restoring the base files.
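The copy-at-startup approach described above can be sketched as a small shell function (the names `copy_filebeat_config` and the directory paths are illustrative assumptions; the actual entrypoint code in the image may differ):

```shell
#!/bin/sh
# Hypothetical sketch of the startup step: if a config was mounted into a
# temporary directory, copy it into the real Filebeat directory so the
# mounted file takes effect without mounting over /etc/filebeat itself.
copy_filebeat_config() {
  src="${1:-/tmp/filebeat_config}"   # where the ConfigMap is mounted
  dst="${2:-/etc/filebeat}"          # real Filebeat config directory
  if [ -d "$src" ]; then
    for f in "$src"/*; do
      [ -e "$f" ] || continue
      cp -f "$f" "$dst/"             # overwrite only the files provided
    done
  fi
}
```

Run before Filebeat launches, this replaces the packaged filebeat.yml with the mounted one while leaving the rest of the installed files (such as wazuh-template.yml) in place, which is why mounting into a temporary directory avoids the errors seen when mounting over /etc/filebeat directly.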

The permanent_data execution process was removed from the Dockerfile to test whether it could be interfering. When deploying images built with this change, the files in the /etc/filebeat directory were missing. I then tried removing the /etc/filebeat directory from the permanent_data.env file, with the same result, so I rebuilt the images without the step that deletes the /var/ossec/data_tmp directory, which is where the directories included in the permanent_data process are stored. On starting the pod, the preserved files dated from after the installation but before our modified parameter files were added. Checking the Dockerfile confirmed that the permanent_data process was indeed running before our parameter files were added, so the execution order was changed, new images were deployed, and we had a positive result.
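The fix described above amounts to reordering the Dockerfile so the customized configuration files are in place before the permanent_data step snapshots them. An illustrative fragment (file names and script paths are assumptions, not the exact Dockerfile contents):

```dockerfile
# Illustrative ordering only -- the real Dockerfile differs in detail.

# 1) Copy the customized configuration first ...
COPY config/filebeat.yml /etc/filebeat/filebeat.yml

# 2) ... then run the permanent_data snapshot, so the files it preserves
#    (and later restores at container start) already include the
#    modified parameter files instead of the post-install defaults.
RUN /permanent_data.sh
```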