Closed uselessidbr closed 4 years ago
I noticed it starts to happen as soon as Elasticsearch rolls out.
If I stop the ES pod, the fluentd containers start running again. Is it related to the ES memory limit?
It seems the fluentd image is broken. I downgraded to the image tagged v3.10 and the container is now running. I'm not sure which GitHub repository is responsible for this image.
As I have another setup that is running with no issues, I looked up the image digest there and pulled it onto my new system (where the pods are crashing). After that I changed the daemonset to the "v3.10" image tag so that I could delete the image tagged v3.11 and apply that tag to the one I had pulled.
After that I changed the daemonset back to the v3.11 tag and deleted the v3.10 image.
yum install -y podman
crictl pull openshift/origin-logging-fluentd@sha256:c26a9c143c76e32982c4acb520b9f468b303e5df4f0f8873364acae0f5083e0a
*** Now change the daemonset to use the image with the "v3.10" tag and wait for all pods to terminate, so that you can delete the v3.11 image once it is no longer in use
[root@master1 ~]# crictl images | grep fluentd
docker.io/openshift/origin-logging-fluentd v3.11 33b86066482f9 480MB
[root@master1 ~]# crictl rmi 33b86066482f9
podman tag 33b86066482f9 docker.io/openshift/origin-logging-fluentd:v3.11
*** Now change the daemonset back to the image with the "v3.11" tag and wait for all pods to terminate, so that you can delete the "v3.10" image once it is no longer in use
[root@master1 ~]# crictl images | grep fluentd
docker.io/openshift/origin-logging-fluentd v3.10 33b86066482f9 480MB
[root@master1 ~]# crictl rmi 33b86066482f9
Finally it's working :)
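The two "change the daemonset" steps above could be scripted roughly as follows. This is only a sketch: the namespace and daemonset name are taken from the `oc set image` command used elsewhere in this thread, and the `run` and `switch_tag` helpers are made up for illustration. By default it only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Sketch only: prints the commands unless DRY_RUN=0 is set.
NS=openshift-logging    # namespace, as used elsewhere in this thread
DS=logging-fluentd      # daemonset name, as used elsewhere in this thread
REPO=docker.io/openshift/origin-logging-fluentd

run() {
    # Hypothetical helper: echo the command in dry-run mode, otherwise execute it.
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

switch_tag() {
    # Hypothetical helper: point every container of the daemonset at REPO:<tag>.
    run oc -n "$NS" set image daemonset "$DS" "*=$REPO:$1"
}

switch_tag v3.10   # move the pods off the v3.11 tag before removing that image
switch_tag v3.11   # move back once the local v3.11 tag points at the good image
```

Between the two calls you still need to wait for the pods to terminate and do the `crictl rmi` / `podman tag` steps by hand on each node.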
I encountered this error when I restarted fluentd on OKD 3.11 and a new image was pulled (tag v3.11 was recently updated).
I was able to work around it by adding METRICS_CERT and METRICS_KEY to the DaemonSet definition, either after deployment, or beforehand in the template at roles/openshift_logging_fluentd/templates/fluentd.j2
It seems that the following change is involved: https://github.com/openshift/origin-aggregated-logging/pull/1565
This is only a workaround though. Someone involved in the change above should be able to give more insight for a proper fix.
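For the "after deployment" variant, something like the following might do it with `oc set env`. The values for the two variables are not given in this thread, so the paths below are placeholders, and `metrics_env_cmd` is a made-up helper that only prints the command for review rather than running it.

```shell
NS=openshift-logging    # namespace, as used elsewhere in this thread
DS=logging-fluentd      # daemonset name, as used elsewhere in this thread

metrics_env_cmd() {
    # $1 = certificate path, $2 = key path.
    # Both are hypothetical: the actual values are not stated in this thread.
    echo "oc -n $NS set env daemonset/$DS METRICS_CERT=$1 METRICS_KEY=$2"
}

# Print the command; replace the placeholder paths before running it.
metrics_env_cmd /path/to/cert /path/to/key
```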
It could be related, but this pull request is from 10 months ago and the image was working at least until 21 Jan. At some point after that the image broke.
I see they merged it about 10 months ago, so I don't know whether it is the cause.
@uselessidbr Thank you for the information. I was able to run openshift-logging-fluentd with the following commands on the master.
oc login -u system:admin
oc -n openshift-logging set image daemonset logging-fluentd *=openshift/origin-logging-fluentd@sha256:c26a9c143c76e32982c4acb520b9f468b303e5df4f0f8873364acae0f5083e0a
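Since the digest is long and easy to mangle when copying, a small check before pinning might help. This is a sketch under assumptions: `is_sha256_digest` and `pin_cmd` are made-up helper names, and the script only prints the `oc set image` command shown above instead of running it.

```shell
is_sha256_digest() {
    # Succeeds only for "sha256:" followed by exactly 64 lowercase hex characters.
    printf '%s\n' "$1" | grep -Eq '^sha256:[0-9a-f]{64}$'
}

pin_cmd() {
    # $1 = repository, $2 = digest.
    # Prints the pin command, or fails if the digest is malformed.
    is_sha256_digest "$2" || { echo "malformed digest: $2" >&2; return 1; }
    echo "oc -n openshift-logging set image daemonset logging-fluentd *=$1@$2"
}

pin_cmd openshift/origin-logging-fluentd \
    sha256:c26a9c143c76e32982c4acb520b9f468b303e5df4f0f8873364acae0f5083e0a
```

Pinning by digest rather than by tag means a later update of the v3.11 tag will not change what the daemonset runs.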
I have the same problem (OKD 3.11); only the 3.10 image works.
We are investigating but you can always work around the issue by building the image yourself from:
https://github.com/openshift/origin-aggregated-logging/blob/release-3.11/fluentd/Dockerfile.centos7
$fluentdir>docker build -t openshift/logging-fluentd:v3.11 -f Dockerfile.centos7 .
The latest image had the same digest as the 3.11 and 4.x tags. Just make sure they are decoupled, to guarantee that 3.11 is not updated when latest is updated.
oc -n openshift-logging set image daemonset logging-fluentd *=openshift/origin-logging-fluentd@sha256:c26a9c143c76e32982c4acb520b9f468b303e5df4f0f8873364acae0f5083e0a
It worked for me as well. Thank you.
Does the problem persist in new deployments? The v3.11 image digest is still the same as v4.0 and latest.
Description
Provide a brief description of your issue here. For example:
Installing OKD v3.11 with ELK, the fluentd containers keep crashing with state "CrashLoopBackOff".
I tried to install the cluster using the versions in requirements.txt for pip, but it threw an error saying "_es_node is undefined" and stopped the playbook at the openshift-logging install stage, although it didn't show any failed item.
After downgrading ansible (via pip) to 2.8.1 (also tried 2.6, 2.6.2, 2.6.4, 2.8.4), the fluentd pods can't start.
I tried several different commits, since I was able to install successfully 4 days ago, and switched to a commit from Jan 16. No luck.
Version
Please put the following version information in the code block indicated below.
ansible --version
ansible 2.8.1
  config file = /home/ansible/openshift-ansible/ansible.cfg
  configured module search path = [u'/home/ansible/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /home/ansible/.local/lib/python2.7/site-packages/ansible
  executable location = /home/ansible/.local/bin/ansible
  python version = 2.7.5 (default, Aug 7 2019, 00:51:29) [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)]
If you're operating from a git clone:
git describe
If you're running from playbooks installed via RPM
rpm -q openshift-ansible
Place the output between the code block below:
Steps To Reproduce
Expected Results
Fluentd containers to be running
Observed Results
Describe what is actually happening.
For some reason, fluentd is in the CrashLoopBackOff state because it can't find /etc/fluent/metrics/tls.crt
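To confirm that a given pod is failing on this file, one option is to search its log for the path. A minimal sketch: `mentions_path` is a made-up helper, the pod name is a placeholder, and the namespace is the one used elsewhere in this thread.

```shell
mentions_path() {
    # Reads log text on stdin; succeeds if any line contains the literal path $1.
    grep -F -q -- "$1"
}

# Intended use against a crashing pod (pod name is a placeholder):
#   oc -n openshift-logging get pods -o wide
#   oc -n openshift-logging logs <fluentd-pod> \
#       | mentions_path /etc/fluent/metrics/tls.crt \
#       && echo "metrics certificate path appears in the log"
```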
For long output or logs, consider using a gist
Additional Information
Provide any additional information which may help us diagnose the issue.
$ cat /etc/redhat-release