openshift / openshift-ansible

Install and config an OpenShift 3.x cluster
https://try.openshift.com
Apache License 2.0

Fluentd Crash Loop Back-Off state - No such file or directory /etc/fluent/metrics/tls.crt #12089

Closed · uselessidbr closed this issue 4 years ago

uselessidbr commented 4 years ago

Description


When installing OKD v3.11 with the ELK logging stack, the fluentd containers keep crashing in the "CrashLoopBackOff" state.

I tried to install the cluster with the pip versions pinned in requirements.txt, but it threw an error saying "_es_node is undefined" and stopped the playbook at the openshift-logging install stage, although it didn't show any failed item.

After downgrading ansible (via pip) to 2.8.1 (I also tried 2.6, 2.6.2, 2.6.4 and 2.8.4), the fluentd pods can't start.
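
For anyone reproducing the downgrade, the pip step looks roughly like this (a sketch; the --user flag is an assumption based on the ~/.local paths in the Version section below):

pip install --user 'ansible==2.8.1'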

I've tried a few different commits, since I was able to install successfully 4 days ago, so I switched to a commit from Jan 16. No luck.

Version


ansible 2.8.1
  config file = /home/ansible/openshift-ansible/ansible.cfg
  configured module search path = [u'/home/ansible/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /home/ansible/.local/lib/python2.7/site-packages/ansible
  executable location = /home/ansible/.local/bin/ansible
  python version = 2.7.5 (default, Aug 7 2019, 00:51:29) [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)]

openshift-ansible-3.11.165-1
Steps To Reproduce
  1. Clone the git repository from the release-3.11 branch
  2. Change the ansible version pinned in requirements.txt
  3. Run the prerequisites playbook
  4. Run the deploy_cluster playbook (see the command sketch below)
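
For reference, steps 3 and 4 correspond to the usual openshift-ansible playbook invocations; a sketch, where the inventory path is an assumption:

ansible-playbook -i inventory/hosts playbooks/prerequisites.yml
ansible-playbook -i inventory/hosts playbooks/deploy_cluster.yml
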
Expected Results

Fluentd containers to be running

[root@master2 ~]# oc get pods -n openshift-logging

NAME                                      READY     STATUS             RESTARTS   AGE
logging-es-data-master-x4gi8bkf-1-69f6c   2/2       Running            0          21m
logging-fluentd-26fwc                     0/1       CrashLoopBackOff   8          21m
logging-fluentd-9qb28                     0/1       CrashLoopBackOff   8          21m
logging-fluentd-bnfzv                     0/1       CrashLoopBackOff   8          21m
logging-fluentd-d6vbl                     0/1       CrashLoopBackOff   8          21m
logging-fluentd-kw8bp                     0/1       CrashLoopBackOff   8          21m
logging-fluentd-xbshv                     0/1       CrashLoopBackOff   8          21m
logging-fluentd-zcrgz                     0/1       CrashLoopBackOff   8          21m
logging-kibana-1-j9mm7                    2/2       Running            0          22m
Observed Results

For some reason, fluentd is stuck in the CrashLoopBackOff state because it can't find /etc/fluent/metrics/tls.crt.

[root@master2 ~]# cat /var/log/fluentd/fluentd.log

2020-01-25 16:24:03 -0300 [error]: unexpected error error_class=Errno::ENOENT error="No such file or directory @ rb_sysopen - /etc/fluent/metrics/tls.crt"
2020-01-25 16:24:03 -0300 [error]: suppressed same stacktrace
2020-01-25 16:24:03 -0300 [error]: fluent/log.rb:362:error: unexpected error error_class=Errno::ENOENT error="No such file or directory @ rb_sysopen - /etc/fluent/metrics/tls.crt"
2020-01-25 16:24:03 -0300 [error]: fluent/supervisor.rb:729:main_process: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluent-plugin-prometheus-1.3.0/lib/fluent/plugin/in_prometheus.rb:68:in `read'
2020-01-25 16:24:03 -0300 [error]: fluent/supervisor.rb:729:main_process: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluent-plugin-prometheus-1.3.0/lib/fluent/plugin/in_prometheus.rb:68:in `start'
2020-01-25 16:24:03 -0300 [error]: fluent/supervisor.rb:729:main_process: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.4.1/lib/fluent/root_agent.rb:203:in `block in start'
2020-01-25 16:24:03 -0300 [error]: fluent/supervisor.rb:729:main_process: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.4.1/lib/fluent/root_agent.rb:192:in `block (2 levels) in lifecycle'
2020-01-25 16:24:03 -0300 [error]: fluent/supervisor.rb:729:main_process: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.4.1/lib/fluent/root_agent.rb:191:in `each'
2020-01-25 16:24:03 -0300 [error]: fluent/supervisor.rb:729:main_process: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.4.1/lib/fluent/root_agent.rb:191:in `block in lifecycle'
2020-01-25 16:24:03 -0300 [error]: fluent/supervisor.rb:729:main_process: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.4.1/lib/fluent/root_agent.rb:178:in `each'
2020-01-25 16:24:03 -0300 [error]: fluent/supervisor.rb:729:main_process: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.4.1/lib/fluent/root_agent.rb:178:in `lifecycle'
2020-01-25 16:24:03 -0300 [error]: fluent/supervisor.rb:729:main_process: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.4.1/lib/fluent/root_agent.rb:202:in `start'
2020-01-25 16:24:03 -0300 [error]: fluent/supervisor.rb:729:main_process: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.4.1/lib/fluent/engine.rb:274:in `start'
2020-01-25 16:24:03 -0300 [error]: fluent/supervisor.rb:729:main_process: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.4.1/lib/fluent/engine.rb:219:in `run'
2020-01-25 16:24:03 -0300 [error]: fluent/supervisor.rb:729:main_process: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.4.1/lib/fluent/supervisor.rb:805:in `run_engine'
2020-01-25 16:24:03 -0300 [error]: fluent/supervisor.rb:729:main_process: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.4.1/lib/fluent/supervisor.rb:549:in `block in run_worker'
2020-01-25 16:24:03 -0300 [error]: fluent/supervisor.rb:729:main_process: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.4.1/lib/fluent/supervisor.rb:730:in `main_process'
2020-01-25 16:24:03 -0300 [error]: fluent/supervisor.rb:729:main_process: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.4.1/lib/fluent/supervisor.rb:544:in `run_worker'
2020-01-25 16:24:03 -0300 [error]: fluent/supervisor.rb:729:main_process: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.4.1/lib/fluent/command/fluentd.rb:316:in `<top (required)>'
2020-01-25 16:24:03 -0300 [error]: fluent/supervisor.rb:729:main_process: /opt/rh/rh-ruby25/root/usr/share/rubygems/rubygems/core_ext/kernel_require.rb:59:in `require'
2020-01-25 16:24:03 -0300 [error]: fluent/supervisor.rb:729:main_process: /opt/rh/rh-ruby25/root/usr/share/rubygems/rubygems/core_ext/kernel_require.rb:59:in `require'
2020-01-25 16:24:03 -0300 [error]: fluent/supervisor.rb:729:main_process: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.4.1/bin/fluentd:8:in `<top (required)>'
2020-01-25 16:24:03 -0300 [error]: fluent/supervisor.rb:729:main_process: /opt/rh/rh-ruby25/root/usr/local/bin/fluentd:23:in `load'
2020-01-25 16:24:03 -0300 [error]: fluent/supervisor.rb:729:main_process: /opt/rh/rh-ruby25/root/usr/local/bin/fluentd:23:in `
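
A quick way to check whether the metrics certificate paths are wired into the daemonset is to list its environment (a diagnostic sketch, not part of the original report; it assumes the default daemonset name logging-fluentd used elsewhere in this thread):

oc -n openshift-logging set env daemonset/logging-fluentd --list | grep METRICS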


Additional Information


[root@master2 ~]# cat /etc/redhat-release
CentOS Linux release 7.7.1908 (Core)

[OSEv3:vars]
ansible_ssh_user=ansible
ansible_become=true
openshift_deployment_type=origin
openshift_release=v3.11

openshift_master_cluster_method=native

openshift_portal_net=10.1.128.0/18
osm_cluster_network_cidr=10.1.0.0/17
osm_host_subnet_length=9

openshift_use_calico=True
openshift_use_openshift_sdn=False
os_sdn_network_plugin_name='cni'

openshift_console_install=true
openshift_console_hostname=console.openshift.local

openshift_master_cluster_hostname=okd-int.openshift.local

openshift_master_cluster_public_hostname=okd.openshift.local

openshift_master_default_subdomain=apps.openshift.local

openshift_disable_check=disk_availability,docker_storage,memory_availability,docker_image_availability

openshift_use_crio=True
openshift_use_crio_only=False
openshift_crio_enable_docker_gc=True

openshift_docker_options=--bip 10.0.0.1/24 --log-opt  max-size=100M --log-opt max-file=3 --insecure-registry 10.1.128.0/17 --insecure-registry 10.0.0.0/24 --log-driver=json-file

openshift_master_identity_providers=[{'name': 'Local Authentication', 'login': 'true', 'challenge': 'true', 'kind': 'HTPasswdPasswordIdentityProvider'}]

openshift_master_htpasswd_users={'admin': '$apr1$4qJTE7h9$j9zVzh43pFMjaCa/wuVlY.', 'developer': '$apr1$VtHG.FnT$b0XJ3355yxtDzqtiwb7Ag/' }

os_firewall_use_firewalld=true

openshift_hosted_registry_cert_expire_days=3650
openshift_ca_cert_expire_days=5475
openshift_node_cert_expire_days=3650
openshift_master_cert_expire_days=3650
etcd_ca_default_days=5475

openshift_master_dynamic_provisioning_enabled=true

openshift_enable_unsupported_configurations=True

openshift_hosted_registry_storage_kind=nfs
openshift_hosted_registry_storage_access_modes=['ReadWriteMany']
openshift_hosted_registry_storage_host=nfs.openshift.local
openshift_hosted_registry_storage_nfs_directory=/exports
openshift_hosted_registry_storage_nfs_options='*(rw,root_squash)'
openshift_hosted_registry_storage_volume_name=registry
openshift_hosted_registry_storage_volume_size=40Gi

openshift_metrics_install_metrics=true
openshift_metrics_hawkular_nodeselector={"node-role.kubernetes.io/infra": "true"}
openshift_metrics_cassandra_nodeselector={"node-role.kubernetes.io/infra": "true"}
openshift_metrics_heapster_nodeselector={"node-role.kubernetes.io/infra": "true"}
openshift_metrics_storage_kind=nfs
openshift_metrics_storage_access_modes=['ReadWriteOnce']
openshift_metrics_storage_host=nfs.openshift.local
openshift_metrics_storage_nfs_directory=/exports
openshift_metrics_storage_volume_name=metrics
openshift_metrics_storage_volume_size=20Gi

openshift_logging_install_logging=true
openshift_logging_storage_kind=nfs
openshift_logging_kibana_nodeselector={"node-role.kubernetes.io/infra": "true"}
openshift_logging_curator_nodeselector={"node-role.kubernetes.io/infra": "true"}
openshift_logging_es_nodeselector={"node-role.kubernetes.io/infra": "true"}
openshift_logging_storage_access_modes=['ReadWriteOnce']
openshift_logging_storage_nfs_options='*(rw,root_squash)'
openshift_logging_storage_host=nfs.openshift.local
openshift_logging_storage_nfs_directory=/exports
openshift_logging_storage_volume_name=logging
openshift_logging_storage_volume_size=15Gi
openshift_logging_elasticsearch_storage_type=pvc
openshift_logging_es_pvc_size=15Gi
openshift_logging_es_pvc_storage_class_name=''
openshift_logging_es_pvc_dynamic=true
openshift_logging_es_pvc_prefix=logging
openshift_logging_es_memory_limit=2Gi

openshift_node_groups=[{'name': 'node-config-master-crio', 'labels': ['node-role.kubernetes.io/master=true', 'runtime=cri-o'], 'edits': [{ 'key': 'kubeletArguments.container-runtime','value': ['remote'] }, { 'key': 'kubeletArguments.container-runtime-endpoint','value': ['/var/run/crio/crio.sock'] }, { 'key': 'kubeletArguments.image-service-endpoint','value': ['/var/run/crio/crio.sock'] }, { 'key': 'kubeletArguments.runtime-request-timeout','value': ['10m'] }]},{'name': 'node-config-infra-crio', 'labels': ['node-role.kubernetes.io/infra=true', 'runtime=cri-o'], 'edits': [{ 'key': 'kubeletArguments.container-runtime','value': ['remote'] }, { 'key': 'kubeletArguments.container-runtime-endpoint','value': ['/var/run/crio/crio.sock'] }, { 'key': 'kubeletArguments.image-service-endpoint','value': ['/var/run/crio/crio.sock'] }, { 'key': 'kubeletArguments.runtime-request-timeout','value': ['10m'] }]},{'name': 'node-config-compute-crio', 'labels': ['node-role.kubernetes.io/compute=true', 'runtime=cri-o'], 'edits': [{ 'key': 'kubeletArguments.container-runtime','value': ['remote'] }, { 'key': 'kubeletArguments.container-runtime-endpoint','value': ['/var/run/crio/crio.sock'] }, { 'key': 'kubeletArguments.image-service-endpoint','value': ['/var/run/crio/crio.sock'] }, { 'key': 'kubeletArguments.runtime-request-timeout','value': ['10m'] }]},{'name': 'node-config-master-infra-crio', 'labels': ['node-role.kubernetes.io/infra=true', 'node-role.kubernetes.io/master=true', 'node-role.kubernetes.io/compute=true', 'runtime=cri-o'], 'edits': [{ 'key': 'kubeletArguments.container-runtime','value': ['remote'] }, { 'key': 'kubeletArguments.container-runtime-endpoint','value': ['/var/run/crio/crio.sock'] }, { 'key': 'kubeletArguments.image-service-endpoint','value': ['/var/run/crio/crio.sock'] }, { 'key': 'kubeletArguments.runtime-request-timeout','value': ['10m'] }]},{'name': 'node-config-all-in-one-crio', 'labels': ['node-role.kubernetes.io/infra=true', 'node-role.kubernetes.io/compute=true' ,'node-role.kubernetes.io/master=true', 'node-role.kubernetes.io/compute=true', 'runtime=cri-o'], 'edits': [{ 'key': 'kubeletArguments.container-runtime','value': ['remote'] }, { 'key': 'kubeletArguments.container-runtime-endpoint','value': ['/var/run/crio/crio.sock'] }, { 'key': 'kubeletArguments.image-service-endpoint','value': ['/var/run/crio/crio.sock'] }, { 'key': 'kubeletArguments.runtime-request-timeout','value': ['10m'] }]},{'name': 'node-config-all-in-one', 'labels': ['node-role.kubernetes.io/master=true', 'node-role.kubernetes.io/infra=true', 'node-role.kubernetes.io/compute=true']}, {'name': 'node-config-master-infra', 'labels': ['node-role.kubernetes.io/master=true', 'node-role.kubernetes.io/infra=true']}, {'name': 'node-config-compute', 'labels': ['node-role.kubernetes.io/compute=true']}, {'name': 'node-config-master', 'labels': ['node-role.kubernetes.io/master=true']}, {'name': 'node-config-infra', 'labels': ['node-role.kubernetes.io/infra=true']}, {'name': 'node-config-compute-crio-prod', 'labels': ['node-role.kubernetes.io/compute=true', 'runtime=cri-o', 'node-role.kubernetes.io/environment=prod'], 'edits': [{ 'key': 'kubeletArguments.container-runtime','value': ['remote'] }, { 'key': 'kubeletArguments.container-runtime-endpoint','value': ['/var/run/crio/crio.sock'] }, { 'key': 'kubeletArguments.image-service-endpoint','value': ['/var/run/crio/crio.sock'] }, { 'key': 'kubeletArguments.runtime-request-timeout','value': ['10m'] }]}, {'name': 'node-config-compute-crio-stage', 'labels': 
['node-role.kubernetes.io/compute=true', 'runtime=cri-o', 'node-role.kubernetes.io/environment=stage', 'node-role.kubernetes.io/build=true'], 'edits': [{ 'key': 'kubeletArguments.container-runtime','value': ['remote'] }, { 'key': 'kubeletArguments.container-runtime-endpoint','value': ['/var/run/crio/crio.sock'] }, { 'key': 'kubeletArguments.image-service-endpoint','value': ['/var/run/crio/crio.sock'] }, { 'key': 'kubeletArguments.runtime-request-timeout','value': ['10m'] }]}]

osm_default_node_selector='node-role.kubernetes.io/environment=stage'
openshift_builddefaults_nodeselectors={'node-role.kubernetes.io/build': 'true'}

[nfs]
nfs.openshift.local

[etcd]
master1.openshift.local
master2.openshift.local
master3.openshift.local

[masters]
master1.openshift.local
master2.openshift.local
master3.openshift.local

[lb]
lb.openshift.local

[nodes]
master[1:3].openshift.local openshift_node_group_name='node-config-master-infra-crio'
node[1:2].openshift.local openshift_node_group_name='node-config-compute-crio-prod'
node[3:4].openshift.local openshift_node_group_name='node-config-compute-crio-stage'
uselessidbr commented 4 years ago

I noticed it starts to happen as soon as Elasticsearch rolls out.

If I stop the ES pod, the fluentd containers start running again. Is it something related to the ES memory limit?

uselessidbr commented 4 years ago

It seems the fluentd image is broken. I downgraded it to the v3.10 image tag and the container is now running. I'm not sure which GitHub repository is responsible for this image.
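
One way to pin the daemonset to the older tag (a sketch; it assumes the image name shown by crictl in the next comment, and mirrors the oc set image commands used later in this thread):

oc -n openshift-logging set image daemonset/logging-fluentd '*=docker.io/openshift/origin-logging-fluentd:v3.10'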

uselessidbr commented 4 years ago

As I have another setup that is running with no issues, I found the working image digest there and pulled it on my new system (where the pods are crashing). After that I changed the daemonset to the "v3.10" image tag so I could delete the image tagged v3.11 and reuse that tag for the digest I pulled.

After that I changed the daemonset back to the v3.11 tag and deleted the v3.10 image.

Install podman, as I'm using CRI-O and need it to re-tag the image:

yum install -y podman

Pull the working image:

crictl pull openshift/origin-logging-fluentd@sha256:c26a9c143c76e32982c4acb520b9f468b303e5df4f0f8873364acae0f5083e0a

*** Now change the daemonset to use the image with the "v3.10" tag and wait for all pods to terminate, so that the v3.11-tagged image is no longer in use and can be deleted.

Delete the image that is not working properly:

[root@master1 ~]# crictl images | grep fluentd
docker.io/openshift/origin-logging-fluentd   v3.11   33b86066482f9   480MB
[root@master1 ~]# crictl rmi 33b86066482f9

Re-tag the image, since it wasn't downloaded with a valid tag (crictl doesn't support tag+digest in this version):

podman tag 33b86066482f9 docker.io/openshift/origin-logging-fluentd:v3.11

*** Now change the daemonset back to the image with the "v3.11" tag and wait for all pods to terminate, so that the "v3.10"-tagged image is no longer in use and can be deleted.

[root@master1 ~]# crictl images | grep fluentd
docker.io/openshift/origin-logging-fluentd   v3.10   33b86066482f9   480MB
[root@master1 ~]# crictl rmi 33b86066482f9

Finally it's working :)

kobusvdm commented 4 years ago

I encountered this error when I restarted fluentd on OKD 3.11 and a new image was pulled (tag v3.11 was recently updated).

I was able to work around it by adding METRICS_CERT and METRICS_KEY to the DaemonSet definition, either after deployment or beforehand in the template at roles/openshift_logging_fluentd/templates/fluentd.j2.
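
For example, after deployment something along these lines (a sketch; the certificate and key paths are assumptions, since the thread doesn't state which values were used; they should point at files that are actually mounted in the fluentd pod):

oc -n openshift-logging set env daemonset/logging-fluentd METRICS_CERT=/etc/fluent/keys/cert METRICS_KEY=/etc/fluent/keys/key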

It seems that the following change is involved: https://github.com/openshift/origin-aggregated-logging/pull/1565

This is only a workaround though. Someone involved in the change above should be able to give more insight for a proper fix.

uselessidbr commented 4 years ago

> I encountered this error when I restarted fluentd on OKD 3.11 and a new image was pulled (tag v3.11 was recently updated).
>
> I was able to work around it by adding METRICS_CERT and METRICS_KEY to the DaemonSet definition, either after deployment or beforehand in the template at roles/openshift_logging_fluentd/templates/fluentd.j2.
>
> It seems that the following change is involved: openshift/origin-aggregated-logging#1565
>
> This is only a workaround though. Someone involved in the change above should be able to give more insight for a proper fix.

It could be related, but that pull request is from 10 months ago and the image was working at least until Jan 21. At some point after that the image broke.

I see they merged it about 10 months ago; I don't know whether it could be the cause.

hirsaeki-mki commented 4 years ago

@uselessidbr Thank you for your information. I was able to run openshift-logging-fluentd with the following commands on a master:

oc login -u system:admin
oc -n openshift-logging set image daemonset logging-fluentd *=openshift/origin-logging-fluentd@sha256:c26a9c143c76e32982c4acb520b9f468b303e5df4f0f8873364acae0f5083e0a
gialloguitar commented 4 years ago

I have the same trouble (OKD 3.11); only the 3.10 image works successfully.

jcantrill commented 4 years ago

We are investigating but you can always work around the issue by building the image yourself from:

https://github.com/openshift/origin-aggregated-logging/blob/release-3.11/fluentd/Dockerfile.centos7

$fluentdir>docker build -t openshift/logging-fluentd:v3.11 -f Dockerfile.centos7 .
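
Since the daemonset references docker.io/openshift/origin-logging-fluentd:v3.11 (see the crictl output above), the locally built image would still need to be tagged accordingly on each node so it gets picked up; a sketch:

docker tag openshift/logging-fluentd:v3.11 docker.io/openshift/origin-logging-fluentd:v3.11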

jcantrill commented 4 years ago

Ref https://github.com/openshift/origin-aggregated-logging/issues/1823

uselessidbr commented 4 years ago

The latest image had the same digest as the v3.11 and 4.x tags; just make sure they are decoupled, to guarantee that v3.11 will not be updated when latest is updated.
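
To verify whether the tags have been decoupled, their digests can be compared (a sketch; assumes skopeo is available):

skopeo inspect docker://docker.io/openshift/origin-logging-fluentd:v3.11 | grep -i digest
skopeo inspect docker://docker.io/openshift/origin-logging-fluentd:latest | grep -i digest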

arunabhabanerjee commented 4 years ago

oc -n openshift-logging set image daemonset logging-fluentd *=openshift/origin-logging-fluentd@sha256:c26a9c143c76e32982c4acb520b9f468b303e5df4f0f8873364acae0f5083e0a

It worked for me as well. Thank you.

uselessidbr commented 4 years ago

Does the problem persist in new deployments? The v3.11 image digest is still the same as v4.0 and latest.
