@drbwa we see in Kibana pod logs (openshift-logging/kibana-79884d6c4f-9lsqs):
cc @HumairAK @4n4nd @anishasthana
@eranra you're right. I think the operator did something and rolled out a new deployment of the Elasticsearch. There's currently no service called `elasticsearch` in the namespace (seemed to be a glitch on my side, I saw no services for a bit, now they are present), hence the routing is broken...
The `elasticsearch-cdm-293ha663-1-587675b879-d9rk8` pod's `proxy` container exploded with:
time="2021-02-09T16:04:23Z" level=info msg="Handling request \"authorization\""
and one of the 9 `fluentd` pods is in a failed state due to:
/opt/rh/rh-ruby25/root/usr/share/ruby/net/protocol.rb:44:in `connect_nonblock': SSL_connect returned=1 errno=0 state=error: certificate verify failed (unable to get local issuer certificate) (OpenSSL::SSL::SSLError)
from /opt/rh/rh-ruby25/root/usr/share/ruby/net/protocol.rb:44:in `ssl_socket_connect'
from /opt/rh/rh-ruby25/root/usr/share/ruby/net/http.rb:985:in `connect'
from /opt/rh/rh-ruby25/root/usr/share/ruby/net/http.rb:920:in `do_start'
from /opt/rh/rh-ruby25/root/usr/share/ruby/net/http.rb:909:in `start'
from /opt/rh/rh-ruby25/root/usr/share/ruby/net/http.rb:609:in `start'
from wait_for_es_version.rb:26:in `<main>'
The other `fluentd` pods are healthy.
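For reference, a rough way to check whether the failing pod's CA actually matches what ES is serving. This is a sketch only: the `fluentd` and `elasticsearch` secret names and the `ca-bundle.crt`/`admin-ca` keys are assumptions based on a default CLO install, so adjust to whatever `oc describe secret` shows.

```bash
# Compare the CA bundled into the collector secret with the ES admin CA.
# Secret and key names are assumptions (default CLO install).
oc -n openshift-logging get secret fluentd -o jsonpath='{.data.ca-bundle\.crt}' \
  | base64 -d | openssl x509 -noout -subject -enddate

oc -n openshift-logging get secret elasticsearch -o jsonpath='{.data.admin-ca}' \
  | base64 -d | openssl x509 -noout -subject -enddate

# If the subjects/expiry differ, the failing fluentd pod is probably still
# mounting stale certs and needs a restart once the secrets are regenerated.
```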
The operator conditions:
This sounds to me like there was a new rollout?
> sounds to me like there was a new rollout?
yes, there was an automated one.
Action item: change subscription to be manual
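A sketch of that action item, assuming the usual `cluster-logging` and `elasticsearch-operator` Subscription names and namespaces; worth listing the subscriptions first to confirm what this cluster actually has:

```bash
# Switch the operator subscriptions from Automatic to Manual install-plan
# approval so the operators can no longer roll themselves out unannounced.
oc -n openshift-logging get subscriptions
oc -n openshift-operators-redhat get subscriptions

oc -n openshift-logging patch subscription cluster-logging \
  --type merge -p '{"spec":{"installPlanApproval":"Manual"}}'
oc -n openshift-operators-redhat patch subscription elasticsearch-operator \
  --type merge -p '{"spec":{"installPlanApproval":"Manual"}}'
```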
Seems like the cluster logging stack has some networking issues. The `fluentd` and `elasticsearch` proxy errors both point to a network-related issue.
openshift-monitoring Grafana doesn't hint at anything unusual; cluster networking is not affected.
ES proxy TLS handshake error, for the record:
2021/02/09 16:19:49 http: TLS handshake error from 10.130.3.10:60638: tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "openshift-cluster-logging-signer")
Let's try this? https://access.redhat.com/solutions/5347071
I don't understand the cause though... yeah, why not restart it. Go for it! :+1:
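I'm not sure exactly what the linked solution prescribes, but as a rough sketch, "restart it" for the logging stack could look like the following; the pod name is taken from earlier in this thread and the `component=fluentd` label is an assumption, so verify with `oc get pods --show-labels` first:

```bash
# Bounce the affected logging pods so they remount freshly generated certs.
oc -n openshift-logging delete pod elasticsearch-cdm-293ha663-1-587675b879-d9rk8
oc -n openshift-logging delete pod -l component=fluentd   # or just the failing fluentd-* pod
oc -n openshift-logging rollout restart deployment/kibana
```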
Kibana is up, in yellow state
fluentd-k8dnx is still crashing
Kibana's new state:
fluentd-qd92x is now failing with the following logs:
2021-02-09 16:41:52 +0000 [warn]: suppressed same stacktrace
2021-02-09 16:41:52 +0000 [warn]: [retry_clo_default_output_es] failed to flush the buffer. retry_time=5 next_retry_seconds=2021-02-09 16:42:07 +0000 chunk="5ba7a892cdd0eae1a2b510a907d64b27" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch.openshift-logging.svc.cluster.local\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"}): [500] {\"code\":500,\"message\":\"Internal Error\",\"error\":{}}\n"
2021-02-09 16:41:52 +0000 [warn]: suppressed same stacktrace
Looks like the CLO status shows all fluentd pods in ready state:
`oc get clusterlogging instance`:
status:
  collection:
    logs:
      fluentdStatus:
        daemonSet: fluentd
        nodes:
          fluentd-8rd6q: os-mgr-0
          fluentd-9j2l6: os-mgr-1
          fluentd-294tl: os-mgr-2
          fluentd-k8dnx: os-wrk-2
          fluentd-qd92x: os-sto-1
          fluentd-qhrpw: os-sto-0
          fluentd-szc5q: os-wrk-1
          fluentd-tz8vf: os-sto-2
          fluentd-wcjdv: os-wrk-0
        pods:
          failed: []
          notReady: []
          ready:
          - fluentd-294tl
          - fluentd-8rd6q
          - fluentd-9j2l6
          - fluentd-k8dnx
          - fluentd-qd92x
          - fluentd-qhrpw
          - fluentd-szc5q
          - fluentd-tz8vf
          - fluentd-wcjdv
Restarted Kibana; it is once again in yellow state and `Setting up index template`.
Kibana's still yellow. Maybe we should check on the PVC, whether it's full. I have no time for it anymore today though.
Doesn't look full:
$ oc rsh elasticsearch-cdm-293ha663-1-6dc9cdb878-fcwwj
sh-4.2$ df -h
Filesystem Size Used Avail Use% Mounted on
overlay 372G 148G 224G 40% /
tmpfs 64M 0 64M 0% /dev
tmpfs 63G 0 63G 0% /sys/fs/cgroup
shm 64M 0 64M 0% /dev/shm
tmpfs 63G 104M 63G 1% /etc/passwd
172.30.131.189:6789,172.30.233.37:6789,172.30.245.29:6789:/volumes/csi/csi-vol-0b95f4d1-66fb-11eb-87f5-22985d810218/adbb7ac7-0f64-4ee6-841a-718c649f8867 187G 153G 35G 82% /elasticsearch/persistent
/dev/mapper/coreos-luks-root-nocrypt 372G 148G 224G 40% /etc/hosts
tmpfs 63G 28K 63G 1% /etc/openshift/elasticsearch/secret
tmpfs 63G 32K 63G 1% /run/secrets/kubernetes.io/serviceaccount
tmpfs 63G 0 63G 0% /proc/acpi
tmpfs 63G 0 63G 0% /proc/scsi
tmpfs 63G 0 63G 0% /sys/firmware
Interesting, from the kibana container logs:
{"type":"log","@timestamp":"2021-02-09T16:50:09Z","tags":["warning","migrations"],"pid":121,"message":"Another Kibana instance appears to be migrating the index. Waiting for that migration to complete. If no other Kibana instance is attempting migrations, you can get past this message by deleting index .kibana_-343928101_ronenschaffergmailcom_2 and restarting Kibana."}
Deleting only `.kibana_-343928101_ronenschaffergmailcom_2` just resulted in another Kibana user index hitting the same error and blocking the migration. Deleting all the indices seems to have allowed Kibana to get back to a green state.
Note that users will have to set up their index patterns again.
Thanks to @anishasthana for providing guidance on using the Elasticsearch SA token to get successful curl calls to Elasticsearch API.
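For the record, a sketch of that approach; the `elasticsearch` ServiceAccount name and the pod name are assumptions from this thread, and any bearer token the ES proxy accepts (e.g. `oc whoami -t`) should work the same way:

```bash
# Grab a token and curl the ES API from inside an ES pod.
TOKEN=$(oc -n openshift-logging sa get-token elasticsearch)
ES_POD=elasticsearch-cdm-293ha663-1-6dc9cdb878-fcwwj

# List indices (the stray .kibana_* indices show up here):
oc -n openshift-logging exec -c elasticsearch "$ES_POD" -- \
  curl -sk -H "Authorization: Bearer $TOKEN" 'https://localhost:9200/_cat/indices?v'

# Delete the index that was blocking the Kibana migration:
oc -n openshift-logging exec -c elasticsearch "$ES_POD" -- \
  curl -sk -H "Authorization: Bearer $TOKEN" \
  -X DELETE 'https://localhost:9200/.kibana_-343928101_ronenschaffergmailcom_2'
```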
I was able to query kibana for logs :+1:
I encountered a few issues trying to run curl commands, both via localhost and by following the guide provided by CLO here using an exposed route.
In the future it would be helpful to get a proper exposed route set up, so that admins can diagnose and apply fixes via curl and/or other external command-line tools.
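Something like the sketch below might cover that; it assumes the standard `elasticsearch` service and secret names used by CLO and that the ES proxy accepts a bearer token over a route, so treat it as a starting point rather than a recipe:

```bash
# Expose ES behind a re-encrypt route so admins can curl it from outside.
oc -n openshift-logging extract secret/elasticsearch --keys=admin-ca --to=.   # writes ./admin-ca
oc -n openshift-logging create route reencrypt elasticsearch \
  --service=elasticsearch --dest-ca-cert=admin-ca

# Then, from a workstation:
ES_ROUTE=$(oc -n openshift-logging get route elasticsearch -o jsonpath='{.spec.host}')
curl -sk -H "Authorization: Bearer $(oc whoami -t)" "https://${ES_ROUTE}/_cat/health?v"
```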
Kibana seems to be having issues connecting to ES:
@HumairAK I have been able to set up two index patterns in the Kibana dashboard (one on `index*` and one on `app*`). However, when trying to retrieve logs for these indices (in the `Discover` tab), the dashboard complains about a `Gateway Timeout`:
SearchError: Gateway Timeout
at https://kibana-openshift-logging.apps.cnv.massopen.cloud/bundles/kibana.bundle.js:2:627663
at processQueue (https://kibana-openshift-logging.apps.cnv.massopen.cloud/built_assets/dlls/vendors.bundle.dll.js:316:199687)
at https://kibana-openshift-logging.apps.cnv.massopen.cloud/built_assets/dlls/vendors.bundle.dll.js:316:200650
at Scope.$digest (https://kibana-openshift-logging.apps.cnv.massopen.cloud/built_assets/dlls/vendors.bundle.dll.js:316:210412)
at Scope.$apply (https://kibana-openshift-logging.apps.cnv.massopen.cloud/built_assets/dlls/vendors.bundle.dll.js:316:213219)
at done (https://kibana-openshift-logging.apps.cnv.massopen.cloud/built_assets/dlls/vendors.bundle.dll.js:316:132717)
at completeRequest (https://kibana-openshift-logging.apps.cnv.massopen.cloud/built_assets/dlls/vendors.bundle.dll.js:316:136329)
at XMLHttpRequest.requestLoaded (https://kibana-openshift-logging.apps.cnv.massopen.cloud/built_assets/dlls/vendors.bundle.dll.js:316:135225)
Intermittent connectivity issues to ES?
Thanks for the update @drbwa! I believe this has to do with ES getting full. We have a PR currently in the works that will reduce retention for now while we work on getting long-term storage in place. Will update accordingly.
As a little aside, @HumairAK do you know how much storage is currently available to ES? We (our project) will want to keep historic records of logs around, and while we will probably want to do this via ETL to object storage, it would be interesting to know what capacity we have for ES.
@drbwa currently it has a 200Gi PVC
We're planning on some more long term storage outside of the cluster https://github.com/operate-first/apps/issues/232
We decided to re-install the CLO operator and delete the old PVCs, hence losing the data up to this point (~5 days' worth). This was due to various complications that we surmise occurred as a result of the operator update.
Given the current rate of ingest, we were capping out on our PVC storage (200 GiB) with ~4-5 days of logs. We updated the node count to 2, resulting in 2 PVCs of ~300 GiB each. Retention is still 7 days for now, but we will monitor how many days we can squeeze out of this much storage while we look for a more permanent persistent storage solution.
We also noticed that fluentd pods were constantly getting 500 errors, e.g.:
2021-02-10 21:06:06 +0000 [warn]: [clo_default_output_es] failed to flush the buffer. retry_time=16 next_retry_seconds=2021-02-10 21:10:42 +0000 chunk="5bb0173136d751d8d87af4f32d80f274" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch.openshift-logging.svc.cluster.local\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"}): [500] {\"code\":500,\"message\":\"Internal Error\",\"error\":{}}\n"
2021-02-10 21:06:06 +0000 [warn]: suppressed same stacktrace
We deduced this was a result of the ES pods not having enough CPU, so we increased the CPU count to 6 per pod. We are now seeing logs show up a bit quicker in ES. Once again we will monitor and adjust this value accordingly if we see an increasing delay in the rate of ingest.
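For posterity, the spec changes described above (2 ES nodes, ~300Gi PVCs, 7-day retention, 6 CPUs per ES pod) roughly correspond to the following patch against the ClusterLogging CR. This is a sketch only; the field paths follow the standard CLO API and whether the CPU was set as a request or a limit isn't recorded here, so double-check with `oc explain clusterlogging.spec` before applying:

```bash
# Merge-patch the ClusterLogging instance with the new ES sizing/retention.
oc -n openshift-logging patch clusterlogging instance --type merge -p '
{
  "spec": {
    "logStore": {
      "retentionPolicy": {"application": {"maxAge": "7d"}},
      "elasticsearch": {
        "nodeCount": 2,
        "storage": {"size": "300Gi"},
        "resources": {"requests": {"cpu": "6"}}
      }
    }
  }
}'
```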
Kibana is up and running right now.
We found that Kibana had old indices stored (`.kibana*`) that it was trying to migrate on every restart. The migration would fail and Kibana would end up in a non-functional state, so we decided to delete the old PVCs to get rid of the problematic indices.
We also found that starting Kibana before elasticsearch was ready led to indexing issues for Kibana.
Both Kibana and Elasticsearch are stabilized now. PVC utilization is low.
Thank you @4n4nd and @HumairAK for the reinstall, it was by far the easiest solution :+1:
Post mortem recap:
Closing this issue now, since the problem was resolved. If it appears again, we know where to look.
Describe the bug
This is probably less of a bug than an issue or incident ticket. Kibana is currently not available.
To Reproduce
Steps to reproduce the behavior: Kibana shows `Kibana server is not ready yet`.
Expected behaviour
See the Kibana dashboard.
/cc @eranra