Hello, thanks for the quick response. The fluentd logs are being written to disk, but there have been no warnings or errors in them since yesterday (and I have had several failures this morning). What resource information can I provide for Elasticsearch and fluentd?
Is this a new problem? That is, had you been running logging with no drops until something changed, and now you have drops? If so, what changed? If not, is this the first time you have run a test with "tracers" to see whether all logs are getting through?
We became aware of the issue when we started running this tracing script. I did more testing after opening this ticket and found that when I have my testing pods send 20 log lines instead of 5 (so the pods are alive for 100 seconds instead of only 25 seconds), I have a 100% success rate in logging. So it seems that shorter-lived pods sometimes don't get picked up by fluentd or Elasticsearch.
Several times I ran the 100-second test and watched the corresponding Elasticsearch index while the pod was generating logs, and I didn't see any log entries show up in the index for over 60 seconds. Is this kind of latency/delay normal, or do I have some kind of load/configuration issue that causes such a long delay?
One additional data point: I changed my testing script to generate only 2 log lines (pod lifetime: 10 seconds) before terminating, and when I queried Elasticsearch 100 seconds later, the logs from one of the pods never made it to ES.
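For what it's worth, one way to watch an index fill in while a test pod is running is to poll the doc counts from inside the Elasticsearch pod. This is only a sketch: the `logtest` project name, the `component=es` label, and the cert paths are assumptions based on a typical openshift-logging 3.11 deployment and may need adjusting.

```sh
# Hypothetical: poll the test project's indices every 5 seconds while the
# tracer pod is emitting lines, and watch the docs.count column grow.
ES_POD=$(oc -n openshift-logging get pods -l component=es \
  -o jsonpath='{.items[0].metadata.name}')
while true; do
  oc -n openshift-logging exec -c elasticsearch "$ES_POD" -- \
    curl -s --cacert /etc/elasticsearch/secret/admin-ca \
         --cert /etc/elasticsearch/secret/admin-cert \
         --key /etc/elasticsearch/secret/admin-key \
         'https://localhost:9200/_cat/indices/project.logtest.*?v'
  sleep 5
done
```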
Yes, this kind of latency/delay is "normal" - I mean, you might consider it to be a bug, but it is a consequence of the fluentd in_tail parameter defaults (https://docs.fluentd.org/v/0.12/input/tail#refresh_interval and https://docs.fluentd.org/v/0.12/input/tail#rotate_wait), and unfortunately we do not have the ability to set these in 3.11 (https://github.com/openshift/origin-aggregated-logging/pull/1737 and https://github.com/openshift/origin-aggregated-logging/pull/1733 et al.). In practice, with long-lived pods, it isn't a problem.
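For illustration, these are the in_tail parameters in question. The snippet below is only a sketch of what such a source block looks like: the path, pos_file, tag, and format are typical container-log values rather than the exact config shipped in the 3.11 image, and as noted above these intervals are not exposed as tunables there.

```
<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/es-containers.log.pos
  tag kubernetes.*
  format json
  # refresh_interval: how often in_tail rescans the path glob for new files.
  # Default is 60s, so a pod whose log file appears and is removed between
  # two scans may never be tailed at all.
  refresh_interval 15
  # rotate_wait: how long in_tail keeps reading a file after it is rotated
  # or deleted. Default is 5s.
  rotate_wait 30
</source>
```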
Do you have a requirement to process logs from lots of short-lived pods?
Rich, once again thanks for the valuable information. This cluster will definitely be running some batch-job-style workloads that are short-running, but I don't know exactly what their duration is going to be. I'll have to talk to my developers to see what they plan to do.
As a workaround, I could add a 60-second sleep to the end of any of these short-running pods so that they stay alive past the in_tail refresh interval of 60 seconds, ensuring that fluentd picks up the pod's logs. Am I understanding that correctly?
Yes.
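As a concrete sketch of that workaround (the batch command and the padding value are placeholders; anything comfortably past the 60-second refresh interval should do), the container's command could be wrapped like this:

```sh
# Hypothetical: run the real batch command, remember its exit code, pad the
# pod's lifetime past fluentd's in_tail refresh_interval (60s by default),
# then exit with the original code so the job status stays meaningful.
sh -c 'run-batch-job; rc=$?; sleep 70; exit $rc'
```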
Closing due to age. Please reopen if appropriate.
We have an Origin 3.11 cluster with 3 masters and 3 nodes, with EFK deployed in the openshift-logging namespace. We are noticing intermittent issues with logs failing to make their way from the container into Elasticsearch. It happens across all 3 nodes at different times. I'm using the following simple script to replicate the issue by generating a pod on all 3 nodes, generating 5 log lines per pod, waiting a short time, and querying Elasticsearch to make sure all 15 log lines are present. About a quarter of the time, only 2 of the 3 pods' logs show up in Elasticsearch.
There are no errors in the fluentd logs on any of the nodes, even after I redeployed the daemonset with VERBOSE=true and DEBUG=true. Are there additional ways we can go about debugging or troubleshooting an intermittent failure like this? Thank you for any advice you can give.
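Since the script itself isn't included above, here is a minimal sketch of that kind of tracer; the busybox image, marker string, ES pod label, and cert paths are assumptions, not the original script, and node placement is left out.

```sh
#!/bin/bash
# Hypothetical tracer (not the original script): start three short-lived pods,
# have each emit 5 marker lines (one every 5 seconds), then count how many of
# the expected 15 lines made it into Elasticsearch.
MARKER="trace$(date +%s)"

for i in 1 2 3; do
  # The original runs one pod per node; placement is omitted here for brevity.
  oc run "${MARKER}-${i}" --image=busybox --restart=Never --command -- \
    sh -c "for n in 1 2 3 4 5; do echo ${MARKER} line \$n; sleep 5; done"
done

sleep 120   # give fluentd time to pick up the files and flush to Elasticsearch

ES_POD=$(oc -n openshift-logging get pods -l component=es \
  -o jsonpath='{.items[0].metadata.name}')
# Expect "count":15 if nothing was dropped.
oc -n openshift-logging exec -c elasticsearch "$ES_POD" -- \
  curl -s --cacert /etc/elasticsearch/secret/admin-ca \
       --cert /etc/elasticsearch/secret/admin-cert \
       --key /etc/elasticsearch/secret/admin-key \
       "https://localhost:9200/project.*/_count?q=message:${MARKER}"
```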