openshift / origin-aggregated-logging


Intermittent failure to collect logs through EFK #1740

Closed: benscottub closed this issue 4 years ago

benscottub commented 5 years ago

We have an origin 3.11 cluster with 3 masters and 3 nodes, with EFK deployed in the openshift-logging namespace. We are noticing intermittent issues where logs fail to make their way from the container into elasticsearch. It happens across all 3 nodes at different times. I'm using the following simple script to replicate the issue: it starts a pod on each of the 3 nodes, generates 5 log lines per pod, waits a short time, and then queries elasticsearch to make sure all 15 log lines are present. About a quarter of the time, only 2 of the 3 pods' logs show up in elasticsearch.

mark=$(md5sum -t <<< `date` | cut -d' ' -f1)

# start a pod on each node
for h in 'osn1' 'osn2' 'osn3'; do
    oc run log-test-$h-$mark --restart Never --rm --attach --image alpine --overrides '
      {
        "spec": {
          "nodeName": "'"$h"'.mydomain.net",
          "containers": [
            {
              "name": "log-test-'"$h"'-'"$mark"'",
              "image": "alpine",
              "command": [ "sh", "-c", "for i in $(seq 5); do echo $(date) $HOSTNAME $i; sleep 5; done" ],
            }
          ]
        }
      }
    ' &
done

# wait for pods to finish
wait

# give a little extra time for logs to send
sleep 30

# query logs for messages matching pods with unique hash name
hits=$(
  curl -fsS -b ~/.session \
    "https://logs.mydomain.net/elasticsearch/project.$(oc project -q).*/_search?q=kubernetes.pod_name:log-test-*-$mark" \
    | jq .hits.total
)

# assert there are 15 messages captured
test "$hits" = 15

There are no errors in the fluentd logs on any of the nodes, even after I redeployed the daemonset with VERBOSE=true and DEBUG=true. Are there other ways to debug or troubleshoot an intermittent failure like this? Thank you for any advice you can give.

jcantrill commented 5 years ago
1. Consider checking fluent's logs [1], which are likely being written to disk.
2. Also, please provide resource information for both ES and fluent (see the sketch of commands below the link).

[1] https://github.com/openshift/origin-aggregated-logging/tree/release-3.11/fluentd#configuration
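
For reference, a minimal sketch of commands that would gather both, assuming the default openshift-logging namespace, the logging-fluentd daemonset name, the component=fluentd / component=es pod labels, and the default on-disk log path described in the README above (adjust names, labels, and paths if your install differs):

# where the collector and ES pods are running
oc -n openshift-logging get pods -l component=fluentd -o wide
oc -n openshift-logging get pods -l component=es -o wide

# configured CPU/memory requests and limits
oc -n openshift-logging get ds logging-fluentd \
  -o jsonpath='{.spec.template.spec.containers[*].resources}'; echo
oc -n openshift-logging get dc -l component=es \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.template.spec.containers[*].resources}{"\n"}{end}'

# tail fluentd's on-disk log inside one collector pod
pod=$(oc -n openshift-logging get pods -l component=fluentd -o jsonpath='{.items[0].metadata.name}')
oc -n openshift-logging exec "$pod" -- tail -n 100 /var/log/fluentd/fluentd.log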

benscottub commented 5 years ago

Hello, thanks for the quick response. The fluent logs are being written to disk, but there are no warnings or errors in the log since yesterday (and I have had several failures this morning). What resource information can I provide for ES and fluent?

richm commented 5 years ago

Is this a new problem? That is, had you been running logging with no drops, then something changed, and now you have drops? If so, what changed? If not, is this the first time you have run a test with "tracers" to see whether all logs are getting through?

benscottub commented 5 years ago

We became aware of the issue when we started running this tracing script. I did more testing after opening this ticket and found that when I have my test pods send 20 log lines instead of 5 (so the pods are alive for 100 seconds instead of only 25), I have a 100% success rate. So it seems that shorter-lived pods sometimes don't get picked up by fluentd or elasticsearch.

Several times I ran the 100-second test and watched the corresponding elasticsearch index while the pod was generating logs, and I didn't see any log entries show up in the index for over 60 seconds. Is this kind of latency/delay normal, or do I have some kind of load/configuration issue that causes such a long delay?
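
(For anyone reproducing this, a minimal sketch of one way to watch the index during the test, reusing the proxy URL, session cookie, and $mark variable from the script above; the Elasticsearch _count endpoint here is just one convenient choice, not necessarily the exact command used:)

# poll the per-project index every 5 seconds and print how many tracer messages have arrived
while sleep 5; do
  curl -fsS -b ~/.session \
    "https://logs.mydomain.net/elasticsearch/project.$(oc project -q).*/_count?q=kubernetes.pod_name:log-test-*-$mark" \
    | jq .count
done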

benscottub commented 5 years ago

One additional data point: I changed my test script to generate only 2 log lines (pod lifetime: 10 seconds) before terminating, and when I queried elasticsearch 100 seconds later, one of the pods had not sent its logs to ES.

richm commented 5 years ago

> We became aware of the issue when we started running this tracing script. I did more testing after opening this ticket and found that when I have my test pods send 20 log lines instead of 5 (so the pods are alive for 100 seconds instead of only 25), I have a 100% success rate. So it seems that shorter-lived pods sometimes don't get picked up by fluentd or elasticsearch.
>
> Several times I ran the 100-second test and watched the corresponding elasticsearch index while the pod was generating logs, and I didn't see any log entries show up in the index for over 60 seconds. Is this kind of latency/delay normal, or do I have some kind of load/configuration issue that causes such a long delay?

Yes, this kind of latency/delay is "normal"; I mean, you might consider it a bug, but it is a consequence of the fluentd in_tail parameter defaults (https://docs.fluentd.org/v/0.12/input/tail#refresh_interval and https://docs.fluentd.org/v/0.12/input/tail#rotate_wait), and unfortunately we do not have the ability to set these in 3.11 (https://github.com/openshift/origin-aggregated-logging/pull/1737 and https://github.com/openshift/origin-aggregated-logging/pull/1733, et al.). In practice, with long-lived pods, it isn't a problem.
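
For context, a minimal sketch of how to check whether the deployed collector overrides those defaults, assuming the config lives under /etc/fluent/configs.d/ in the fluentd image and the pods carry the component=fluentd label (both assumptions; adjust to your deployment):

# look for explicit in_tail timing settings in one fluentd pod;
# if none are set, the upstream defaults apply (refresh_interval 60s, rotate_wait 5s)
pod=$(oc -n openshift-logging get pods -l component=fluentd -o jsonpath='{.items[0].metadata.name}')
oc -n openshift-logging exec "$pod" -- \
  grep -rn -e refresh_interval -e rotate_wait /etc/fluent/configs.d/ \
  || echo "no explicit settings found; fluentd defaults are in effect"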

Do you have a requirement to process logs from lots of short-lived pods?

benscottub commented 5 years ago

Rich, once again thanks for the valuable information. This cluster will definitely be running some batch-job-style workloads that are short-running, but I don't know exactly what their duration will be. I'll have to talk to my developers to see what they plan to do.

As a workaround, I could add a 60-second sleep to the end of any of these short-running pods so that they stay alive past the 60-second in_tail refresh interval, ensuring fluentd picks up the pod's logs. Am I understanding that correctly?

richm commented 5 years ago

> Rich, once again thanks for the valuable information. This cluster will definitely be running some batch-job-style workloads that are short-running, but I don't know exactly what their duration will be. I'll have to talk to my developers to see what they plan to do.
>
> As a workaround, I could add a 60-second sleep to the end of any of these short-running pods so that they stay alive past the 60-second in_tail refresh interval, ensuring fluentd picks up the pod's logs. Am I understanding that correctly?

Yes.
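
To make the workaround concrete, a minimal sketch of the adjusted container command from the test script above; the extra 70-second sleep is just a margin beyond the 60-second refresh_interval, not a documented value:

# emit the tracer lines as before, then keep the container alive past in_tail's refresh window
for i in $(seq 5); do echo $(date) $HOSTNAME $i; sleep 5; done; sleep 70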

jcantrill commented 4 years ago

Closing due to age. Please reopen if appropriate.