When using the current version (which doesn't send anything to Splunk for us), the messages are getting buffered, the fluentd queue becomes overwhelmed, and we get alerts.
Example:
Annotations
message = In the last minute, fluentd <redacted> buffer queue length increased more than 32. Current value is 216.
summary = Fluentd is overwhelmed
@DanaEHI Can you check to see if there are any logs in the OpenShift fluentd pods?
@sabre1041 Here are the errors in fluentd pods:
2021-03-14 15:41:53 +0000 [warn]: failed to flush the buffer. retry_time=2 next_retry_seconds=2021-03-14 15:41:56 +0000 chunk="5bd80e1b7f2090a7ae49c527c319bf59" error_class=Errno::ENOENT error="No such file or directory @ rb_sysopen - /var/run/ocp-collector/secrets/openshift-logforwarding-splunk/tls.crt"
2021-03-14 15:41:53 +0000 [warn]: /usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/plugin_helper/socket.rb:154:in `read'
2021-03-14 15:41:53 +0000 [warn]: /usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/plugin_helper/socket.rb:154:in `socket_create_tls'
2021-03-14 15:41:53 +0000 [warn]: /usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/plugin/out_forward.rb:352:in `create_transfer_socket'
2021-03-14 15:41:53 +0000 [warn]: /usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/plugin/out_forward/connection_manager.rb:46:in `call'
2021-03-14 15:41:53 +0000 [warn]: /usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/plugin/out_forward/connection_manager.rb:46:in `connect'
2021-03-14 15:41:53 +0000 [warn]: /usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/plugin/out_forward.rb:732:in `connect'
2021-03-14 15:41:53 +0000 [warn]: /usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/plugin/out_forward.rb:606:in `send_data'
2021-03-14 15:41:53 +0000 [warn]: /usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/plugin/out_forward.rb:336:in `block in write'
2021-03-14 15:41:53 +0000 [warn]: /usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/plugin/out_forward/load_balancer.rb:46:in `block in select_healthy_node'
2021-03-14 15:41:53 +0000 [warn]: /usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/plugin/out_forward/load_balancer.rb:37:in `times'
2021-03-14 15:41:53 +0000 [warn]: /usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/plugin/out_forward/load_balancer.rb:37:in `select_healthy_node'
2021-03-14 15:41:53 +0000 [warn]: /usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/plugin/out_forward.rb:336:in `write'
2021-03-14 15:41:53 +0000 [warn]: /usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/plugin/output.rb:1125:in `try_flush'
2021-03-14 15:41:53 +0000 [warn]: /usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/plugin/output.rb:1431:in `flush_thread_run'
2021-03-14 15:41:53 +0000 [warn]: /usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/plugin/output.rb:461:in `block (2 levels) in start'
2021-03-14 15:41:53 +0000 [warn]: /usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
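The ENOENT indicates fluentd is looking for a TLS certificate that is not mounted at /var/run/ocp-collector/secrets/openshift-logforwarding-splunk/tls.crt. A quick way to confirm (secret name and namespace taken from the error path above; substitute a real collector pod name):
# does the secret referenced in the error path exist?
oc -n openshift-logging get secret openshift-logforwarding-splunk
# is it actually mounted where fluentd expects it?
oc -n openshift-logging exec <fluentd-pod> -- ls /var/run/ocp-collector/secrets/openshift-logforwarding-splunk/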
Here is my values file:
openshift:
  logging:
    namespace: openshift-logging
    elasticsearch:
      port: 9200
  forwarding:
    audit:
      elasticsearch: false
      splunk: false
    app:
      elasticsearch: false
      splunk: true
    infra:
      elasticsearch: false
      splunk: true
forwarding:
  fluentd:
    port: 24224
    sharedkey: splunkforwarding
    passphrase: ""
    ssl: true
    caFile: files/default-openshift-logging-fluentd.crt
    keyFile: files/default-openshift-logging-fluentd.key
    loglevel: warn
    replicas: 2
    # Set to true when version <4.6
    scl: false
    persistence:
      enabled: false
      size: 5Gi
      ## If defined, storageClassName: <storageClass>
      ## If set to "-", storageClassName: "", which disables dynamic provisioning
      ## If undefined (the default) or set to null, no storageClassName spec is
      ## set, choosing the default provisioner. (gp2 on AWS, standard on
      ## GKE, AWS & OpenStack)
      ##
      # storageClass: "-"
      storageClass: ""
      accessMode: ReadWriteOnce
    image: registry.redhat.io/openshift4/ose-logging-fluentd:v4.6
    nodeSelector: {}
    tolerations: []
    affinity: {}
    resources:
      requests:
        cpu: 100m
        memory: 512Mi
      limits:
        cpu: 500m
        memory: 1024Mi
    updateStrategy:
      type: "RollingUpdate"
    buffer:
      "@type": memory
      chunk_limit_records: 100000
      chunk_limit_size: 200m
      flush_interval: 5s
      flush_thread_count: 1
      overflow_action: block
      retry_max_times: 3
      total_limit_size: 600m
      # Example configuration to support file based buffering
      # "@type": file
      # path: /var/log/fluentd/fluentd-buffers/buffer
      # flush_mode: interval
      # retry_type: exponential_backoff
      # flush_thread_count: 2
      # flush_interval: "5s"
      # retry_forever:
      # retry_max_interval: 30
      # chunk_limit_size: "200m"
      # total_limit_size: "600m"
      # chunk_limit_records: 100000
      # overflow_action: block
  splunk:
    # Specify Splunk HEC Token and Index
    token: ...
    index: ...
    protocol: http
    hostname: ...
    port: 80
    insecure: true
    sourcetype: openshift_logs
    source: openshift
    # Specify the CA Certificate for Splunk
    # caFile: "files/splunk-ca.crt"
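For reference, assuming the chart renders the buffer map above one-to-one into the Splunk output's match section (my reading of the chart; the generated ConfigMap should confirm), the effective fluentd buffer config would look roughly like:
<buffer>
  # in-memory buffering, as set in the values file above
  @type memory
  chunk_limit_records 100000
  chunk_limit_size 200m
  flush_interval 5s
  flush_thread_count 1
  overflow_action block
  retry_max_times 3
  total_limit_size 600m
</buffer>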
It was most likely caused by this commit, which was added to handle changes in 4.7.
I'll work on adding conditional logic to address <4.7 and will follow up with a resolution.
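Sketch only, not the actual change: the conditional could be driven by a values flag (hypothetically named ocpVersion here) that gates the 4.7-style secret reference in the ClusterLogForwarder template:
{{- /* hypothetical flag; only attach the output secret on OpenShift >= 4.7 */}}
{{- if semverCompare ">=4.7.0" .Values.ocpVersion }}
secret:
  name: openshift-logforwarding-splunk
{{- end }}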
Thank you @sabre1041 - we are seeing logs in Splunk now! However, they appear to be at the info level, even though I've set loglevel: warn in my values file. Is that the right place to filter the logs, or would I need to edit the ConfigMap and add another <filter **> section?
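For example, would adding a grep filter like this to the generated config be the way to drop them (assuming the forwarded records carry a level field)?
<filter **>
  @type grep
  <exclude>
    key level
    pattern /^(debug|info)$/
  </exclude>
</filter>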
@sabre1041 that configuration is for the logging of the forwarder itself, not for the messages that it is forwarding.
Thank you for the extra clarification! This is working with 4.6.12.
We are on OpenShift 4.6 and are currently using the November release of this chart (the newest version doesn't send anything to Splunk for us).
Setting the ConfigMap value
fluentd-loglevel: warn
still results in info-level messages being sent to Splunk. For example, I deleted the fluentd pods and the openshift-logforwarding-splunk-# pods, but the messages being sent are still at the info level (20k+ per minute).
I attempted to use the new "for 4.6" version, but nothing was sent to Splunk at all. I verified that my HEC token and index were correct in the values file, and that I had at least one category set to send to Splunk. After installing the previous version, the buffered logs (the ones collected while the "4.6" version was installed) were successfully sent to Splunk.
I don't see any messages in the openshift-logforwarding-splunk-# pods, and I verified that messages are still being sent to Kibana during the times when nothing is going to Splunk.
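For what it's worth, a direct test against the HEC endpoint (hostname, token, and index exactly as in the values file above) should help rule out the Splunk side:
curl "http://<hostname>:80/services/collector/event" \
  -H "Authorization: Splunk <token>" \
  -d '{"event": "test from openshift", "sourcetype": "openshift_logs", "index": "<index>"}'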