kube-logging / logging-operator

Logging operator for Kubernetes
https://kube-logging.dev
Apache License 2.0

splunk-hec output doesn't work when prometheus metrics are enabled #712

Closed: kenankule closed this issue 3 years ago

kenankule commented 3 years ago

Describe the bug: If the operator is installed using helm chart version 3.9.4, the splunk-hec output plugin doesn't work due to an incompatibility between fluent-plugin-prometheus and fluent-plugin-splunk-hec.

The issue shows up as ConfigError error="Duplicated plugin id ..." in the fluentd-configcheck pods during the dry-run.
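For reference, the error can be read straight from the configcheck pod's logs; a quick sketch (namespace and pod name are illustrative, adjust to your deployment):

kubectl -n logging get pods | grep fluentd-configcheck
kubectl -n logging logs <fluentd-configcheck-pod> | grep "Duplicated plugin id"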

Expected behaviour: The splunk-hec output works when Prometheus metrics are enabled.

Steps to reproduce the bug: Check the expected behaviour section.

Additional context: The fluentd image used in the helm chart, ghcr.io/banzaicloud/fluentd:v1.11.5-alpine-12, has fluent-plugin-prometheus v2.0.0 installed in it. fluent-plugin-splunk-hec seems to be compatible with fluent-plugin-prometheus v1.8.x. There is a related issue reported on the fluent-plugin-splunk-hec side: https://github.com/splunk/fluent-plugin-splunk-hec/issues/163
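To double-check the plugin versions baked into an image, the installed gems can be listed from the container; a quick sketch (assumes gem is on the image PATH, which it is for these Ruby-based fluentd images):

docker run --rm --entrypoint gem ghcr.io/banzaicloud/fluentd:v1.11.5-alpine-12 list | grep -E 'fluent-plugin-prometheus|fluent-plugin-splunk-hec|prometheus-client'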

When I downgrade to helm chart version 3.9.2, it works. I've not tried the latest chart version with the fluentd image tag overridden. I've also looked at the generated fluentd config: if I remove the prometheus sections coming from input.conf, the dry-run completes without a problem on the latest image (fluentd:v1.11.5-alpine-12).
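A rough sketch of how the generated config can be dumped for that kind of local dry-run (the operator renders it into a secret mounted by the fluentd statefulset; the secret name and data key below are illustrative, so list the secrets first):

kubectl -n logging get secrets | grep fluentd
kubectl -n logging get secret <fluentd-app-secret> -o jsonpath='{.data.fluentd\.conf}' | base64 -d > generated.conf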

Still, moving from 3.9.2 to 3.9.4 should not break a plugin.

Environment details:

/kind bug

ahma commented 3 years ago

Hi @kenankule, sorry for the late response. Could you please check it with the following fluentd image? v1.11.5-alpine-18

kenankule commented 3 years ago

I get the following error with v1.11.5-alpine-18:

     1: from /usr/lib/ruby/2.7.0/rubygems/specification.rb:1369:in `activate'
/usr/lib/ruby/2.7.0/rubygems/specification.rb:2247:in `raise_if_conflicts': Unable to activate fluent-plugin-prometheus-2.0.1, because prometheus-client-0.9.0 conflicts with prometheus-client (>= 2.1.0) (Gem::ConflictError)
ahma commented 3 years ago

@kenankule thanks for the quick answer! Could you please double-check the image tag version? Sorry for that, but this error looks like something we had already solved. :(

kenankule commented 3 years ago

Sample fluentd config:

#sample.conf
# Enable RPC endpoint (this allows to trigger config reload without restart)
<system>
  rpc_endpoint 127.0.0.1:24444
  log_level info
  workers 1
</system>

# Prometheus monitoring

<source>
    @type prometheus
    port 24231
    metrics_path /metrics
</source>
<source>
    @type prometheus_monitor
</source>
<source>
    @type prometheus_output_monitor
</source>

<source>
  @type forward
  @id main_forward
  bind 0.0.0.0
  port 24240
</source>
<match **>
  @type label_router
  @id main
  metrics true
  <route>
    @label @stdoutlabel
    metrics_labels {"id":"clusterflow:banzai-logging:stdout"}
    <match>
      negate false
    </match>
  </route>
</match>
<label @stdoutlabel>
  <match **>
    @type splunk_hec
    @id clusterflow:banzai-logging:stdout:clusteroutput:banzai-logging:splunk-hec
    hec_host my_hec_host
    hec_token my_hec_token
    index my_index
    protocol https
    source fluentd
    <buffer []>
      @type file
      path /buffers/clusterflow:banzai-logging:stdout:clusteroutput:banzai-logging:splunk-hec.*.buffer
      retry_forever true
      timekey 10m
      timekey_wait 10m
    </buffer>
  </match>
</label>

Sample run:

docker run -ti -v `pwd`/sample.conf:/tmp/fluentd.conf banzaicloud/fluentd:v1.11.5-alpine-18 --dry-run -v -c /tmp/fluentd.conf
ahma commented 3 years ago

Thanks! We'll try to solve this as soon as we can.

kenankule commented 3 years ago

Ah! The image in ghcr.io with the same tag (fluentd:v1.11.5-alpine-18) worked! Maybe the docker.io image is not the same as the ghcr.io image.

docker run -ti -v `pwd`/bundle.conf:/tmp/fluentd.conf ghcr.io/banzaicloud/fluentd:v1.11.5-alpine-18 --dry-run -v -c /tmp/fluentd.conf
...
2021-04-30 20:47:04 +0000 [info]: fluent/log.rb:329:info: finished dry run mode
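To see whether the two registries really ship different bits, the image digests can be compared:

docker pull banzaicloud/fluentd:v1.11.5-alpine-18
docker pull ghcr.io/banzaicloud/fluentd:v1.11.5-alpine-18
docker images --digests | grep v1.11.5-alpine-18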
ahma commented 3 years ago

This is strange, I'll check it. Thanks!

ahma commented 3 years ago

Hi @kenankule, could you please confirm whether this version is working fine now (not just in dry-run mode)? Btw, this is our fix in the plugin, and if it's working fine we'll try to contribute it back. Thanks for your support.

kenankule commented 3 years ago

I've updated the logging object with the fluentd tag v1.11.5-alpine-18. The operator was initially deployed using helm chart 3.9.2. I hope that's a good enough test, because I don't have an environment to try a fresh install. I'll run it for a couple of hours, check the prometheus metrics to see if it's OK, and update the issue.
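For reference, this is roughly how I pin the image on the Logging resource (resource name is illustrative; the field path matches the spec.fluentd.image section of my Logging spec, adjust if your CRD version differs):

kubectl patch logging <logging-name> --type merge -p '{"spec":{"fluentd":{"image":{"repository":"ghcr.io/banzaicloud/fluentd","tag":"v1.11.5-alpine-18"}}}}'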

ahma commented 3 years ago

Great, thanks!

kenankule commented 3 years ago

After some hours of testing, I've seen that there are log files created under /tmp/fluent/backup/worker0. Looking at the amount of logs sent to splunk, I believe the logs are sent successfully, but they are still stored in the /tmp/fluent/backup/worker0 folder in log files that are created every minute. I have the feeling that even the chunks sent successfully to splunk are marked as "not successful" or something.

The Grafana dashboard looks fine, so the metrics are working. Is there anything else I could investigate? I've checked another cluster running the alpine-11 version, and that fluentd does not seem to write logs to the tmp folder.
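In case it helps narrow this down, this is roughly what I'm checking (pod name and namespace are illustrative; the metric names come from the prometheus_output_monitor source enabled in the config above):

kubectl -n logging exec <fluentd-pod> -- ls -lh /tmp/fluent/backup/worker0
kubectl -n logging exec <fluentd-pod> -- wget -qO- http://127.0.0.1:24231/metrics | grep -E 'fluentd_output_status_(num_errors|retry_count)'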

kenankule commented 3 years ago

I think I was able to reproduce the error in a local setup. I'm running the new image against a local Splunk HEC and I get the following error:

2021-05-03 21:37:07 +0000 [warn]: #0 [clusterflow:banzai-logging:stdout:clusteroutput:banzai-logging:splunk-hec] got unrecoverable error in primary and no secondary error_class=ArgumentError error="unknown keywords: :type, :plugin_id, :status"
  2021-05-03 21:37:07 +0000 [warn]: #0 /usr/lib/ruby/gems/2.7.0/gems/prometheus-client-2.1.0/lib/prometheus/client/counter.rb:13:in `increment'
  2021-05-03 21:37:07 +0000 [warn]: #0 /usr/lib/ruby/gems/2.7.0/gems/fluent-plugin-splunk-hec-1.2.5/lib/fluent/plugin/out_splunk.rb:156:in `process_response'
  2021-05-03 21:37:07 +0000 [warn]: #0 /usr/lib/ruby/gems/2.7.0/gems/fluent-plugin-splunk-hec-1.2.5/lib/fluent/plugin/out_splunk_hec.rb:332:in `write_to_splunk'
  2021-05-03 21:37:07 +0000 [warn]: #0 /usr/lib/ruby/gems/2.7.0/gems/fluent-plugin-splunk-hec-1.2.5/lib/fluent/plugin/out_splunk.rb:100:in `block in write'
  2021-05-03 21:37:07 +0000 [warn]: #0 /usr/lib/ruby/2.7.0/benchmark.rb:308:in `realtime'
  2021-05-03 21:37:07 +0000 [warn]: #0 /usr/lib/ruby/gems/2.7.0/gems/fluent-plugin-splunk-hec-1.2.5/lib/fluent/plugin/out_splunk.rb:99:in `write'
  2021-05-03 21:37:07 +0000 [warn]: #0 /usr/lib/ruby/gems/2.7.0/gems/fluentd-1.11.5/lib/fluent/compat/output.rb:131:in `write'
  2021-05-03 21:37:07 +0000 [warn]: #0 /usr/lib/ruby/gems/2.7.0/gems/fluentd-1.11.5/lib/fluent/plugin/output.rb:1136:in `try_flush'
  2021-05-03 21:37:07 +0000 [warn]: #0 /usr/lib/ruby/gems/2.7.0/gems/fluentd-1.11.5/lib/fluent/plugin/output.rb:1442:in `flush_thread_run'
  2021-05-03 21:37:07 +0000 [warn]: #0 /usr/lib/ruby/gems/2.7.0/gems/fluentd-1.11.5/lib/fluent/plugin/output.rb:462:in `block (2 levels) in start'
  2021-05-03 21:37:07 +0000 [warn]: #0 /usr/lib/ruby/gems/2.7.0/gems/fluentd-1.11.5/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2021-05-03 21:37:07 +0000 [warn]: #0 [clusterflow:banzai-logging:stdout:clusteroutput:banzai-logging:splunk-hec] bad chunk is moved to /tmp/fluent/backup/worker0/clusterflow_banzai-logging_stdout_clusteroutput_banzai-logging_splunk-hec/5c173bfc5e22c33477e93b6a1db4131b.log
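Reading the frames above: prometheus-client 2.x changed Counter#increment to accept only the by: and labels: keywords, while the splunk-hec plugin appears to pass its metric labels (type:, plugin_id:, status:) in the old pre-2.x style, hence the ArgumentError. One way to eyeball the call site without unpacking the image (the path is taken from the trace above):

docker run --rm --entrypoint sh ghcr.io/banzaicloud/fluentd:v1.11.5-alpine-18 -c "sed -n '150,160p' /usr/lib/ruby/gems/2.7.0/gems/fluent-plugin-splunk-hec-1.2.5/lib/fluent/plugin/out_splunk.rb"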
kenankule commented 3 years ago

Please see https://gist.github.com/kenankule/a8acfe0750992aa2daabdd0734649033 if you need to reproduce locally.

alimravac commented 3 years ago

Any updates on this issue? It still exists with the latest fluentd helm chart.

RaSerge commented 3 years ago

We can also confirm that this issue exists.

kenankule commented 3 years ago

I cannot reproduce this issue after Fluentd 1.13.3 upgrade.

RaSerge commented 3 years ago

I cannot reproduce this issue after Fluentd 1.13.3 upgrade.

Are you using the new logging-operator (release 3.14) with fluentd 1.13.3, or with 3.9.x? (We are stuck on 3.9.x because of this issue.)

kenankule commented 3 years ago

Sorry for being brief: we've upgraded the logging-operator chart to 3.14.2. It's currently in testing, and I'll be able to report on stability next week.

kenankule commented 3 years ago

Upgrading the logging-operator helm chart to 3.14.2 solved the issue. Closing.
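For reference, a minimal sketch of the upgrade path that resolved it here (release name, namespace, and chart repo alias are illustrative; the repo URL is the banzaicloud one from that era, adjust for current kube-logging charts):

helm repo add banzaicloud-stable https://kubernetes-charts.banzaicloud.com
helm repo update
helm upgrade --install logging-operator banzaicloud-stable/logging-operator --namespace logging --version 3.14.2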