Closed: kenankule closed this issue 3 years ago
Hi @kenankule, sorry for the late response. Could you please check it with the following fluentd image? v1.11.5-alpine-18
I get the following error with v1.11.5-alpine-18:
1: from /usr/lib/ruby/2.7.0/rubygems/specification.rb:1369:in `activate'
/usr/lib/ruby/2.7.0/rubygems/specification.rb:2247:in `raise_if_conflicts': Unable to activate fluent-plugin-prometheus-2.0.1, because prometheus-client-0.9.0 conflicts with prometheus-client (>= 2.1.0) (Gem::ConflictError)
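The conflict is mechanical: fluent-plugin-prometheus 2.0.1 declares a dependency on prometheus-client >= 2.1.0, while the prometheus-client already activated in the image is 0.9.0. A minimal sketch of the check RubyGems performs at activation time (versions taken from the error text above):

```ruby
require "rubygems"

# Requirement declared by fluent-plugin-prometheus 2.0.1 (from the error above).
requirement = Gem::Requirement.new(">= 2.1.0")

# Version of prometheus-client already activated in the image.
activated = Gem::Version.new("0.9.0")

# RubyGems raises Gem::ConflictError when this check fails during activation.
puts requirement.satisfied_by?(activated)  # => false
```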
@kenankule thanks for the quick answer! Could you please check the image tag version again? Sorry about that, but this error looks like something we had already solved. :(
Sample fluentd config:
#sample.conf
# Enable RPC endpoint (this allows triggering a config reload without a restart)
<system>
  rpc_endpoint 127.0.0.1:24444
  log_level info
  workers 1
</system>

# Prometheus monitoring
<source>
  @type prometheus
  port 24231
  metrics_path /metrics
</source>
<source>
  @type prometheus_monitor
</source>
<source>
  @type prometheus_output_monitor
</source>

<source>
  @type forward
  @id main_forward
  bind 0.0.0.0
  port 24240
</source>

<match **>
  @type label_router
  @id main
  metrics true
  <route>
    @label @stdoutlabel
    metrics_labels {"id":"clusterflow:banzai-logging:stdout"}
    <match>
      negate false
    </match>
  </route>
</match>

<label @stdoutlabel>
  <match **>
    @type splunk_hec
    @id clusterflow:banzai-logging:stdout:clusteroutput:banzai-logging:splunk-hec
    hec_host my_hec_host
    hec_token my_hec_token
    index my_index
    protocol https
    source fluentd
    <buffer []>
      @type file
      path /buffers/clusterflow:banzai-logging:stdout:clusteroutput:banzai-logging:splunk-hec.*.buffer
      retry_forever true
      timekey 10m
      timekey_wait 10m
    </buffer>
  </match>
</label>
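The rpc_endpoint enabled in the <system> block above can be driven from the shell once fluentd is running with this config; a sketch using endpoint paths from Fluentd's RPC documentation:

```shell
# Trigger a config reload without restarting fluentd (this is what the
# rpc_endpoint in the <system> block is for).
curl -s http://127.0.0.1:24444/api/config.reload

# Dump the currently loaded configuration to fluentd's log:
curl -s http://127.0.0.1:24444/api/config.dump
```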
Sample run:
docker run -ti -v `pwd`/sample.conf:/tmp/fluentd.conf banzaicloud/fluentd:v1.11.5-alpine-18 --dry-run -v -c /tmp/fluentd.conf
Thanks! We'll try to solve this as soon as we can.
Ah! The image on ghcr.io with the same tag (fluentd:v1.11.5-alpine-18) worked! Maybe the docker.io image is not the same as the ghcr.io image.
docker run -ti -v `pwd`/bundle.conf:/tmp/fluentd.conf ghcr.io/banzaicloud/fluentd:v1.11.5-alpine-18 --dry-run -v -c /tmp/fluentd.conf
...
2021-04-30 20:47:04 +0000 [info]: fluent/log.rb:329:info: finished dry run mode
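One way to verify the suspicion that docker.io and ghcr.io serve different bits under the same tag (a sketch; assumes both tags are still pullable):

```shell
# Pull both tags, then compare the local image IDs (content hashes).
# If the two registries host the same build, the IDs are identical.
docker pull docker.io/banzaicloud/fluentd:v1.11.5-alpine-18
docker pull ghcr.io/banzaicloud/fluentd:v1.11.5-alpine-18

docker inspect --format '{{.Id}}' docker.io/banzaicloud/fluentd:v1.11.5-alpine-18
docker inspect --format '{{.Id}}' ghcr.io/banzaicloud/fluentd:v1.11.5-alpine-18
```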
This is strange, I'll check it, thanks!
Hi @kenankule, could you please confirm whether this version is working fine now (not just in dry-run mode)? By the way, this is our fix in the plugin, and if it's working fine we'll try to contribute it back. Thanks for your support.
I've updated the logging object with the fluentd tag v1.11.5-alpine-18. The operator was initially deployed using helm chart 3.9.2. I hope that's a good enough test, because I don't have an environment to try a fresh install. I'll run it for a couple of hours, check the Prometheus metrics to see if it's OK, and update the issue.
Great, thanks!
After some hours of testing, I've seen that some log files are created under /tmp/fluent/backup/worker0.
Looking at the amount of logs sent to Splunk, I believe the logs are sent successfully, but they are still stored in the /tmp/fluent/backup/worker0 folder, in log files that are created every minute. I have the feeling that even the chunks sent successfully to Splunk are marked as "not successful" or something.
The Grafana dashboard looks fine, so the metrics are working. Is there anything else I could investigate? I've checked another cluster running the alpine-11 version, and that fluentd does not seem to write logs to the tmp folder.
I think I was able to reproduce the error in a local setup. I'm running the new image against a Splunk HEC (locally) and I get the following error:
2021-05-03 21:37:07 +0000 [warn]: #0 [clusterflow:banzai-logging:stdout:clusteroutput:banzai-logging:splunk-hec] got unrecoverable error in primary and no secondary error_class=ArgumentError error="unknown keywords: :type, :plugin_id, :status"
2021-05-03 21:37:07 +0000 [warn]: #0 /usr/lib/ruby/gems/2.7.0/gems/prometheus-client-2.1.0/lib/prometheus/client/counter.rb:13:in `increment'
2021-05-03 21:37:07 +0000 [warn]: #0 /usr/lib/ruby/gems/2.7.0/gems/fluent-plugin-splunk-hec-1.2.5/lib/fluent/plugin/out_splunk.rb:156:in `process_response'
2021-05-03 21:37:07 +0000 [warn]: #0 /usr/lib/ruby/gems/2.7.0/gems/fluent-plugin-splunk-hec-1.2.5/lib/fluent/plugin/out_splunk_hec.rb:332:in `write_to_splunk'
2021-05-03 21:37:07 +0000 [warn]: #0 /usr/lib/ruby/gems/2.7.0/gems/fluent-plugin-splunk-hec-1.2.5/lib/fluent/plugin/out_splunk.rb:100:in `block in write'
2021-05-03 21:37:07 +0000 [warn]: #0 /usr/lib/ruby/2.7.0/benchmark.rb:308:in `realtime'
2021-05-03 21:37:07 +0000 [warn]: #0 /usr/lib/ruby/gems/2.7.0/gems/fluent-plugin-splunk-hec-1.2.5/lib/fluent/plugin/out_splunk.rb:99:in `write'
2021-05-03 21:37:07 +0000 [warn]: #0 /usr/lib/ruby/gems/2.7.0/gems/fluentd-1.11.5/lib/fluent/compat/output.rb:131:in `write'
2021-05-03 21:37:07 +0000 [warn]: #0 /usr/lib/ruby/gems/2.7.0/gems/fluentd-1.11.5/lib/fluent/plugin/output.rb:1136:in `try_flush'
2021-05-03 21:37:07 +0000 [warn]: #0 /usr/lib/ruby/gems/2.7.0/gems/fluentd-1.11.5/lib/fluent/plugin/output.rb:1442:in `flush_thread_run'
2021-05-03 21:37:07 +0000 [warn]: #0 /usr/lib/ruby/gems/2.7.0/gems/fluentd-1.11.5/lib/fluent/plugin/output.rb:462:in `block (2 levels) in start'
2021-05-03 21:37:07 +0000 [warn]: #0 /usr/lib/ruby/gems/2.7.0/gems/fluentd-1.11.5/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2021-05-03 21:37:07 +0000 [warn]: #0 [clusterflow:banzai-logging:stdout:clusteroutput:banzai-logging:splunk-hec] bad chunk is moved to /tmp/fluent/backup/worker0/clusterflow_banzai-logging_stdout_clusteroutput_banzai-logging_splunk-hec/5c173bfc5e22c33477e93b6a1db4131b.log
Please see https://gist.github.com/kenankule/a8acfe0750992aa2daabdd0734649033 if you need to reproduce locally.
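The "unknown keywords" error matches the breaking API change in the prometheus-client gem: before 0.10, labels were a plain positional hash; from 0.10 on, they must be passed through the labels: keyword. A minimal sketch of the mismatch (simplified stand-in signatures, not the gem's actual code):

```ruby
class OldStyleCounter
  # prometheus-client 0.9.x style: labels were a positional hash argument.
  def increment(labels = {}, by = 1)
    [labels, by]
  end
end

class NewStyleCounter
  # prometheus-client >= 0.10 style: labels must use the :labels keyword.
  def increment(by: 1, labels: {})
    [labels, by]
  end
end

# The old call style used by fluent-plugin-splunk-hec 1.2.5's process_response:
call = ->(counter) { counter.increment(type: "splunk_hec", plugin_id: "main", status: "200") }

call.(OldStyleCounter.new)  # fine: the keywords collapse into the labels hash

begin
  call.(NewStyleCounter.new)
rescue ArgumentError => e
  puts e.message  # unknown keywords: :type, :plugin_id, :status
end
```

This is why the image only breaks once it ships prometheus-client 2.1.0: the plugin's call site was written against the pre-0.10 signature.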
Any updates on this issue? It still exists in the latest fluentd helm chart.
We can also confirm that this issue exists.
I cannot reproduce this issue after Fluentd 1.13.3 upgrade.
Are you using the new logging-operator (release 3.14) with fluentd 1.13.3, or 3.9.x? (We are stuck on 3.9.x because of this issue.)
Sorry for being brief: we've upgraded the logging-operator chart to 3.14.2. It is currently in testing, and I will be able to report on stability next week.
Upgrading the logging-operator helm chart to 3.14.2 solved the issue. Closing.
Describe the bug: If the operator is installed using the helm chart version 3.9.4, splunk-hec output plugin doesn't work due to an incompatibility between fluent-plugin-prometheus and fluent-plugin-splunk-hec.
The issue shows itself with the message "ConfigError error="Duplicated plugin id ..." in the fluentd-configcheck pods during the dry-run.
Expected behaviour:
Steps to reproduce the bug: Check expected behaviour section.
Additional context: The fluentd image used in the helm chart, ghcr.io/banzaicloud/fluentd:v1.11.5-alpine-12, has fluent-plugin-prometheus:v2.0.0 installed. fluent-plugin-splunk-hec seems to be compatible only with fluent-plugin-prometheus:v1.8.x. There is a related issue reported on the fluent-plugin-splunk-hec side: https://github.com/splunk/fluent-plugin-splunk-hec/issues/163
When I tried to downgrade to helm chart version 3.9.2, it worked. I have not tried using the latest chart version with a fluentd image tag override. I've also looked at the generated fluentd config: if I remove the prometheus sections coming from input.conf, the dry run completes without a problem on the latest image (fluentd:v1.11.5-alpine-12).
Still, moving from 3.9.2 to 3.9.4 should not break a plugin.
Environment details:
/kind bug