kube-logging / logging-operator

Logging operator for Kubernetes
https://kube-logging.dev
Apache License 2.0

got unrecoverable error in primary and no secondary error_class=ArgumentError error="wrong number of arguments (given 4, expected 3)" #1716

Closed kefiras closed 2 months ago

kefiras commented 5 months ago

Describe the bug: Error when using syslog output

Expected behaviour: Logs should be sent to defined syslog cluster output

Steps to reproduce the bug: Configure below resource

apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterOutput
metadata:
  name: syslog
  namespace: logging
spec:
  syslog:
    buffer:
      timekey: 30s
      timekey_wait: 0s
    host: syslog.example.net
    insecure: true
    port: 20444
    transport: tls
---
apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterFlow
metadata:
  name: hosttailer-flow
  namespace: logging
spec:
  filters:
  - tag_normaliser: {}
  globalOutputRefs:
  - syslog
  match:
  - select:
      labels:
        app.kubernetes.io/name: host-tailer

Additional context: Fluentd throws errors:

2024-04-05 11:29:00 +0000 [warn]: #0 [clusterflow:logging:hosttailer-flow:clusteroutput:logging:syslog] got unrecoverable error in primary and no secondary error_class=ArgumentError error="wrong number of arguments (given 4, expected 3)"
  2024-04-05 11:29:00 +0000 [warn]: #0 /usr/local/bundle/gems/fluentd-1.16.3/lib/fluent/plugin_helper/socket.rb:41:in `socket_create'
  2024-04-05 11:29:00 +0000 [warn]: #0 /usr/local/bundle/gems/fluent-plugin-syslog_rfc5424-0.9.0.rc.8/lib/fluent/plugin/out_syslog_rfc5424.rb:65:in `find_or_create_socket'
  2024-04-05 11:29:00 +0000 [warn]: #0 /usr/local/bundle/gems/fluent-plugin-syslog_rfc5424-0.9.0.rc.8/lib/fluent/plugin/out_syslog_rfc5424.rb:39:in `write'
  2024-04-05 11:29:00 +0000 [warn]: #0 /usr/local/bundle/gems/fluentd-1.16.3/lib/fluent/plugin/output.rb:1225:in `try_flush'
  2024-04-05 11:29:00 +0000 [warn]: #0 /usr/local/bundle/gems/fluentd-1.16.3/lib/fluent/plugin/output.rb:1538:in `flush_thread_run'
  2024-04-05 11:29:00 +0000 [warn]: #0 /usr/local/bundle/gems/fluentd-1.16.3/lib/fluent/plugin/output.rb:510:in `block (2 levels) in start'
  2024-04-05 11:29:00 +0000 [warn]: #0 /usr/local/bundle/gems/fluentd-1.16.3/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2024-04-05 11:29:00 +0000 [warn]: #0 [clusterflow:logging:hosttailer-flow:clusteroutput:logging:syslog] bad chunk is moved to /buffers/backup/worker0/clusterflow_logging_hosttailer-flow_clusteroutput_logging_syslog/61557c1e8b4b20b9380467be5ff0a45b.log
2024-04-05 11:29:01 +0000 [warn]: #0 [clusterflow:logging:hosttailer-flow:clusteroutput:logging:syslog] got unrecoverable error in primary and no secondary error_class=ArgumentError error="wrong number of arguments (given 4, expected 3)"
  2024-04-05 11:29:01 +0000 [warn]: #0 /usr/local/bundle/gems/fluentd-1.16.3/lib/fluent/plugin_helper/socket.rb:41:in `socket_create'
  2024-04-05 11:29:01 +0000 [warn]: #0 /usr/local/bundle/gems/fluent-plugin-syslog_rfc5424-0.9.0.rc.8/lib/fluent/plugin/out_syslog_rfc5424.rb:65:in `find_or_create_socket'
  2024-04-05 11:29:01 +0000 [warn]: #0 /usr/local/bundle/gems/fluent-plugin-syslog_rfc5424-0.9.0.rc.8/lib/fluent/plugin/out_syslog_rfc5424.rb:39:in `write'
  2024-04-05 11:29:01 +0000 [warn]: #0 /usr/local/bundle/gems/fluentd-1.16.3/lib/fluent/plugin/output.rb:1225:in `try_flush'
  2024-04-05 11:29:01 +0000 [warn]: #0 /usr/local/bundle/gems/fluentd-1.16.3/lib/fluent/plugin/output.rb:1538:in `flush_thread_run'
  2024-04-05 11:29:01 +0000 [warn]: #0 /usr/local/bundle/gems/fluentd-1.16.3/lib/fluent/plugin/output.rb:510:in `block (2 levels) in start'
  2024-04-05 11:29:01 +0000 [warn]: #0 /usr/local/bundle/gems/fluentd-1.16.3/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2024-04-05 11:29:01 +0000 [warn]: #0 [clusterflow:logging:hosttailer-flow:clusteroutput:logging:syslog] bad chunk is moved to /buffers/backup/worker0/clusterflow_logging_hosttailer-flow_clusteroutput_logging_syslog/61557c20915175b74f5d02915b7386cb.log

Environment details:

/kind bug

pepov commented 5 months ago

@kefiras this error message alone doesn't tell much about the original problem

kefiras commented 5 months ago

Debug is already enabled

Contents of the bad chunk:

??f?.FsN??time?2024-04-09T10:18:22.776368974Z?message?:Apr  9 10:18:22 aks-prometheus-18130450-vmss000000 kernel: [498058.497065] calico-packet: IN=azve56f4c00502 OUT=azva623c2d61aa MAC=aa:aa:aa:aa:aa:aa:6a:73:f2:79:14:75:08:00 SRC=10.244.3.144 DST=10.244.3.135 LEN=60 TOS=0x00 PREC=0x00 TTL=63 ID=35709 DF PROTO=TCP SPT=36200 DPT=2020 WINDOW=64240 RES=0x00 SYN URGP=0 ?app?host-tailer?container_image?Lrepo-aks.qa.example.net/example/linux/exm/exm/vendor/fluent/fluent-bit:2.1.8?clustername?aks1kexm1?datacenter?eastus2?env?nonprod?family?logging?mnemonic?exm?hostname?"aks-prometheus-18130450-vmss000000?namespace?logging?pod_id?$2461060d-4eb9-41ec-8fe2-eefcf4bad090?pod_name?filetail-host-tailer-phq7s?service?syslog/ $ 

I haven't checked the receiving side, but I doubt anything is being sent.

stale[bot] commented 3 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions!

liz-86 commented 2 months ago

We encountered the same error. Is it possible to open this issue again?

pepov commented 2 months ago

@liz-86 can you add some details to this? do you see this error with the latest image versions as well?

liz-86 commented 2 months ago

Yes, we tested our configuration (much the same as the one above, but with tcp transport instead of tls) with the latest fluentd image (kube-logging/fluentd-images:v1.16-full). Our ClusterOutput:

apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterOutput
metadata:
  name: syslog
  namespace: logging
spec:
  syslog:
    buffer:
      flush_thread_count: 16
      timekey: 1m
      timekey_use_utc: true
      timekey_wait: 30s
    format:
      type: json
    host: syslog.example.net
    insecure: true
    port: 5056
    transport: tcp

The generated fluentd.conf is as follows (from the k8s secret logging-operator-logging-fluentd-app):

  <match **>
    @type syslog_rfc5424
    @id clusterflow:logging:syslog-flow:clusteroutput:logging:syslog-output
    host syslog.example.net
    insecure true
    port 5056
    transport tcp
    <buffer tag,time>
      @type file
      chunk_limit_size 8MB
      flush_thread_count 16
      path /buffers/clusterflow:logging:syslog-flow:clusteroutput:logging:syslog-output.*.buffer
      retry_forever true
      timekey 1m
      timekey_use_utc true
      timekey_wait 30s
    </buffer>
    <format>
      @type json
    </format>
  </match>

TimWelter commented 2 months ago

Same issue here.

Provider: RKE2
Kubernetes Version: v1.27.12+rke2r1
Chart: Logging (103.1.1+up4.4.0)

pepov commented 2 months ago

What are your fluentd and fluentbit image versions?

pepov commented 2 months ago

It seems I totally misunderstood the issue originally. I've looked at it once again, and it seems that the ruby3 upgrade broke the syslog plugin because of the deprecation (in Ruby 2.7) and removal (in Ruby 3.0) of the automatic conversion between a trailing hash argument and keyword arguments: https://blog.saeloun.com/2019/10/07/ruby-2-7-keyword-arguments-redesign/

I've made a change here: https://github.com/pepov/fluent-plugin-syslog_rfc5424/commit/6404b617bc8d5ddd9cf4628cb601cf9b4718e7fb

Then applied on my fork of the fluentd image here: https://github.com/kube-logging/fluentd-images/compare/main...pepov:fluentd-images:main

I didn't have time to test it with a syslog receiver. Could you please give it a try with ghcr.io/pepov/fluentd:v1.16-full?
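
For readers following the thread, the keyword-argument change described above can be reproduced in a few lines of Ruby. This is only a sketch: the `socket_create` below is a hypothetical stand-in with three positional parameters plus keyword options, not fluentd's actual helper, and `socket_options`, `call_with_positional_hash` and `call_with_double_splat` are illustrative names. It assumes the plugin passed its options as a plain Hash in the fourth position, which would be consistent with the "given 4, expected 3" message in the logs.

```ruby
# Hypothetical stand-in for a helper that takes three positional parameters
# plus keyword options. This is NOT fluentd's real socket_create.
def socket_create(proto, host, port, **kwargs)
  { proto: proto, host: host, port: port, options: kwargs }
end

# Options collected in a plain Hash, as a plugin might build them from its config.
socket_options = { insecure: true }

# Old call style: the Hash is passed as a fourth positional argument.
# Ruby 2.x converted it to keyword arguments implicitly (2.7 with a warning);
# Ruby 3 treats it as positional and raises ArgumentError.
def call_with_positional_hash(options)
  socket_create(:tls, "syslog.example.net", 20444, options)
rescue ArgumentError => e
  e.message
end

# Fixed call style: the double splat passes the Hash as keyword arguments
# explicitly, which works on both Ruby 2.x and Ruby 3.
def call_with_double_splat(options)
  socket_create(:tls, "syslog.example.net", 20444, **options)
end

puts call_with_positional_hash(socket_options).inspect
puts call_with_double_splat(socket_options).inspect
```

On Ruby 3 the first call should return the ArgumentError message instead of a result, while the double-splat call returns the options intact on either Ruby version.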

liz-86 commented 2 months ago

Thanks for looking into the issue. I can confirm that with the new image there are no more errors in fluentd. I need to talk to another team to see if they are getting the desired logs. But it looks good at the moment.

Thanks again!

EDIT: All seems to be working perfectly. The other teams are getting logs. :)

pepov commented 2 months ago

thx for the confirmation, I'm making the PRs to have the fix released asap

pepov commented 2 months ago

The images have been updated with the fix as of the 148th build: v1.16-full-build.148 and v1.16-full

For logging operator 4.8: v1.16-4.8-full-build.148 and v1.16-4.8-full