kube-logging / logging-operator

Logging operator for Kubernetes
https://kube-logging.dev
Apache License 2.0

Fluentbit connection refused #579

Closed: lucasclerissepro closed this issue 3 years ago

lucasclerissepro commented 3 years ago

Describe the bug:

Fluent Bit is unable to reach Fluentd with the default setup.

Logs from fluentbit pod:

```
2020-09-10T06:27:54.268412432Z Fluent Bit v1.5.4
2020-09-10T06:27:54.268449044Z * Copyright (C) 2019-2020 The Fluent Bit Authors
2020-09-10T06:27:54.268454259Z * Copyright (C) 2015-2018 Treasure Data
2020-09-10T06:27:54.268457953Z * Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
2020-09-10T06:27:54.268461045Z * https://fluentbit.io
2020-09-10T06:27:54.268464262Z 
2020-09-10T06:27:54.269581428Z [2020/09/10 06:27:54] [ info] [engine] started (pid=1)
2020-09-10T06:27:54.269592699Z [2020/09/10 06:27:54] [ info] [storage] version=1.0.5, initializing...
2020-09-10T06:27:54.269596303Z [2020/09/10 06:27:54] [ info] [storage] root path '/buffers'
2020-09-10T06:27:54.269599543Z [2020/09/10 06:27:54] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
2020-09-10T06:27:54.269602531Z [2020/09/10 06:27:54] [ info] [storage] backlog input plugin: storage_backlog.1
2020-09-10T06:27:54.40060536Z [2020/09/10 06:27:54] [ info] [input:storage_backlog:storage_backlog.1] queue memory limit: 4.8M
2020-09-10T06:27:54.400633782Z [2020/09/10 06:27:54] [ info] [filter:kubernetes:kubernetes.0] https=1 host=kubernetes.default.svc port=443
2020-09-10T06:27:54.400863924Z [2020/09/10 06:27:54] [ info] [filter:kubernetes:kubernetes.0] local POD info OK
2020-09-10T06:27:54.400877016Z [2020/09/10 06:27:54] [ info] [filter:kubernetes:kubernetes.0] testing connectivity with API server...
2020-09-10T06:27:54.42508188Z [2020/09/10 06:27:54] [ info] [filter:kubernetes:kubernetes.0] API server connectivity OK
2020-09-10T06:27:54.425135276Z [2020/09/10 06:27:54] [ info] [sp] stream processor started
2020-09-10T06:27:54.849997936Z [2020/09/10 06:27:54] [ info] inotify_fs_add(): inode=1316674 watch_fd=1 name=/var/log/containers/covergo-concierge-6857685ddb-lfvxn_default_covergo-concierge-c98cf90ab9de116de7a902cf9e701e50a82d4c71c7a152067bc5678ff312fe2c.log
2020-09-10T06:27:56.341047276Z [2020/09/10 06:27:56] [ info] inotify_fs_add(): inode=1521344 watch_fd=2 name=/var/log/containers/prometheus-operator-grafana-c7689fcdf-hbwct_monitoring_init-chown-data-5140170d9cbc0a7bf8d8989e35fda734ae115084f58e6310ebbdc1205d963840.log
2020-09-10T06:27:56.444422117Z [2020/09/10 06:27:56] [ info] inotify_fs_add(): inode=1209419 watch_fd=3 name=/var/log/containers/alertmanager-prometheus-operator-alertmanager-0_monitoring_alertmanager-44fcce19a1e40f554fdc01fc6ac17b544d8194eaad2d1e6fe11e143e9a02685e.log
2020-09-10T06:27:56.444448735Z [2020/09/10 06:27:56] [ info] inotify_fs_add(): inode=1210044 watch_fd=4 name=/var/log/containers/alertmanager-prometheus-operator-alertmanager-0_monitoring_config-reloader-8d64b51f7f195835aeb37200265dfb56b5859862d505341941a04e0dbe5ac536.log
2020-09-10T06:27:56.444453226Z [2020/09/10 06:27:56] [ info] inotify_fs_add(): inode=1259780 watch_fd=5 name=/var/log/containers/cert-manager-webhook-gandi-585f67dc4f-p987d_cert-manager_cert-manager-webhook-gandi-66be78e19276322b3bec069ad451bd5fba781091dcd7763c9a094969872e6363.log
2020-09-10T06:27:56.444457116Z [2020/09/10 06:27:56] [ info] inotify_fs_add(): inode=1316794 watch_fd=6 name=/var/log/containers/cert-manager-webhook-gandi-585f67dc4f-p987d_cert-manager_cert-manager-webhook-gandi-b44fcd5a40c2b972fde2f1f1f5a92156dc85addfc6fd282dbe9e1b8e73b49041.log
2020-09-10T06:27:56.453071895Z [2020/09/10 06:27:56] [ info] inotify_fs_add(): inode=1251499 watch_fd=7 name=/var/log/containers/covergo-apiwebsite-7f78844b8f-w8x6s_default_covergo-apiwebsite-0286a379ebdbdd7a6435776d5b654f202fbb6d65532cbc158ebcc0bed6fbe907.log
2020-09-10T06:27:56.460085367Z [2020/09/10 06:27:56] [ info] inotify_fs_add(): inode=1316460 watch_fd=8 name=/var/log/containers/covergo-commissions-55b996669c-9rflc_default_covergo-commissions-5c2a800eb6ab5f5678bc355d14ba7432c88ecc1c8597441f56f1ed59d24b137a.log
2020-09-10T06:27:56.460102378Z [2020/09/10 06:27:56] [ info] inotify_fs_add(): inode=1260194 watch_fd=9 name=/var/log/containers/covergo-commissions-55b996669c-9rflc_default_covergo-commissions-8e54bd8fbf6ca062b7d5102bad3ec0627a2339be74510508c63ac9b7c3dea85f.log
2020-09-10T06:27:56.470808035Z [2020/09/10 06:27:56] [ info] inotify_fs_add(): inode=1316323 watch_fd=10 name=/var/log/containers/covergo-jetco-asia-7cd649567-cr9zv_default_covergo-jetco-asia-3db3c6154683c95b9d751b5dfeb499555671923c840f355809cb2eb8e51d3b10.log
2020-09-10T06:27:56.470824096Z [2020/09/10 06:27:56] [ info] inotify_fs_add(): inode=410285 watch_fd=11 name=/var/log/containers/covergo-jetco-asia-7cd649567-cr9zv_default_covergo-jetco-asia-8fd0963490b4ad3f858fc2d8fc04aa6d71d7a479afd2a86ab9405ef8f58fdf79.log
2020-09-10T06:27:56.47600301Z [2020/09/10 06:27:56] [ info] inotify_fs_add(): inode=1069727 watch_fd=12 name=/var/log/containers/covergo-jetco-asia-uat-755dbd58b4-xdhnv_default_covergo-jetco-asia-uat-f5a60dab37fa67394a25941927bfca0e08824a5068a02507c5e036d7925e2d08.log
2020-09-10T06:27:56.476018983Z [2020/09/10 06:27:56] [ info] inotify_fs_add(): inode=1995624 watch_fd=13 name=/var/log/containers/covergo-logging-bd8bdcddd-b6bx6_default_covergo-logging-1-40ae0a9ca27a097a002fedbc3190e141982755e3f053cc78d6ac0198a64ef52f.log
2020-09-10T06:27:56.546883415Z [2020/09/10 06:27:56] [ warn] [input] tail.0 paused (mem buf overlimit)
2020-09-10T06:27:57.553905995Z [2020/09/10 06:27:57] [error] [src/flb_io.c:171 errno=111] Connection refused
2020-09-10T06:27:57.553944176Z [2020/09/10 06:27:57] [error] [io] connection #80 failed to: covergo-fluentd.logging.svc:24240
2020-09-10T06:27:57.553948584Z [2020/09/10 06:27:57] [error] [output:forward:forward.0] no upstream connections available
2020-09-10T06:27:58.557856778Z [2020/09/10 06:27:58] [error] [src/flb_io.c:171 errno=111] Connection refused
2020-09-10T06:27:58.557893674Z [2020/09/10 06:27:58] [error] [io] connection #80 failed to: covergo-fluentd.logging.svc:24240
2020-09-10T06:27:58.557906244Z [2020/09/10 06:27:58] [error] [output:forward:forward.0] no upstream connections available
2020-09-10T06:27:59.562471605Z [2020/09/10 06:27:59] [error] [src/flb_io.c:171 errno=111] Connection refused
2020-09-10T06:27:59.562498966Z [2020/09/10 06:27:59] [error] [io] connection #80 failed to: covergo-fluentd.logging.svc:24240
2020-09-10T06:27:59.562503855Z [2020/09/10 06:27:59] [error] [output:forward:forward.0] no upstream connections available
2020-09-10T06:28:00.565826848Z [2020/09/10 06:28:00] [error] [src/flb_io.c:171 errno=111] Connection refused
2020-09-10T06:28:00.565860127Z [2020/09/10 06:28:00] [error] [io] connection #80 failed to: covergo-fluentd.logging.svc:24240
2020-09-10T06:28:00.56588899Z [2020/09/10 06:28:00] [error] [output:forward:forward.0] no upstream connections available
```

Expected behaviour:

Steps to reproduce the bug:

Using these manifests after installing the operator:

```yaml
apiVersion: logging.banzaicloud.io/v1beta1
kind: Logging
metadata:
  name: logging-operator
spec:
  fluentd:
    scaling:
      replicas: 1
    bufferStorageVolume:
      pvc:
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 20Gi
          storageClassName: alicloud-disk-ssd
  fluentbit:
    filterKubernetes:
      tls.verify: "Off"
  controlNamespace: logging
---
apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterOutput
metadata:
  name: loki-sink
spec:
  loki:
    url: http://loki:3100
    configure_kubernetes_labels: true
    buffer:
      timekey: 1m
      timekey_wait: 30s
      timekey_use_utc: true
---
apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterFlow
metadata:
  name: loki-apps
spec:
  outputRefs:
    - loki-sink
  match:
    - select:
        namespaces:
          - default
```

Environment details:


- Cloud-provider/provisioner: Alicloud
- logging-operator version: latest

/kind bug
tarokkk commented 3 years ago

Can you please confirm whether the error still occurs if you are not using the Loki output?
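
For anyone running that isolation test, a minimal sketch (assuming your operator version ships the nullout output type) is to point the existing ClusterFlow at a null sink instead of loki-sink, so the fluentbit-to-fluentd forward path is exercised without the Loki plugin:

```yaml
apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterOutput
metadata:
  name: null-sink
spec:
  # Discards all records: if "Connection refused" persists with this
  # output, the problem is in the fluentbit -> fluentd forward path,
  # not in the Loki plugin.
  nullout: {}
```

Then change outputRefs in the loki-apps ClusterFlow from loki-sink to null-sink.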

sdurrheimer commented 3 years ago

I'm having a similar kind of issue.

I'm using a cloudwatch output. Logs work great at first after deployment, but at some point fluentbit starts complaining about no upstream connections, and no more logs reach my backend.

```
logging-operator-logging-fluentbit-hft58 fluent-bit [2020/10/27 10:15:09] [error] [io_tls] flb_io_tls.c:356 NET - Connection was reset by peer
logging-operator-logging-fluentbit-hft58 fluent-bit [2020/10/27 10:15:09] [error] [output:forward:forward.0] no upstream connections available
logging-operator-logging-fluentbit-hft58 fluent-bit [2020/10/27 10:15:09] [ warn] [engine] failed to flush chunk '1-1603719773.125817609.flb', retry in 811 seconds: task_id=620, input=tail.0 > output=forward.0
```

I can also see the following error in the /fluentd/log/out file of the fluentd pod:

```
2020-10-27 10:19:27 +0000 [warn]: #0 [main_forward] unexpected error before accepting TLS connection by OpenSSL error_class=OpenSSL::SSL::SSLError error="SSL_accept returned=1 errno=0 state=error: certificate verify failed (certificate signature failure)"
```
jasonaliyetti commented 3 years ago

I'm hitting this as well. Any time I change flow definitions, fluentbit suddenly stops being able to connect to fluentd, and I see the TLS error (the same one mentioned above) on the fluentd side. FWIW, the error definitely seems to be reproducible with the default configuration from the logging-operator-logging chart.

jasonaliyetti commented 3 years ago

I verified that disabling TLS prevents this behavior.
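
For reference, a minimal sketch of what "disabling TLS" can look like on the Logging resource (assuming the tls.enabled fields exposed by the operator; if you installed through the logging-operator-logging chart, the corresponding chart value would need changing instead):

```yaml
apiVersion: logging.banzaicloud.io/v1beta1
kind: Logging
metadata:
  name: logging-operator
spec:
  controlNamespace: logging
  fluentd:
    tls:
      enabled: false   # fluentd accepts plain-text forward traffic on 24240
  fluentbit:
    tls:
      enabled: false   # fluentbit connects without TLS
```

Note that this only sidesteps the certificate problem; the forward traffic between the pods is then unencrypted.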

tarokkk commented 3 years ago

If you are using TLS, you need to configure the certificates properly. You can use One Eye to simplify the install flow for fluentbit and fluentd:

```
$ one-eye cert-manager install
$ one-eye logging install
$ one-eye logging configure --secure
```
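
For those not using One Eye, the same result can be assembled by hand: issue a CA-signed certificate pair (for example with cert-manager) and reference the secrets from the Logging resource. A hedged sketch, assuming the tls secretName/sharedKey fields and hypothetical secrets fluentd-tls and fluentbit-tls that each contain ca.crt, tls.crt, and tls.key signed by the same CA:

```yaml
apiVersion: logging.banzaicloud.io/v1beta1
kind: Logging
metadata:
  name: logging-operator
spec:
  controlNamespace: logging
  fluentd:
    tls:
      enabled: true
      secretName: fluentd-tls     # hypothetical cert-manager-issued secret
      sharedKey: example-shared-key
  fluentbit:
    tls:
      enabled: true
      secretName: fluentbit-tls   # must chain to the same CA as the fluentd cert
      sharedKey: example-shared-key
```

The "certificate verify failed (certificate signature failure)" error quoted above is the typical symptom of the two sides presenting certificates that do not share a CA.
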
sdurrheimer commented 3 years ago

I'm not even sure TLS is the issue here, because I'm still seeing logs stop being forwarded due to "no upstream connections available" errors:

```
[2020/11/12 17:03:21] [error] [io] connection #105 failed to: logging-operator-logging-fluentd.logging.svc:24240
[2020/11/12 17:03:21] [error] [output:forward:forward.0] no upstream connections available
```

Killing the fluentbit and fluentd pods makes it work again, but after a short time the "no upstream connections available" errors happen again.

Going inside the fluentd pod and looking for listening ports:

```
/ $ netstat -tunapl
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.1:24444         0.0.0.0:*               LISTEN      6/ruby
```

The port 24240 is not listed.

Checking the running processes:

```
/ $ ps afx
PID   USER     TIME  COMMAND
    1 fluent    0:02 tini -- /bin/entrypoint.sh fluentd -o /fluentd/log/out --log-rotate-age 10 --log-rotate-size 10485760
    6 fluent    0:53 {fluentd} /usr/bin/ruby /usr/bin/fluentd -o /fluentd/log/out --log-rotate-age 10 --log-rotate-size 10485760 -c /fluentd/etc/fluent.conf -p /fluentd/plugins
   25 fluent    0:00 sh
 1535 fluent    0:27 /usr/bin/ruby -Eascii-8bit:ascii-8bit /usr/bin/fluentd -o /fluentd/log/out --log-rotate-age 10 --log-rotate-size 10485760 -c /fluentd/etc/fluent.conf -p /fluentd/plugins --under-supervisor
 1561 fluent    0:00 sh
 1567 fluent    0:00 ps afx
```

Checking /fluentd/log/out, we can see the fluentd worker being killed with SIGKILL for no apparent reason (the last part repeats endlessly):

Logs

``` # Logfile created on 2020-11-12 17:07:28 +0000 by logger.rb/v1.4.2 2020-11-12 17:07:28 +0000 [info]: parsing config file is succeeded path="/fluentd/etc/fluent.conf" 2020-11-12 17:07:28 +0000 [info]: gem 'fluent-mixin-config-placeholders' version '0.4.0' 2020-11-12 17:07:28 +0000 [info]: gem 'fluent-plugin-aws-elasticsearch-service' version '2.4.0' 2020-11-12 17:07:28 +0000 [info]: gem 'fluent-plugin-azurestorage' version '0.1.0' 2020-11-12 17:07:28 +0000 [info]: gem 'fluent-plugin-cloudwatch-logs' version '0.11.0' 2020-11-12 17:07:28 +0000 [info]: gem 'fluent-plugin-concat' version '2.4.0' 2020-11-12 17:07:28 +0000 [info]: gem 'fluent-plugin-datadog' version '0.12.1' 2020-11-12 17:07:28 +0000 [info]: gem 'fluent-plugin-dedot_filter' version '1.0.0' 2020-11-12 17:07:28 +0000 [info]: gem 'fluent-plugin-detect-exceptions' version '0.0.13' 2020-11-12 17:07:28 +0000 [info]: gem 'fluent-plugin-elasticsearch' version '4.2.2' 2020-11-12 17:07:28 +0000 [info]: gem 'fluent-plugin-gcs' version '0.4.0' 2020-11-12 17:07:28 +0000 [info]: gem 'fluent-plugin-geoip' version '1.3.2' 2020-11-12 17:07:28 +0000 [info]: gem 'fluent-plugin-grafana-loki' version '1.2.15' 2020-11-12 17:07:28 +0000 [info]: gem 'fluent-plugin-kafka' version '0.15.2' 2020-11-12 17:07:28 +0000 [info]: gem 'fluent-plugin-kinesis' version '3.2.3' 2020-11-12 17:07:28 +0000 [info]: gem 'fluent-plugin-kubernetes_metadata_filter' version '2.5.2' 2020-11-12 17:07:28 +0000 [info]: gem 'fluent-plugin-label-router' version '0.2.4' 2020-11-12 17:07:28 +0000 [info]: gem 'fluent-plugin-logdna' version '0.4.0' 2020-11-12 17:07:28 +0000 [info]: gem 'fluent-plugin-logzio' version '0.0.20' 2020-11-12 17:07:28 +0000 [info]: gem 'fluent-plugin-multi-format-parser' version '1.0.0' 2020-11-12 17:07:28 +0000 [info]: gem 'fluent-plugin-newrelic' version '1.1.10' 2020-11-12 17:07:28 +0000 [info]: gem 'fluent-plugin-oss' version '0.0.2' 2020-11-12 17:07:28 +0000 [info]: gem 'fluent-plugin-parser-logfmt' version '0.0.0' 2020-11-12 17:07:28 +0000 [info]: gem 'fluent-plugin-prometheus' version '1.8.4' 2020-11-12 17:07:28 +0000 [info]: gem 'fluent-plugin-record-modifier' version '2.1.0' 2020-11-12 17:07:28 +0000 [info]: gem 'fluent-plugin-redis' version '0.3.4' 2020-11-12 17:07:28 +0000 [info]: gem 'fluent-plugin-remote-syslog' version '1.1' 2020-11-12 17:07:28 +0000 [info]: gem 'fluent-plugin-rewrite-tag-filter' version '2.3.0' 2020-11-12 17:07:28 +0000 [info]: gem 'fluent-plugin-s3' version '1.4.0' 2020-11-12 17:07:28 +0000 [info]: gem 'fluent-plugin-splunk-hec' version '1.2.4' 2020-11-12 17:07:28 +0000 [info]: gem 'fluent-plugin-sumologic_output' version '1.7.1' 2020-11-12 17:07:28 +0000 [info]: gem 'fluent-plugin-tag-normaliser' version '0.1.1' 2020-11-12 17:07:28 +0000 [info]: gem 'fluent-plugin-throttle' version '0.0.5' 2020-11-12 17:07:28 +0000 [info]: gem 'fluent-plugin-webhdfs' version '1.2.5' 2020-11-12 17:07:28 +0000 [info]: gem 'fluentd' version '1.11.4' 2020-11-12 17:07:29 +0000 [info]: using configuration file: rpc_endpoint "127.0.0.1:24444" log_level info workers 1 @type prometheus port 24231 metrics_path "/metrics" @type prometheus_monitor @type prometheus_output_monitor @type forward @id main_forward bind "0.0.0.0" port 24240 @type label_router @id main metrics true @label "@6457b2c7b5df52ec69b11e19290b49b6" metrics_labels {"id":"clusterflow:logging:all-pods"} namespaces negate false @type null @id main-no-output 2020-11-12 17:07:29 +0000 [info]: starting fluentd-1.11.4 pid=6 ruby="2.7.1" 2020-11-12 17:07:29 +0000 [info]: spawn command to 
main: cmdline=["/usr/bin/ruby", "-Eascii-8bit:ascii-8bit", "/usr/bin/fluentd", "-o", "/fluentd/log/out", "--log-rotate-age", "10", "--log-rotate-size", "10485760", "-c", "/fluentd/etc/fluent.conf", "-p", "/fluentd/plugins", "--under-supervisor"] 2020-11-12 17:07:30 +0000 [info]: adding match in @6457b2c7b5df52ec69b11e19290b49b6 pattern="kubernetes.**" type="tag_normaliser" 2020-11-12 17:07:30 +0000 [info]: adding match in @6457b2c7b5df52ec69b11e19290b49b6 pattern="kubernetes.**" type="detect_exceptions" 2020-11-12 17:07:30 +0000 [info]: adding filter in @6457b2c7b5df52ec69b11e19290b49b6 pattern="**" type="record_transformer" 2020-11-12 17:07:30 +0000 [info]: adding match in @6457b2c7b5df52ec69b11e19290b49b6 pattern="**" type="cloudwatch_logs" 2020-11-12 17:07:31 +0000 [info]: adding match in @FLUENT_LOG pattern="fluent.*" type="null" 2020-11-12 17:07:31 +0000 [info]: adding match in @ERROR pattern="**" type="null" 2020-11-12 17:07:31 +0000 [info]: adding match pattern="**" type="label_router" 2020-11-12 17:07:31 +0000 [info]: adding match pattern="**" type="null" 2020-11-12 17:07:31 +0000 [info]: adding source type="prometheus" 2020-11-12 17:07:31 +0000 [info]: adding source type="prometheus_monitor" 2020-11-12 17:07:31 +0000 [info]: adding source type="prometheus_output_monitor" 2020-11-12 17:07:31 +0000 [info]: adding source type="forward" 2020-11-12 17:07:31 +0000 [warn]: parameter 'metrics' in @type label_router @id main metrics true @label "@6457b2c7b5df52ec69b11e19290b49b6" metrics_labels {"id":"clusterflow:logging:all-pods"} namespaces negate false is not used. 2020-11-12 17:07:31 +0000 [warn]: parameter 'metrics_labels' in @label "@6457b2c7b5df52ec69b11e19290b49b6" metrics_labels {"id":"clusterflow:logging:all-pods"} namespaces negate false is not used. 
2020-11-12 17:07:31 +0000 [info]: #0 starting fluentd worker pid=15 ppid=6 worker=0 2020-11-12 17:12:07 +0000 [info]: #0 [main_forward] listening port port=24240 bind="0.0.0.0" 2020-11-12 17:12:07 +0000 [info]: #0 fluentd worker is now running worker=0 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=71 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=69 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=84 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=32 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=47 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=32 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=72 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=84 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=84 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=80 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=47 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=49 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=46 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=49 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=13 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=10 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=72 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=111 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=115 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=116 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=58 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=32 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=108 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=111 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=99 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=97 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=108 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=104 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=111 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=115 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=116 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=58 2020-11-12 17:12:33 +0000 
[warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=50 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=52 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=50 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=52 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=48 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=13 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=10 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=85 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=115 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=101 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=114 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=45 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=65 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=103 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=101 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=110 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=116 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=58 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=32 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=99 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=117 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=114 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=108 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=47 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=55 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=46 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=54 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=57 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=46 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=49 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=13 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=10 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=65 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=99 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=99 
2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=101 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=112 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=116 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=58 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=32 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=42 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=47 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=42 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=13 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=10 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=13 2020-11-12 17:12:33 +0000 [warn]: #0 [main_forward] incoming chunk is broken: host="127.0.0.1" msg=10 2020-11-12 17:22:11 +0000 [info]: Worker 0 finished unexpectedly with signal SIGKILL 2020-11-12 17:22:12 +0000 [info]: adding match in @6457b2c7b5df52ec69b11e19290b49b6 pattern="kubernetes.**" type="tag_normaliser" 2020-11-12 17:22:12 +0000 [info]: adding match in @6457b2c7b5df52ec69b11e19290b49b6 pattern="kubernetes.**" type="detect_exceptions" 2020-11-12 17:22:12 +0000 [info]: adding filter in @6457b2c7b5df52ec69b11e19290b49b6 pattern="**" type="record_transformer" 2020-11-12 17:22:12 +0000 [info]: adding match in @6457b2c7b5df52ec69b11e19290b49b6 pattern="**" type="cloudwatch_logs" 2020-11-12 17:22:13 +0000 [info]: adding match in @FLUENT_LOG pattern="fluent.*" type="null" 2020-11-12 17:22:13 +0000 [info]: adding match in @ERROR pattern="**" type="null" 2020-11-12 17:22:13 +0000 [info]: adding match pattern="**" type="label_router" 2020-11-12 17:22:13 +0000 [info]: adding match pattern="**" type="null" 2020-11-12 17:22:13 +0000 [info]: adding source type="prometheus" 2020-11-12 17:22:13 +0000 [info]: adding source type="prometheus_monitor" 2020-11-12 17:22:13 +0000 [info]: adding source type="prometheus_output_monitor" 2020-11-12 17:22:13 +0000 [info]: adding source type="forward" 2020-11-12 17:22:13 +0000 [warn]: parameter 'metrics' in @type label_router @id main metrics true @label "@6457b2c7b5df52ec69b11e19290b49b6" metrics_labels {"id":"clusterflow:logging:all-pods"} namespaces negate false is not used. 2020-11-12 17:22:13 +0000 [warn]: parameter 'metrics_labels' in @label "@6457b2c7b5df52ec69b11e19290b49b6" metrics_labels {"id":"clusterflow:logging:all-pods"} namespaces negate false is not used. 
2020-11-12 17:22:13 +0000 [info]: #0 starting fluentd worker pid=54 ppid=6 worker=0 2020-11-12 17:26:42 +0000 [info]: #0 [main_forward] listening port port=24240 bind="0.0.0.0" 2020-11-12 17:26:42 +0000 [info]: #0 fluentd worker is now running worker=0 2020-11-12 17:32:48 +0000 [info]: Worker 0 finished unexpectedly with signal SIGKILL 2020-11-12 17:32:50 +0000 [info]: adding match in @6457b2c7b5df52ec69b11e19290b49b6 pattern="kubernetes.**" type="tag_normaliser" 2020-11-12 17:32:50 +0000 [info]: adding match in @6457b2c7b5df52ec69b11e19290b49b6 pattern="kubernetes.**" type="detect_exceptions" 2020-11-12 17:32:50 +0000 [info]: adding filter in @6457b2c7b5df52ec69b11e19290b49b6 pattern="**" type="record_transformer" 2020-11-12 17:32:50 +0000 [info]: adding match in @6457b2c7b5df52ec69b11e19290b49b6 pattern="**" type="cloudwatch_logs" 2020-11-12 17:32:50 +0000 [info]: adding match in @FLUENT_LOG pattern="fluent.*" type="null" 2020-11-12 17:32:50 +0000 [info]: adding match in @ERROR pattern="**" type="null" 2020-11-12 17:32:50 +0000 [info]: adding match pattern="**" type="label_router" 2020-11-12 17:32:51 +0000 [info]: adding match pattern="**" type="null" 2020-11-12 17:32:51 +0000 [info]: adding source type="prometheus" 2020-11-12 17:32:51 +0000 [info]: adding source type="prometheus_monitor" 2020-11-12 17:32:51 +0000 [info]: adding source type="prometheus_output_monitor" 2020-11-12 17:32:51 +0000 [info]: adding source type="forward" 2020-11-12 17:32:51 +0000 [warn]: parameter 'metrics' in @type label_router @id main metrics true @label "@6457b2c7b5df52ec69b11e19290b49b6" metrics_labels {"id":"clusterflow:logging:all-pods"} namespaces negate false is not used. 2020-11-12 17:32:51 +0000 [warn]: parameter 'metrics_labels' in @label "@6457b2c7b5df52ec69b11e19290b49b6" metrics_labels {"id":"clusterflow:logging:all-pods"} namespaces negate false is not used. 
2020-11-12 17:32:51 +0000 [info]: #0 starting fluentd worker pid=71 ppid=6 worker=0 2020-11-12 17:37:09 +0000 [info]: #0 [main_forward] listening port port=24240 bind="0.0.0.0" 2020-11-12 17:37:09 +0000 [info]: #0 fluentd worker is now running worker=0 2020-11-12 17:44:54 +0000 [info]: Worker 0 finished unexpectedly with signal SIGKILL 2020-11-12 17:44:55 +0000 [info]: adding match in @6457b2c7b5df52ec69b11e19290b49b6 pattern="kubernetes.**" type="tag_normaliser" 2020-11-12 17:44:55 +0000 [info]: adding match in @6457b2c7b5df52ec69b11e19290b49b6 pattern="kubernetes.**" type="detect_exceptions" 2020-11-12 17:44:55 +0000 [info]: adding filter in @6457b2c7b5df52ec69b11e19290b49b6 pattern="**" type="record_transformer" 2020-11-12 17:44:56 +0000 [info]: adding match in @6457b2c7b5df52ec69b11e19290b49b6 pattern="**" type="cloudwatch_logs" 2020-11-12 17:44:56 +0000 [info]: adding match in @FLUENT_LOG pattern="fluent.*" type="null" 2020-11-12 17:44:56 +0000 [info]: adding match in @ERROR pattern="**" type="null" 2020-11-12 17:44:56 +0000 [info]: adding match pattern="**" type="label_router" 2020-11-12 17:44:56 +0000 [info]: adding match pattern="**" type="null" 2020-11-12 17:44:56 +0000 [info]: adding source type="prometheus" 2020-11-12 17:44:56 +0000 [info]: adding source type="prometheus_monitor" 2020-11-12 17:44:56 +0000 [info]: adding source type="prometheus_output_monitor" 2020-11-12 17:44:56 +0000 [info]: adding source type="forward" 2020-11-12 17:44:56 +0000 [warn]: parameter 'metrics' in @type label_router @id main metrics true @label "@6457b2c7b5df52ec69b11e19290b49b6" metrics_labels {"id":"clusterflow:logging:all-pods"} namespaces negate false is not used. 2020-11-12 17:44:56 +0000 [warn]: parameter 'metrics_labels' in @label "@6457b2c7b5df52ec69b11e19290b49b6" metrics_labels {"id":"clusterflow:logging:all-pods"} namespaces negate false is not used. 2020-11-12 17:44:56 +0000 [info]: #0 starting fluentd worker pid=89 ppid=6 worker=0 2020-11-12 17:49:21 +0000 [info]: #0 [main_forward] listening port port=24240 bind="0.0.0.0" 2020-11-12 17:49:21 +0000 [info]: #0 fluentd worker is now running worker=0 2020-11-12 17:54:08 +0000 [info]: Worker 0 finished unexpectedly with signal SIGKILL ```

Could this be linked to memory or buffer limits? I'm not sure how this is handled automatically by the operator, but I'm seeing the same kind of errors when using fluentd directly (https://github.com/fluent/fluentd/issues/1422).

EDIT: it does indeed look like a resource limitation. If you are getting "no upstream connections available" from fluentbit, try bumping the fluentd resource requests and limits. I didn't try enabling TLS again.
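
In Logging-resource terms, that bump is a resources block on the fluentd section; a sketch with placeholder numbers (size them to your actual log volume):

```yaml
apiVersion: logging.banzaicloud.io/v1beta1
kind: Logging
metadata:
  name: logging-operator
spec:
  controlNamespace: logging
  fluentd:
    resources:
      requests:
        cpu: 500m
        memory: 512Mi
      limits:
        cpu: "1"
        memory: 1Gi   # a limit that is too low gets the worker OOM-killed
```

This also fits the log excerpt above: when the memory limit is exceeded, the kernel SIGKILLs the fluentd worker, the supervisor restarts it, and while the forward listener on 24240 is down, fluentbit reports "Connection refused" and "no upstream connections available".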

pepov commented 3 years ago

Sorry for not getting back to this; yes, this is typically caused by resource exhaustion. Fluentd also does not handle a large number of buffered chunks well at startup, so that can be a problem as well. Feel free to reopen if you think this is something we should be able to fix on the operator side.
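
As a sketch of keeping the chunk count in check, the buffer section of an output accepts the usual fluentd buffer parameters (assuming these fields are exposed by your operator version; the values are illustrative, not recommendations):

```yaml
apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterOutput
metadata:
  name: loki-sink
spec:
  loki:
    url: http://loki:3100
    configure_kubernetes_labels: true
    buffer:
      timekey: 1m
      timekey_wait: 30s
      timekey_use_utc: true
      chunk_limit_size: 8MB              # fewer, larger chunks to replay at startup
      total_limit_size: 8GB              # keep buffers comfortably inside the 20Gi PVC
      overflow_action: drop_oldest_chunk # prefer dropping old data over a stalled pipeline
```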