kube-logging / logging-operator

Logging operator for Kubernetes
https://kube-logging.dev
Apache License 2.0

Configcheck stuck on failed missing endpoint #849

Closed argais closed 1 year ago

argais commented 2 years ago

Describe the bug: On a freshly deployed cluster, ArgoCD deploys the logging operator together with a ClusterFlow and a ClusterOutput that push logs to Elasticsearch.

Because everything is deployed at the same time, Elasticsearch is not yet available when the Fluentd configcheck pod runs. The pod fails and stays stuck in that state until it is deleted.

Deleting it manually is undesirable, because we want these clusters to come up without manual intervention.

Expected behaviour: The configcheck should fail and retry until it eventually succeeds.

Steps to reproduce the bug: Deploy the latest version of the logging operator using Helm.

Cluster flow

  spec:
    filters:
    - dedot:
        de_dot_nested: true
        de_dot_separator: '-'
    - tag_normaliser:
        format: ${namespace_name}.${pod_name}.${container_name}
    globalOutputRefs:
    - es-output

Cluster output

  spec:
    elasticsearch:
      buffer:
        timekey: 5m
        timekey_use_utc: true
        timekey_wait: 30s
      data_stream_enable: true
      data_stream_name: logs-local-cluster
      host: elasticsearch-master
      port: 9200
      reconnect_on_error: true
      reload_connections: false
      reload_on_failure: true
      scheme: http
      ssl_verify: false

The Fluentd configcheck pod fails with the following logs:

fluentd -c /fluentd/etc/fluent.conf --dry-run
2021-10-13 21:02:38 +0000 [info]: parsing config file is succeeded path="/fluentd/etc/fluent.conf"
2021-10-13 21:02:38 +0000 [info]: gem 'fluent-mixin-config-placeholders' version '0.4.0'
2021-10-13 21:02:38 +0000 [info]: gem 'fluent-plugin-aws-elasticsearch-service' version '2.4.1'
2021-10-13 21:02:38 +0000 [info]: gem 'fluent-plugin-azure-storage-append-blob' version '0.2.1'
2021-10-13 21:02:38 +0000 [info]: gem 'fluent-plugin-cloudwatch-logs' version '0.14.0'
2021-10-13 21:02:38 +0000 [info]: gem 'fluent-plugin-concat' version '2.5.0'
2021-10-13 21:02:38 +0000 [info]: gem 'fluent-plugin-datadog' version '0.13.0'
2021-10-13 21:02:38 +0000 [info]: gem 'fluent-plugin-dedot_filter' version '1.0.0'
2021-10-13 21:02:38 +0000 [info]: gem 'fluent-plugin-detect-exceptions' version '0.0.13'
2021-10-13 21:02:38 +0000 [info]: gem 'fluent-plugin-elasticsearch' version '5.1.0'
2021-10-13 21:02:38 +0000 [info]: gem 'fluent-plugin-enhance-k8s-metadata' version '2.0.0'
2021-10-13 21:02:38 +0000 [info]: gem 'fluent-plugin-gcs' version '0.4.0'
2021-10-13 21:02:38 +0000 [info]: gem 'fluent-plugin-gelf-hs' version '1.0.8'
2021-10-13 21:02:38 +0000 [info]: gem 'fluent-plugin-geoip' version '1.3.2'
2021-10-13 21:02:38 +0000 [info]: gem 'fluent-plugin-grafana-loki' version '1.2.16'
2021-10-13 21:02:38 +0000 [info]: gem 'fluent-plugin-kafka' version '0.17.0'
2021-10-13 21:02:38 +0000 [info]: gem 'fluent-plugin-kinesis' version '3.4.0'
2021-10-13 21:02:38 +0000 [info]: gem 'fluent-plugin-kubernetes-metadata-filter' version '2.5.3'
2021-10-13 21:02:38 +0000 [info]: gem 'fluent-plugin-kubernetes-sumologic' version '2.0.0'
2021-10-13 21:02:38 +0000 [info]: gem 'fluent-plugin-label-router' version '0.2.9'
2021-10-13 21:02:38 +0000 [info]: gem 'fluent-plugin-logdna' version '0.4.0'
2021-10-13 21:02:38 +0000 [info]: gem 'fluent-plugin-logzio' version '0.0.21'
2021-10-13 21:02:38 +0000 [info]: gem 'fluent-plugin-multi-format-parser' version '1.0.0'
2021-10-13 21:02:38 +0000 [info]: gem 'fluent-plugin-newrelic' version '1.2.1'
2021-10-13 21:02:38 +0000 [info]: gem 'fluent-plugin-oss' version '0.0.2'
2021-10-13 21:02:38 +0000 [info]: gem 'fluent-plugin-parser-logfmt' version '0.0.0'
2021-10-13 21:02:38 +0000 [info]: gem 'fluent-plugin-prometheus' version '2.0.2'
2021-10-13 21:02:38 +0000 [info]: gem 'fluent-plugin-record-modifier' version '2.1.0'
2021-10-13 21:02:38 +0000 [info]: gem 'fluent-plugin-redis' version '0.3.5'
2021-10-13 21:02:38 +0000 [info]: gem 'fluent-plugin-remote-syslog' version '1.1'
2021-10-13 21:02:38 +0000 [info]: gem 'fluent-plugin-rewrite-tag-filter' version '2.4.0'
2021-10-13 21:02:38 +0000 [info]: gem 'fluent-plugin-s3' version '1.6.1'
2021-10-13 21:02:38 +0000 [info]: gem 'fluent-plugin-splunk-hec' version '1.2.7'
2021-10-13 21:02:38 +0000 [info]: gem 'fluent-plugin-sqs' version '3.0.0'
2021-10-13 21:02:38 +0000 [info]: gem 'fluent-plugin-sumologic_output' version '1.7.2'
2021-10-13 21:02:38 +0000 [info]: gem 'fluent-plugin-syslog_rfc5424' version '0.9.0.rc.8'
2021-10-13 21:02:38 +0000 [info]: gem 'fluent-plugin-tag-normaliser' version '0.1.1'
2021-10-13 21:02:38 +0000 [info]: gem 'fluent-plugin-throttle' version '0.0.5'
2021-10-13 21:02:38 +0000 [info]: gem 'fluent-plugin-webhdfs' version '1.5.0'
2021-10-13 21:02:38 +0000 [info]: gem 'fluentd' version '1.13.3'
2021-10-13 21:02:38 +0000 [info]: gem 'fluentd' version '1.12.2'
2021-10-13 21:02:38 +0000 [info]: starting fluentd-1.13.3 as dry run mode ruby="2.7.4"
2021-10-13 21:02:38 +0000 [info]: [clusterflow:addon-logging-operator:default-flow:0] DeDot will recurse nested hashes and arrays
2021-10-13 21:02:43 +0000 [error]: config error file="/fluentd/etc/fluent.conf" error_class=Fluent::ConfigError error="Failed to create data stream: <logs-local-cluster> connect_write timeout reached"

Additional context: If you delete the configcheck pod, the check is retried and the Fluentd and Fluent Bit pods come up as expected, but manual intervention is not really desirable.

Environment details:

/kind bug

pepov commented 2 years ago

Thanks for creating this, it seems totally reasonable.

eduardoscheidet commented 2 years ago

I have the same problem.

aslafy-z commented 2 years ago

This looks like https://github.com/uken/fluent-plugin-elasticsearch/issues/935. Have you tried setting max_retry_putting_template to a higher value? This solution seems a bit fragile, but it might work.
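A minimal sketch of that suggestion, applied to the ClusterOutput from the report. It assumes that the operator's elasticsearch output exposes max_retry_putting_template and that the retry count also covers the data stream creation step that fails in the log above; neither is confirmed in this thread, and the value 30 is arbitrary.

```yaml
spec:
  elasticsearch:
    host: elasticsearch-master
    port: 9200
    scheme: http
    data_stream_enable: true
    data_stream_name: logs-local-cluster
    # Assumption: raises the plugin's retry count so the check can outlast
    # Elasticsearch startup. Quoted on the assumption that the CRD types this
    # field as a string; verify against the installed CRD.
    max_retry_putting_template: "30"
```

Whether this helps depends on how long Elasticsearch takes to become reachable relative to the plugin's retry interval, which is why it is fragile.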

I'm deploying Logging Operator and my output (Loki) at the same time and have no issues, as I use ArgoCD Waves to order the deployments. It might fit your needs too.
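As a rough illustration of the ordering approach, the annotation below places the ClusterOutput in a later sync wave. It assumes that Elasticsearch is managed by the same ArgoCD Application and stays in the default wave (0), so ArgoCD syncs it and waits for it to be healthy before applying wave 1. The wave number is illustrative.

```yaml
apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterOutput
metadata:
  name: es-output
  annotations:
    # Assumption: Elasticsearch resources are in an earlier wave (default "0")
    # of the same Application, so they are applied before this output.
    argocd.argoproj.io/sync-wave: "1"
# spec is unchanged from the ClusterOutput shown in the report
```

The ClusterFlow that references es-output would need the same annotation, so that the configcheck does not run before both resources are applied.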

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions!

pepov commented 1 year ago

Giving this another thought: trying to connect and write to an output during a dry run doesn't seem right to me. I don't think we can do anything about this other than perhaps making the configcheck pod optionally retryable, and I'm not sure that would be worth the effort.

Closing this for now, but feel free to reopen if you think we should discuss this further.