When using the current version (which doesn't send anything to Splunk for us), the messages are getting buffered, the fluentd queue becomes overwhelmed, and we get alerts.
Example:
Annotations
message = In the last minute, fluentd <redacted> buffer queue length increased more than 32. Current value is 216.
summary = Fluentd is overwhelmed
@DanaEHI Can you check to see if there are any logs in the OpenShift fluentd pods?
@sabre1041 Here are the errors in fluentd pods:
2021-03-14 15:41:53 +0000 [warn]: failed to flush the buffer. retry_time=2 next_retry_seconds=2021-03-14 15:41:56 +0000 chunk="5bd80e1b7f2090a7ae49c527c319bf59" error_class=Errno::ENOENT error="No such file or directory @ rb_sysopen - /var/run/ocp-collector/secrets/openshift-logforwarding-splunk/tls.crt"
2021-03-14 15:41:53 +0000 [warn]: /usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/plugin_helper/socket.rb:154:in `read'
2021-03-14 15:41:53 +0000 [warn]: /usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/plugin_helper/socket.rb:154:in `socket_create_tls'
2021-03-14 15:41:53 +0000 [warn]: /usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/plugin/out_forward.rb:352:in `create_transfer_socket'
2021-03-14 15:41:53 +0000 [warn]: /usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/plugin/out_forward/connection_manager.rb:46:in `call'
2021-03-14 15:41:53 +0000 [warn]: /usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/plugin/out_forward/connection_manager.rb:46:in `connect'
2021-03-14 15:41:53 +0000 [warn]: /usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/plugin/out_forward.rb:732:in `connect'
2021-03-14 15:41:53 +0000 [warn]: /usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/plugin/out_forward.rb:606:in `send_data'
2021-03-14 15:41:53 +0000 [warn]: /usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/plugin/out_forward.rb:336:in `block in write'
2021-03-14 15:41:53 +0000 [warn]: /usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/plugin/out_forward/load_balancer.rb:46:in `block in select_healthy_node'
2021-03-14 15:41:53 +0000 [warn]: /usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/plugin/out_forward/load_balancer.rb:37:in `times'
2021-03-14 15:41:53 +0000 [warn]: /usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/plugin/out_forward/load_balancer.rb:37:in `select_healthy_node'
2021-03-14 15:41:53 +0000 [warn]: /usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/plugin/out_forward.rb:336:in `write'
2021-03-14 15:41:53 +0000 [warn]: /usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/plugin/output.rb:1125:in `try_flush'
2021-03-14 15:41:53 +0000 [warn]: /usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/plugin/output.rb:1431:in `flush_thread_run'
2021-03-14 15:41:53 +0000 [warn]: /usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/plugin/output.rb:461:in `block (2 levels) in start'
2021-03-14 15:41:53 +0000 [warn]: /usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
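The ENOENT indicates fluentd is looking for a TLS certificate that is not mounted at /var/run/ocp-collector/secrets/openshift-logforwarding-splunk/tls.crt. A quick way to confirm (secret name and namespace taken from the error path above; substitute a real collector pod name):
# does the secret referenced in the error path exist?
oc -n openshift-logging get secret openshift-logforwarding-splunk
# is it actually mounted where fluentd expects it?
oc -n openshift-logging exec <fluentd-pod> -- ls /var/run/ocp-collector/secrets/openshift-logforwarding-splunk/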
Here is my values file:
openshift:
  logging:
    namespace: openshift-logging
    elasticsearch:
      port: 9200
  forwarding:
    audit:
      elasticsearch: false
      splunk: false
    app:
      elasticsearch: false
      splunk: true
    infra:
      elasticsearch: false
      splunk: true
forwarding:
  fluentd:
    port: 24224
    sharedkey: splunkforwarding
    passphrase: ""
    ssl: true
    caFile: files/default-openshift-logging-fluentd.crt
    keyFile: files/default-openshift-logging-fluentd.key
    loglevel: warn
    replicas: 2
    # Set to true when version <4.6
    scl: false
    persistence:
      enabled: false
      size: 5Gi
      ## If defined, storageClassName: <storageClass>
      ## If set to "-", storageClassName: "", which disables dynamic provisioning
      ## If undefined (the default) or set to null, no storageClassName spec is
      ## set, choosing the default provisioner. (gp2 on AWS, standard on
      ## GKE, AWS & OpenStack)
      ##
      # storageClass: "-"
      storageClass: ""
      accessMode: ReadWriteOnce
    image: registry.redhat.io/openshift4/ose-logging-fluentd:v4.6
    nodeSelector: {}
    tolerations: []
    affinity: {}
    resources:
      requests:
        cpu: 100m
        memory: 512Mi
      limits:
        cpu: 500m
        memory: 1024Mi
    updateStrategy:
      type: "RollingUpdate"
    buffer:
      "@type": memory
      chunk_limit_records: 100000
      chunk_limit_size: 200m
      flush_interval: 5s
      flush_thread_count: 1
      overflow_action: block
      retry_max_times: 3
      total_limit_size: 600m
      # Example configuration to support file based buffering
      # "@type": file
      # path: /var/log/fluentd/fluentd-buffers/buffer
      # flush_mode: interval
      # retry_type: exponential_backoff
      # flush_thread_count: 2
      # flush_interval: "5s"
      # retry_forever:
      # retry_max_interval: 30
      # chunk_limit_size: "200m"
      # total_limit_size: "600m"
      # chunk_limit_records: 100000
      # overflow_action: block
  splunk:
    # Specify Splunk HEC Token and Index
    token: ...
    index: ...
    protocol: http
    hostname: ...
    port: 80
    insecure: true
    sourcetype: openshift_logs
    source: openshift
    # Specify the CA Certificate for Splunk
    # caFile: "files/splunk-ca.crt"
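For reference, assuming the chart renders the buffer map above one-to-one into the Splunk output's match section (my reading of the chart; the generated ConfigMap should confirm), the effective fluentd buffer config would look roughly like:
<buffer>
  # in-memory buffering, as set in the values file above
  @type memory
  chunk_limit_records 100000
  chunk_limit_size 200m
  flush_interval 5s
  flush_thread_count 1
  overflow_action block
  retry_max_times 3
  total_limit_size 600m
</buffer>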
It was most likely caused by this commit, which was added to handle changes in 4.7.
I'll work on adding conditional logic to address <4.7 and will follow up with a resolution.
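Sketch only, not the actual change: the conditional could be driven by a values flag (hypothetically named ocpVersion here) that gates the 4.7-style secret reference in the ClusterLogForwarder template:
{{- /* hypothetical flag; only attach the output secret on OpenShift >= 4.7 */}}
{{- if semverCompare ">=4.7.0" .Values.ocpVersion }}
secret:
  name: openshift-logforwarding-splunk
{{- end }}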
Thank you @sabre1041 - we are seeing logs in Splunk now! However, they appear to be at the info level, even though I've set loglevel: warn in my values file. Is that the right place to filter the logs, or would I need to edit the ConfigMap and add another <filter **> section?
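For example, would adding a grep filter like this to the generated config be the way to drop them (assuming the forwarded records carry a level field)?
<filter **>
  @type grep
  <exclude>
    key level
    pattern /^(debug|info)$/
  </exclude>
</filter>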
@sabre1041 that configuration is for the logging of the forwarder itself, not for the messages that it is forwarding.
Thank you for the extra clarification! This is working with 4.6.12.
We are on OpenShift 4.6 and are currently using the November release of this chart (the newest version doesn't send anything to Splunk for us).
Setting the ConfigMap value
fluentd-loglevel: warn
still results in info-level messages being sent to Splunk. For example, I deleted the fluentd pods and the openshift-logforwarding-splunk-# pods, but the messages being sent are still at the info level (20k+ per minute).
I attempted to use the new "for 4.6" version, but nothing was sent to Splunk at all. I verified that my HEC token and index were correct in the values file, and that I had at least one category set to send to Splunk. After installing the previous version, the buffered logs (the ones collected while the "4.6" version was installed) were successfully sent to Splunk.
I don't see any messages in the openshift-logforwarding-splunk-# pods, and I verified that messages are still being sent to Kibana during the times when nothing is going to Splunk.
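For what it's worth, a direct test against the HEC endpoint (hostname, token, and index exactly as in the values file above) should help rule out the Splunk side:
curl "http://<hostname>:80/services/collector/event" \
  -H "Authorization: Splunk <token>" \
  -d '{"event": "test from openshift", "sourcetype": "openshift_logs", "index": "<index>"}'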