signalfx / splunk-otel-collector-chart

Splunk OpenTelemetry Collector for Kubernetes
Apache License 2.0

Error: cannot start pipelines: start stanza: read known files from database: illegal base64 data at input byte 2281 #745

Closed louise-zhang closed 1 year ago

louise-zhang commented 1 year ago

Hi Team,

I am getting the error message below on one particular OpenShift master node. Only the agent running on that node is in CrashLoopBackOff state; the other agents are working fine.

2023-04-19T04:27:38.753Z info service/service.go:168 Shutdown complete.
Error: cannot start pipelines: start stanza: read known files from database: illegal base64 data at input byte 2281
2023/04/19 04:27:38 main.go:115: application run finished with error: cannot start pipelines: start stanza: read known files from database: illegal base64 data at input byte 2281

Azure Red Hat OpenShift: v4.10.x
splunk-otel-collector: v0.70.0
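For context on this class of error: "illegal base64 data at input byte N" is the message produced by Go's encoding/base64 package when decoding hits a byte outside the base64 alphabet, meaning some stored entry is corrupted rather than merely stale. A minimal Python sketch of the same failure mode (the payload here is invented for illustration; the real file_storage database is a binary bbolt file, not plain text):

```python
import base64
import binascii

# Hypothetical payload standing in for a persisted known-files entry.
entry = base64.b64encode(b"/var/log/pods/ns_pod_uid/app/0.log").decode()

# Simulate on-disk corruption by overwriting one character with a byte
# that is outside the base64 alphabet.
corrupted = entry[:10] + "\x00" + entry[11:]

try:
    base64.b64decode(corrupted, validate=True)
except binascii.Error as exc:
    # Analogous to Go's "illegal base64 data at input byte N": decoding
    # aborts at the first invalid byte instead of returning partial data.
    print("decode failed:", exc)
```

The key point is that a single flipped byte in the checkpoint store is enough to make the whole read fail, which is why the collector refuses to start its pipelines.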

omrozowicz-splunk commented 1 year ago

Hey, this looks like an error returned by the filelog receiver when there is a problem decoding Kubernetes logs. Can you paste the config you currently use (without any sensitive data, obviously) so I can see how you are gathering logs from this OpenShift instance and which part is to blame?

louise-zhang commented 1 year ago

Thanks @omrozowicz-splunk, please find the config below:

exporters:
  splunk_hec/platform_logs:
    disable_compression: true
    endpoint: http://splunk-***.com:8000/services/collector
    index: aro_nonprod
    max_connections: 200
    profiling_data_enabled: false
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_elapsed_time: 300s
      max_interval: 30s
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000
    source: kubernetes
    splunk_app_name: splunk-otel-collector
    splunk_app_version: 0.70.0
    timeout: 10s
    tls:
      insecure_skip_verify: false
    token: ${SPLUNK_PLATFORM_HEC_TOKEN}
extensions:
  file_storage:
    directory: /var/addon/splunk/otel_pos
  health_check: null
  k8s_observer:
    auth_type: serviceAccount
    node: ${K8S_NODE_NAME}
  memory_ballast:
    size_mib: ${SPLUNK_BALLAST_SIZE_MIB}
  zpages: null
processors:
  batch: null
  filter/logs:
    logs:
      exclude:
        match_type: strict
        resource_attributes:
        - key: splunk.com/exclude
          value: "true"
  k8sattributes:
    extract:
      annotations:
      - from: pod
        key: splunk.com/sourcetype
      - from: namespace
        key: splunk.com/exclude
        tag_name: splunk.com/exclude
      - from: pod
        key: splunk.com/exclude
        tag_name: splunk.com/exclude
      - from: namespace
        key: splunk.com/index
        tag_name: com.splunk.index
      - from: pod
        key: splunk.com/index
        tag_name: com.splunk.index
      labels:
      - key: app
      metadata:
      - k8s.namespace.name
      - k8s.node.name
      - k8s.pod.name
      - k8s.pod.uid
      - container.id
      - container.image.name
      - container.image.tag
    filter:
      node_from_env_var: K8S_NODE_NAME
    pod_association:
    - sources:
      - from: resource_attribute
        name: k8s.pod.uid
    - sources:
      - from: resource_attribute
        name: k8s.pod.ip
    - sources:
      - from: resource_attribute
        name: ip
    - sources:
      - from: connection
    - sources:
      - from: resource_attribute
        name: host.name
  memory_limiter:
    check_interval: 2s
    limit_mib: ${SPLUNK_MEMORY_LIMIT_MIB}
  resource:
    attributes:
    - action: insert
      key: k8s.node.name
      value: ${K8S_NODE_NAME}
    - action: upsert
      key: k8s.cluster.name
      value: aro-nonprod
  resource/add_agent_k8s:
    attributes:
    - action: insert
      key: k8s.pod.name
      value: ${K8S_POD_NAME}
    - action: insert
      key: k8s.pod.uid
      value: ${K8S_POD_UID}
    - action: insert
      key: k8s.namespace.name
      value: ${K8S_NAMESPACE}
  resource/logs:
    attributes:
    - action: upsert
      from_attribute: k8s.pod.annotations.splunk.com/sourcetype
      key: com.splunk.sourcetype
    - action: delete
      key: k8s.pod.annotations.splunk.com/sourcetype
    - action: delete
      key: splunk.com/exclude
    - action: upsert
      from_attribute: k8s.container.name
      key: container_name
    - action: upsert
      from_attribute: k8s.cluster.name
      key: cluster_name
    - action: upsert
      from_attribute: container.id
      key: container_id
    - action: upsert
      from_attribute: k8s.pod.name
      key: pod
    - action: upsert
      from_attribute: k8s.pod.uid
      key: pod_uid
    - action: upsert
      from_attribute: k8s.namespace.name
      key: namespace
    - action: delete
      key: k8s.container.name
    - action: delete
      key: k8s.cluster.name
    - action: delete
      key: container.id
    - action: delete
      key: k8s.pod.name
    - action: delete
      key: k8s.pod.uid
    - action: delete
      key: k8s.namespace.name
  resourcedetection:
    detectors:
    - env
    - system
    override: true
    timeout: 10s
receivers:
  filelog:
    encoding: utf-8
    exclude:
    - /var/log/pods/openshift-logging_splunk-otel-collector*_*/otel-collector/*.log
    - /var/log/pods/openshift-azure-logging_*/*/*.log
    fingerprint_size: 1kb
    force_flush_period: "0"
    include:
    - /var/log/pods/*/*/*.log
    include_file_name: false
    include_file_path: true
    max_concurrent_files: 1024
    max_log_size: 1MiB
    operators:
    - id: get-format
      routes:
      - expr: body matches "^\\{"
        output: parser-docker
      - expr: body matches "^[^ Z]+ "
        output: parser-crio
      - expr: body matches "^[^ Z]+Z"
        output: parser-containerd
      type: router
    - id: parser-crio
      regex: ^(?P<time>[^ Z]+) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) ?(?P<log>.*)$
      timestamp:
        layout: "2006-01-02T15:04:05.999999999-07:00"
        layout_type: gotime
        parse_from: attributes.time
      type: regex_parser
    - combine_field: attributes.log
      combine_with: ""
      id: crio-recombine
      is_last_entry: attributes.logtag == 'F'
      output: handle_empty_log
      source_identifier: attributes["log.file.path"]
      type: recombine
    - id: parser-containerd
      regex: ^(?P<time>[^ ^Z]+Z) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) ?(?P<log>.*)$
      timestamp:
        layout: '%Y-%m-%dT%H:%M:%S.%LZ'
        parse_from: attributes.time
      type: regex_parser
    - combine_field: attributes.log
      combine_with: ""
      id: containerd-recombine
      is_last_entry: attributes.logtag == 'F'
      output: handle_empty_log
      source_identifier: attributes["log.file.path"]
      type: recombine
    - id: parser-docker
      timestamp:
        layout: '%Y-%m-%dT%H:%M:%S.%LZ'
        parse_from: attributes.time
      type: json_parser
    - combine_field: attributes.log
      combine_with: ""
      id: docker-recombine
      is_last_entry: attributes.log endsWith "\n"
      output: handle_empty_log
      source_identifier: attributes["log.file.path"]
      type: recombine
    - field: attributes.log
      id: handle_empty_log
      if: attributes.log == nil
      type: add
      value: ""
    - parse_from: attributes["log.file.path"]
      regex: ^\/var\/log\/pods\/(?P<namespace>[^_]+)_(?P<pod_name>[^_]+)_(?P<uid>[^\/]+)\/(?P<container_name>[^\._]+)\/(?P<restart_count>\d+)\.log$
      type: regex_parser
    - from: attributes.uid
      to: resource["k8s.pod.uid"]
      type: move
    - from: attributes.restart_count
      to: resource["k8s.container.restart_count"]
      type: move
    - from: attributes.container_name
      to: resource["k8s.container.name"]
      type: move
    - from: attributes.namespace
      to: resource["k8s.namespace.name"]
      type: move
    - from: attributes.pod_name
      to: resource["k8s.pod.name"]
      type: move
    - field: resource["com.splunk.sourcetype"]
      type: add
      value: EXPR("kube:container:"+resource["k8s.container.name"])
    - from: attributes.stream
      to: attributes["log.iostream"]
      type: move
    - from: attributes["log.file.path"]
      to: resource["com.splunk.source"]
      type: move
    - default: clean-up-log-record
      routes:
      - expr: (resource["k8s.namespace.name"]) == "openshift-operators" && (resource["k8s.pod.name"])
          matches "amq-streams-cluster-operator-.*" && (resource["k8s.container.name"])
          == "strimzi-cluster-operator"
        output: openshift-operators_amq-streams-cluster-operator-.*_strimzi-cluster-operator
      type: router
    - combine_field: attributes.log
      id: openshift-operators_amq-streams-cluster-operator-.*_strimzi-cluster-operator
      is_first_entry: (attributes.log) matches "^[^\\s].*"
      output: clean-up-log-record
      source_identifier: resource["com.splunk.source"]
      type: recombine
    - from: attributes.log
      id: clean-up-log-record
      to: body
      type: move
    poll_interval: 200ms
    start_at: beginning
    storage: file_storage
  fluentforward:
    endpoint: 0.0.0.0:8006
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus/agent:
    config:
      scrape_configs:
      - job_name: otel-agent
        scrape_interval: 10s
        static_configs:
        - targets:
          - ${K8S_POD_IP}:8889
service:
  extensions:
  - file_storage
  - health_check
  - k8s_observer
  - memory_ballast
  - zpages
  pipelines:
    logs:
      exporters:
      - splunk_hec/platform_logs
      processors:
      - memory_limiter
      - k8sattributes
      - filter/logs
      - batch
      - resource/logs
      - resourcedetection
      - resource
      receivers:
      - filelog
      - fluentforward
      - otlp
  telemetry:
    metrics:
      address: 0.0.0.0:8889
omrozowicz-splunk commented 1 year ago

Hey, thanks for the config. Did you try to reinstall the chart? Does this error happen every time? This is connected with the storage capability: it looks like the collector is struggling to decode data from the files under /var/addon/splunk/otel_pos. Reinstalling does not remove the old storage, so can we try running the agent with a different storage path? For example:

logsCollection:
  checkpointPath: /var/addon/splunk/otel_pos2

This is what you need to insert into your values file before upgrading the chart.

louise-zhang commented 1 year ago

Thanks @omrozowicz-splunk.

We have upgraded ARO cluster to 4.11.39, and upgraded the splunk-otel-collector to 0.76.0, but the issue still persists.

Following your recommendation, we updated the storage path, and the pod on that particular node is now up and running.

However, we would like to understand the root cause of the issue.

Also, we would like to confirm: should we manually remove that storage directory on the node and keep the existing configuration, or should we update the storage path (checkpointPath) to /var/addon/splunk/otel_pos2?

Thank you so much for the support so far.

omrozowicz-splunk commented 1 year ago

Hey, it seems like a checkpointing file on that node got corrupted.

Can you tell me if you have ever used SCK (Splunk Connect for Kubernetes) on this node before? In that case this could be something with the checkpoint translator from SCK. Another possibility is that the otel collector was forcefully killed while it was writing the file. Do you recall anything like that?

It might be one of those two, or some other issue in the collector's stanza component. Replacing checkpointPath fixed it because you switched from the corrupted file to a new one.

Also, we would like to confirm: should we manually remove that storage directory on the node and keep the existing configuration, or should we update the storage path (checkpointPath) to /var/addon/splunk/otel_pos2?

Right now your checkpoint path is /var/addon/splunk/otel_pos2, so only that one matters from the otel collector's perspective.
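To make the cleanup concrete, here is a hedged sketch (the path is the one from this thread; this is not chart-provided tooling, just an illustration of deleting the stale directory once the collector runs against the new path, e.g. from a privileged debug pod on the node):

```python
import shutil
from pathlib import Path

# Old, now-unused checkpoint directory from this thread.
OLD_CHECKPOINT_DIR = Path("/var/addon/splunk/otel_pos")

def remove_stale_checkpoints(path: Path) -> bool:
    """Delete the old checkpoint directory if it exists.

    Returns True when something was removed. Safe to run repeatedly,
    but only after the collector has been switched to the new path.
    """
    if path.is_dir():
        shutil.rmtree(path)
        return True
    return False

# Example invocation (run on the affected node, not inside the pod spec):
# remove_stale_checkpoints(OLD_CHECKPOINT_DIR)
```

Leaving the corrupted directory in place is harmless to the collector itself; removing it just reclaims disk and avoids confusion if someone later reverts checkpointPath.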

louise-zhang commented 1 year ago

Thanks @omrozowicz-splunk

Can you tell me if you've ever used SCK (Splunk Collector for Kubernetes) on this node before? As in this case this can be something with translator from SCK.

Yes, we used SCK on this node before.

Another reason would be if the otel collector was forcefully killed when it was writing the file, do you recall something like this?

I can't recall.

Right now your checkpoint path is /var/addon/splunk/otel_pos2, so only that one matters from the otel collector's perspective.

Cool, we will update the values file to use /var/addon/splunk/otel_pos2 as the checkpoint path for all the nodes.

atoulme commented 1 year ago

Closing this issue as done. Please open a support case for further help if required.