vectordotdev/vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

internal_telemetry source panics with out of bound when allocation tracing enabled #17294

Open · AndrewChubatiuk opened 1 year ago

AndrewChubatiuk commented 1 year ago

A note for the community

No response

Problem

Vector 0.29.1 panics with the stack trace below, but the pod still reports as healthy.

2023-04-28T11:44:42.460984Z  WARN sink{component_kind="sink" component_id=cloudwatch component_type=aws_cloudwatch_logs component_name=cloudwatch}: vector::sinks::util::retries: Retrying after error. error=CloudwatchError::Put: Error { code: "InternalFailure", request_id: "b81d6376-3a99-47bd-adea-aedfe25ec056" } internal_log_rate_limit=true
2023-04-28T18:18:27.743645Z  WARN sink{component_kind="sink" component_id=cloudwatch component_type=aws_cloudwatch_logs component_name=cloudwatch}: vector::sinks::util::retries: Retrying after error. error=CloudwatchError::Put: TransientError: connection closed before message completed internal_log_rate_limit=true
thread 'vector-worker' panicked at 'index out of bounds: the len is 128 but the index is 249', src/internal_telemetry/allocations/mod.rs:106:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
2023-04-28T19:15:43.564063Z ERROR source{component_kind="source" component_id=internal_metrics component_type=internal_metrics component_name=internal_metrics}: vector::topology: An error occurred that Vector couldn't handle: the task panicked and was aborted.

Configuration

data_dir: /vector-data-dir
expire_metrics_secs: 300
api:
  enabled: false
  address: 0.0.0.0:8686
  playground: true
log_schema:
  host_key: host
  message_key: message
  source_type_key: source_type
  timestamp_key: timestamp
sources:
  kubernetes_logs:
    type: kubernetes_logs
    glob_minimum_cooldown_ms: 5000
  internal_metrics:
    type: internal_metrics
    scrape_interval_secs: 30
transforms:
  kubernetes_logs_remap_json:
    inputs:
      - kubernetes_logs
    type: remap
    source: |-
      # Prepare context object
      context = .

      if exists(.kubernetes.pod_labels."project.log-group") {
        context.log_group = .kubernetes.pod_labels."project.log-group"
      } else if exists(.kubernetes.pod_labels."eventsource-name") {
        context.log_group, _ = "eventsource-" + .kubernetes.pod_labels."eventsource-name"
      } else if exists(.kubernetes.pod_labels."eventbus-name") {
        context.log_group, _ = "eventbus-" + .kubernetes.pod_labels."eventbus-name"
      } else if exists(.kubernetes.pod_labels."sensor-name") {
        context.log_group, _ = "sensor-" + .kubernetes.pod_labels."sensor-name"
      } else {
        context.log_group = .kubernetes.container_name
      }

      del(context.message)

      # Parse JSON if valid
      structured, err = parse_json(.message)
      log(.message, level: "debug")
      if err != null {
        log(err, level: "debug")
        log(.message, level: "debug")
        structured.message = .message
      }
      structured.context = context

      # Add mandatory fields
      if !exists(structured.service) {
        log("Empty service, fallback to pod name", level: "debug")
        structured.service = .kubernetes.pod_name
      }
      if !exists(structured.buildVersion) {
        log("Empty buildVersion, fallback to container image", level: "debug")
        structured.buildVersion = .kubernetes.container_image
      }
      if !exists(structured.host) {
        log("Empty host, fallback to pod node name", level: "debug")
        structured.host = .kubernetes.pod_node_name
      }
      if !exists(structured.level) {
        log("Empty level, set to INFO", level: "debug")
        structured.level = "INFO"
      }
      if !exists(structured.timestamp) {
        log("Empty timestamp, set to now", level: "debug")
        structured.timestamp = now()
      }

      . = structured
  log_group_metric:
    inputs:
      - kubernetes_logs_remap_json
    type: log_to_metric
    metrics:
      - type: set
        field: context.log_group
        name: vector_log_group_info
        tags:
          log_group: 'project/{{ $.Values.global.cluster }}/{{ `{{ context.kubernetes.pod_namespace }}/{{ context.log_group }}` }}'
          pod_name: '{{ printf "{{ context.kubernetes.pod_name }}" }}'
          pod_namespace: '{{ printf "{{ context.kubernetes.pod_namespace }}" }}'
sinks:
  cloudwatch:
    buffer:
      type: disk
      max_size: 1000000000
    type: aws_cloudwatch_logs
    inputs:
      - kubernetes_logs_remap_json
    create_missing_group: true
    create_missing_stream: true
    group_name: 'project/{{ $.Values.global.cluster }}/{{ `{{ context.kubernetes.pod_namespace }}/{{ context.log_group }}` }}'
    compression: none
    region: '{{ $.Values.global.region }}'
    stream_name: '{{ printf "{{ context.kubernetes.pod_name }}" }}'
    encoding:
      codec: json
      timestamp_format: rfc3339
    healthcheck:
      enabled: false
  prometheus_sink:
    address: 0.0.0.0:9090
    inputs:
      - internal_metrics
      - log_group_metric
    type: prometheus_exporter

Version

0.29.1

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

spencergilbert commented 1 year ago

Hey @AndrewChubatiuk, does this happen immediately when Vector starts, or does it take some time to surface?

AndrewChubatiuk commented 1 year ago

It appears suddenly and not on all nodes; I haven't found any correlation yet.

spencergilbert commented 1 year ago

Thanks, I'll see what I can do to reproduce this.

zamazan4ik commented 1 year ago

@AndrewChubatiuk Are you able to collect a full backtrace from Vector with the RUST_BACKTRACE=1 environment variable? It would help with reproducing the issue. Thanks in advance.

zamazan4ik commented 1 year ago

I guess the problem is here: https://github.com/vectordotdev/vector/blob/master/src/internal_telemetry/allocations/mod.rs#L54

A u8 is used to track source_group_id, so its maximum value (255) is larger than the 128-entry table it indexes. We probably need to cap source_group_id or allocate a larger table.
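
For illustration, here is a minimal, hypothetical Rust sketch (not the actual mod.rs code) of how a fixed-length table indexed directly by a u8 group id produces exactly this panic, and one way the access could be guarded:

// Hypothetical, simplified illustration only -- not the actual Vector code.
const NUM_GROUPS: usize = 128;

#[derive(Clone, Copy, Default)]
struct GroupStats {
    allocations: u64,
}

fn main() {
    // Fixed-size per-group statistics table, one slot per allocation group.
    let mut groups = [GroupStats::default(); NUM_GROUPS];

    // Group ids are handed out as components register. Nothing keeps them
    // below NUM_GROUPS, so an id like 249 indexes past the end of the table:
    // "index out of bounds: the len is 128 but the index is 249".
    let group_id: u8 = 249;
    // groups[group_id as usize].allocations += 1; // panics as in the report

    // One possible mitigation: bounds-check the id (or size the table to
    // u8::MAX as usize + 1 entries) instead of indexing unconditionally.
    match groups.get_mut(group_id as usize) {
        Some(stats) => stats.allocations += 1,
        None => eprintln!("group id {group_id} exceeds {NUM_GROUPS} slots; dropping sample"),
    }
}

Whether the real fix should cap the ids, grow the table, or drop out-of-range samples is a design choice for the maintainers; the sketch only shows the failure mode.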

spencergilbert commented 1 year ago

I haven't specifically started looking into this, but I did see similar messages when shutting down Vector while reproducing another report.

AndrewChubatiuk commented 1 year ago

Another problem related to this issue is that Vector's health endpoint keeps responding with HTTP 200.

spencergilbert commented 1 year ago

> Another problem related to this issue is that Vector's health endpoint keeps responding with HTTP 200.

That's tracked by https://github.com/vectordotdev/vector/issues/4250; the check is very naive today.

spencergilbert commented 1 year ago

After a quick review, I agree with @zamazan4ik's thoughts. Happy to take a contribution if someone's interested!

jszwedko commented 1 year ago

This seems to be the same as https://github.com/vectordotdev/vector/issues/16028, which we thought was fixed, but apparently there is still a case that hits it.

fpytloun commented 3 months ago

This issue still happens in 0.38.0:

thread 'vector-worker' panicked at src/internal_telemetry/allocations/mod.rs:111:13:
index out of bounds: the len is 128 but the index is 192
stack backtrace:
   0: rust_begin_unwind
             at ./rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/std/src/panicking.rs:647:5
   1: core::panicking::panic_fmt
             at ./rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/core/src/panicking.rs:72:14
   2: core::panicking::panic_bounds_check
             at ./rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/core/src/panicking.rs:208:5
   3: <vector::internal_telemetry::allocations::allocator::tracing_allocator::GroupedTraceableAllocator<A,T> as core::alloc::global::GlobalAlloc>::dealloc
   4: core::ptr::drop_in_place<metrics::key::Key>
   5: core::ops::function::FnOnce::call_once
   6: vector_core::metrics::Controller::capture_metrics
   7: vector::sources::internal_metrics::InternalMetrics::run::{{closure}}
   8: vector::topology::builder::Builder::build_sources::{{closure}}::{{closure}}
   9: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
  10: tokio::runtime::task::raw::poll
  11: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
  12: tokio::runtime::task::raw::poll
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
2024-05-22T09:06:09.443337Z ERROR source{component_kind="source" component_id=internal_metrics component_type=internal_metrics}: vector::topology: An error occurred that Vector couldn't handle: the task panicked and was aborted.