vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

An error occurred that Vector couldn't handle: failed to encode record: BufferTooSmall when using http server source, causing Vector to stop ingesting data #18346

Open · raghu999 opened this issue 1 year ago

raghu999 commented 1 year ago

Problem

We hit a similar issue when using the Splunk HEC source and raised bug report #17670. We then switched to the HTTP server source and are now seeing a buffer error with it that causes Vector to stop ingesting any new data. On Kubernetes the containers enter a restart loop with OOM errors, and on the Vector side we see the error below.

Kubernetes container status:

team-ingest-prod-0                0/1     OOMKilled          1          1m    19x.x.x.x    vm-ch2-c   <none>

Vector Error:

2023-08-10T11:48:04.553373Z  INFO vector::app: Log level is enabled. level="vector=info,codec=info,vrl=info,file_source=info,tower_limit=trace,rdkafka=info,buffers=info,lapin=info,kube=info"
2023-08-10T11:48:04.554149Z  INFO vector::app: Loading configs. paths=["/etc/vector"]
2023-08-10T11:48:06.390284Z  WARN vector::sinks::elasticsearch::common: DEPRECATION, use of deprecated option `endpoint`. Please use `endpoints` option instead.
2023-08-10T11:48:06.390423Z  WARN vector_core::tls::settings: The `verify_certificate` option is DISABLED, this may lead to security vulnerabilities.
2023-08-10T11:48:06.390436Z  WARN vector_core::tls::settings: The `verify_hostname` option is DISABLED, this may lead to security vulnerabilities.
2023-08-10T11:48:06.437918Z  WARN vector::sinks::elasticsearch::common: DEPRECATION, use of deprecated option `suppress_type_name`. Please use `api_version` option instead.
2023-08-10T11:48:07.994646Z  WARN vector::sinks::elasticsearch::common: DEPRECATION, use of deprecated option `endpoint`. Please use `endpoints` option instead.
2023-08-10T11:48:07.994726Z  WARN vector_core::tls::settings: The `verify_certificate` option is DISABLED, this may lead to security vulnerabilities.
2023-08-10T11:48:07.994738Z  WARN vector_core::tls::settings: The `verify_hostname` option is DISABLED, this may lead to security vulnerabilities.
2023-08-10T11:48:08.039273Z  WARN vector::sinks::elasticsearch::common: DEPRECATION, use of deprecated option `suppress_type_name`. Please use `api_version` option instead.
2023-08-10T11:48:08.107306Z  INFO vector::topology::runner: Running healthchecks.
2023-08-10T11:48:08.107388Z  INFO vector::topology::builder: Healthcheck passed.
2023-08-10T11:48:08.107538Z  INFO vector::topology::builder: Healthcheck disabled.
2023-08-10T11:48:08.107558Z  INFO source{component_kind="source" component_id=o11y_ingest_http component_type=http_server component_name=o11y_ingest_http}: vector::sources::util::http::prelude: Building HTTP server. address=0.0.0.0:8081
2023-08-10T11:48:08.107659Z  INFO vector: Vector has started. debug="false" version="0.31.0" arch="x86_64" revision="38c3f0b 2023-05-22 17:38:48.655488673"
2023-08-10T11:48:08.107867Z  INFO vector::sinks::prometheus::exporter: Building HTTP server. address=0.0.0.0:8181
2023-08-10T11:48:08.114700Z  INFO vector::internal_events::api: API server running. address=127.0.0.1:8686 playground=off
2023-08-10T11:48:08.122775Z  INFO vector::topology::builder: Healthcheck passed.
2023-08-10T11:48:08.125420Z  INFO vector::topology::builder: Healthcheck passed.
2023-08-10T11:48:43.659271Z ERROR transform{component_kind="transform" component_id=remap component_type=remap component_name=remap}: vector_buffers::topology::channel::sender: Disk buffer writer has encountered an unrecoverable error.
2023-08-10T11:48:43.789150Z ERROR transform{component_kind="transform" component_id=remap component_type=remap component_name=remap}: vector::topology: An error occurred that Vector couldn't handle: failed to encode record: BufferTooSmall.
2023-08-10T11:48:43.789228Z  INFO vector: Vector has stopped.
2023-08-10T11:48:43.801112Z  INFO vector::topology::runner: Shutting down... Waiting on running components. remaining_components="o11y_elastic_internal, metrics, vector, prom_exporter, o11y_ingest_http, o11y_maas_remote_write, internal_logs, remap_un_nest, o11y_ingest, o11y_elastic, remap, internal_metrics" time_remaining="59 seconds left"
2023-08-10T11:48:48.789690Z  INFO vector::topology::runner: Shutting down... Waiting on running components. remaining_components="o11y_elastic_internal, remap_un_nest" time_remaining="54 seconds left"
2023-08-10T11:48:53.790245Z  INFO vector::topology::runner: Shutting down... Waiting on running components. remaining_components="remap_un_nest" time_remaining="49 seconds left"

Configuration

data_dir = "/vector-data-dir"

[log_schema]
host_key = "host.hostname"
timestamp_key = "@timestamp"

[api]
enabled = true
address = "127.0.0.1:8686"
playground = false

[sources.internal_metrics]
type = "internal_metrics"

[sources.o11y_ingest]
address = "0.0.0.0:8080"
valid_tokens = [ "<token>" ]
type = "splunk_hec"

[sources.o11y_ingest_http]
address = "0.0.0.0:8081"
type = "http_server"
"framing.method" = "character_delimited"
"framing.character_delimited.delimiter" = ","
"framing.newline_delimited.max_length" = 8_192
"decoding.codec" = "json"
"encoding.codec" = "json"

[sources.internal_logs]
type = "internal_logs"

[transforms.remap_un_nest]
type = "remap"
inputs = [ "o11y_ingest", "o11y_ingest_http" ]
source = """
  .message = parse_json!(.message)
  . = unnest!(.message)"""

[transforms.remap]
type = "remap"
inputs = [ "remap_un_nest" ]
source = """
  . = merge!(., .message)
  parsed_json, err = parse_json(string!(.message))
  if err == null {
      ., err = merge(., parsed_json)
      del(.message)
  } else {
      del(.source_type)
      del(.message)
  }
  del(.source_type)
  del(.path)"""

[transforms.metrics]
type = "remap"
inputs = [ "internal_metrics" ]
source = """
.tags.job = "team-ingest-prod"
.tags.tenant_name = "team"
.tags.instance = "ingest-prod-ch2.example.com:443"
.tags.environment = "prod"
.tags.dc = "ch2"\
"""

[transforms.vector]
type = "remap"
inputs = [ "internal_logs" ]
source = """
    .vector = del(.)
    ."@timestamp" = del(.vector.timestamp)
    .message = del(.vector.message)
    .log = {}
    .log.logger = del(.vector.metadata.module_path)
    .log.level = downcase(string!(del(.vector.metadata.level)))
    .ecs = {}
    .ecs.version = "1.7"
    .cse = {}
    .cse.version = "1.0"
    .agent = {}
    .agent.type = "vector"
    .agent.version = "0.31.0"
    .event = {}
    .event.dataset = ["app"]
    .host = {}
    .host.hostname = get_env_var("ELK_HOSTNAME") ?? null
    .cloud = {}
    .cloud.provider = get_env_var("CLOUD_PROVIDER") ?? null
    .cloud.region = get_env_var("CLOUD_REGION") ?? null
    .tenant = {}
    .tenant.name = "team_prod"
    .service = {}
    .service.name = "team_prod"
    .host.o11y_hostname = get_hostname!()"""

[sinks.prom_exporter]
type = "prometheus_exporter"
inputs = [ "internal_metrics" ]
address = "0.0.0.0:8181"

[sinks.o11y_maas_remote_write]
type = "prometheus_remote_write"
inputs = [ "metrics" ]
default_namespace = "team"
endpoint = "https://<prometheus>.example.com:9090/api/v1/write"
tenant_id = "team_telemetry"

  [sinks.o11y_maas_remote_write.healthcheck]
  enabled = false

[sinks.o11y_elastic]
type = "elasticsearch"
inputs = [ "remap" ]
endpoint = "${ELK_URL}"
compression = "gzip"
suppress_type_name = true

  [sinks.o11y_elastic.bulk]
  index = "logz-team-app"
  action = "create"

  [sinks.o11y_elastic.auth]
  strategy = "basic"
  user = "${ELK_USERNAME}"
  password = "${ELK_PASSWORD}"

  [sinks.o11y_elastic.buffer]
  type = "disk"
  max_size = 900_000_000_000
  when_full = "drop_newest"

  [sinks.o11y_elastic.batch]
  max_bytes = 8_300_000
  timeout_secs = 8

  [sinks.o11y_elastic.request]
  concurrency = "adaptive"
  timeout_secs = 120

  [sinks.o11y_elastic.tls]
  verify_certificate = false
  verify_hostname = false

[sinks.o11y_elastic_internal]
type = "elasticsearch"
inputs = [ "vector" ]
compression = "gzip"
endpoint = "${ELK_URL}"
suppress_type_name = true

  [sinks.o11y_elastic_internal.bulk]
  index = "logz-forwarder"
  action = "create"

  [sinks.o11y_elastic_internal.auth]
  strategy = "basic"
  user = "${ELK_USERNAME}"
  password = "${ELK_PASSWORD}"

  [sinks.o11y_elastic_internal.buffer]
  type = "disk"
  max_size = 400_000_000
  when_full = "drop_newest"

  [sinks.o11y_elastic_internal.batch]
  max_bytes = 8_300_000
  timeout_secs = 8

  [sinks.o11y_elastic_internal.request]
  retry_attempts = 10

  [sinks.o11y_elastic_internal.tls]
  verify_certificate = false
  verify_hostname = false

Version

0.31.x

Debug Output

No response

Example Data

No response

Additional Context

Vector is running in Kubernetes, and this specific client sends large payloads with close to 6,000-8,000 fields across their entire dataset.

References

#17670: We faced a similar issue with the Splunk HEC source; we moved to the HTTP server source and are now seeing a new error.

jszwedko commented 1 year ago

Thanks for this report @raghu999! Did you share the right logs, though? The error logs there seem to be showing an error with the disk buffers. Do you have logs showing the panic in the http_server source?

raghu999 commented 1 year ago

@jszwedko Updated the issue. We are only seeing the buffer error, but it is the same dataset that produced the error with the Splunk HEC source; after switching to the HTTP server source we are now seeing a new error that is related to buffers. We confirmed that the disk buffer is not full when this issue happens.

jszwedko commented 1 year ago

Gotcha, thanks for clarifying @raghu999! It sounds more like a bug in the disk buffer implementation than in the http source, then.

tobz commented 1 year ago

This error comes from our limitations around the maximum record size allowed in a disk buffer, which is set to 128 MB. A record is essentially a chunk of events written all together, which, based on the sources used in the given configuration, would be all events decoded from a single request.

This would imply that, if the error above is being hit, a single request sent to either the splunk_hec or http_server source is very large, something like at least 90-100 MB: each decoded event will have some internal overhead, so the request doesn't need to be 128 MB by itself.

Based on your knowledge of the clients sending to your Vector instance, is it possible for requests to be that large (90-100MB)?
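
A minimal diagnostic sketch for checking this, assuming a TOML config like the one above; the transform name and the 10 MB threshold are illustrative, and length(encode_json(.)) only approximates a single event's size (the buffer's record encoding adds overhead, and a record covers all events from one request rather than one event):

[transforms.size_check]
type = "remap"
inputs = [ "o11y_ingest_http" ]
source = """
  # Estimate the serialized size of each event and warn about unusually large ones.
  approx_bytes = length(encode_json(.))
  if approx_bytes > 10000000 {
    log("oversized event: approximately " + to_string(approx_bytes) + " bytes", level: "warn")
  }
"""

Routing the affected source through a transform like this, ahead of the existing pipeline, should show whether individual requests decode into events anywhere near that size.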

maxemontio commented 1 year ago

Hi everyone! Facing the same issue here:

Sep 17 19:40:46 vector-2 vector[633402]: 2023-09-17T19:40:46.412419Z ERROR transform{component_kind="transform" component_id=parsing_transform_new component_type=remap component_name=parsing_transform_new}: vector_buffers::topology::channel::sender: Disk buffer writer has encountered an unrecoverable error.
Sep 17 19:40:46 vector-2 vector[633402]: 2023-09-17T19:40:46.414728Z ERROR transform{component_kind="transform" component_id=parsing_transform_new component_type=remap component_name=parsing_transform_new}: vector::topology: An error occurred that Vector couldn't handle: failed to encode record: BufferTooSmall.
Sep 17 19:40:46 vector-2 vector[633402]: 2023-09-17T19:40:46.415193Z  INFO vector: Vector has stopped.

This is the transform:

parsing_transform_new:
  type: remap
  inputs:
    - profile_route.parsing
  source: |-
    if (springboot_match, err = parse_regex(.message, r'^(?P<timestamp>\d+(?:-\d+){2}\s+\d+(?::\d+){2}\.\d+)\s+(?P<log_level>\S+)\s+(?P<pid>\d+)\s+---\s+\[(?P<thread_name>[\s+\S]*?)\]\s+(?P<logger_name>\S+)\s+:\s+(?s)(?P<message>.*)'); err == null) { # java_springboot
      .timestamp = parse_timestamp!(springboot_match.timestamp, "%F %T.%3f")
      .log_level = downcase!(springboot_match.log_level)
      .pid = to_int!(springboot_match.pid)
      .thread_name = strip_whitespace!(springboot_match.thread_name)
      .logger_name = springboot_match.logger_name
      .message = to_string(springboot_match.message)
    } else if (python_match, err = parse_regex(.message, r'^\[(?P<timestamp>\d+-\d+-\d+\s\d+:\d+:\d+,\d+)\s(?P<log_level>\w+)\s(?P<logger_name>\S+)\s(?P<thread_name>\w+\S+\s\S+)]\s(?s)(?P<message>.*)'); err == null) { # python
      .timestamp = parse_timestamp!(python_match.timestamp, "%F %T,%3f")
      .log_level  = downcase!(python_match.log_level)
      .logger_name = python_match.logger_name
      .thread_name = strip_whitespace!(python_match.thread_name)
      .message = to_string(python_match.message)
    } else if (vault_match, err = parse_regex(.message, r'^(?P<timestamp>\d+-\d+-\d+T\d+:\d+:\d+.\d+Z)\s\[(?P<log_level>\w+)]\s+(?s)(?P<message>.*)$'); err == null) { # vault
      .timestamp = vault_match.timestamp
      .log_level = downcase!(vault_match.log_level)
      .message = to_string(vault_match.message)
    } else {
      .malformed = true
    }

In my case it is unlikely for requests to be even close to 90-100 MB. Vector v0.32.0.

awangc commented 11 months ago

We're seeing this error again with a different application. Some relevant log lines:

vector_buffers::topology::channel::sender: Disk buffer writer has encountered an unrecoverable error.
2023-11-15T23:45:07.499790Z DEBUG transform{component_kind="transform" component_id=remap component_type=remap}: vector::topology::builder: Synchronous transform finished with an error.
2023-11-15T23:45:07.499819Z ERROR transform{component_kind="transform" component_id=remap component_type=remap}: vector::topology: An error occurred that Vector couldn't handle: failed to encode record: BufferTooSmall.
2023-11-15T23:45:07.499874Z  INFO vector: Vector has stopped.
2023-11-15T23:45:07.499934Z DEBUG vector_buffers::variants::disk_v2::writer: Writer marked as closed.
2023-11-15T23:45:07.500314Z DEBUG source{component_kind="source" component_id=internal_logs component_type=internal_logs}: vector::topology::builder: Source finished normally.

The source is http_server. The symptom is that memory utilization in the pod grows to the point of the pod being OOM-killed. But given the "Source finished normally" log line above, Vector might just have exited?

version we used: v0.34.0

A few questions:

  1. So the remap transform is also accessing the disk buffer while doing the transform?
  2. If transforms also access disk buffers, is there a configuration for the disk buffer size for transforms?
  3. When we removed the transform our pods were no longer OOM-killed, so is this not a record-size limitation but a transform limitation?
  4. Is there a way for Vector to print/obtain the event size inside a remap transform?
  5. What does "Synchronous transform finished with an error." mean? (What function call might trigger this error message? Or are all transforms synchronous, and what characterizes a "synchronous" transform?)

kefiras commented 8 months ago

I am seeing similar issues

2024-02-20T09:21:25.663269Z ERROR transform{component_kind="transform" component_id=kubernetes_application component_type=remap component_name=kubernetes_application}: vector::topology: An error occurred that Vector couldn't handle: failed to encode record: BufferTooSmall.
2024-02-20T09:21:25.663344Z  INFO vector: Vector has stopped.
2024-02-20T09:21:25.663365Z ERROR transform{component_kind="transform" component_id=route_logs component_type=route component_name=route_logs}: vector::topology: An error occurred that Vector couldn't handle: receiver disconnected.
2024-02-20T09:21:25.676755Z ERROR transform{component_kind="transform" component_id=containerlogs component_type=remap component_name=containerlogs}: vector::topology: An error occurred that Vector couldn't handle: receiver disconnected.
2024-02-20T09:21:25.680198Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="splunk_kube_audit, fluentbit, kubernetes_application, prometheus_exporter, splunk_container_logs, kubernetes_audit, httpd, vector_metrics, kubernetes_infrastructure, route_logs, journal_logs, containerlogs" time_remaining="59 seconds left"

The containers exit under a certain load, around 10k events/sec.

kefiras commented 1 week ago

@awangc have you managed to find the root cause or a workaround? I've seen similar issues while sending logs from upstream Vector pods that are part of OpenShift, where the default batch size is 10 MB, I believe. I've been testing by sending just generic nginx logs, so really small events. Under a certain load, Vector would exit with the same buffer-related error. I have observed that the issue does not happen when the batch size on the upstream Vector is set to 1 MB (ten times less).
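
For reference, a sketch of the kind of upstream batch cap described above, assuming the upstream instance forwards through a plain vector sink with the standard sink batch options (OpenShift's generated config may differ; the sink name, input, and address here are placeholders):

[sinks.to_downstream_vector]
type = "vector"
inputs = [ "nginx_logs" ]
address = "downstream-vector.example.com:6000"

  [sinks.to_downstream_vector.batch]
  # Cap batches at roughly 1 MB instead of the ~10 MB default mentioned above, so any
  # single batch stays well below the 128 MB disk-buffer record limit downstream.
  max_bytes = 1_000_000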

@jszwedko is there any doc on how transforms relate to disk buffers?

jszwedko commented 1 week ago

@jszwedko is there any doc on how transforms relate to disk buffers?

Transforms emit data to an output channel, which is then consumed by buffers. Is that the sort of detail you were looking for? Or do you have a different sort of question?
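
Concretely, buffer settings are exposed on sinks rather than on transforms, so the buffer a transform's output feeds into is the one configured on the downstream sink. A minimal sketch reusing the buffer table style from the configuration earlier in this issue (sink name, endpoint, and sizes are illustrative):

[sinks.example_es]
type = "elasticsearch"
inputs = [ "remap" ]
endpoints = [ "https://elasticsearch.example.com:9200" ]

  # This disk buffer sits between the remap transform's output and the sink; when a
  # record exceeds the 128 MB limit the write fails, which is why the error is
  # reported under the transform's span even though the buffer belongs to the sink.
  [sinks.example_es.buffer]
  type = "disk"
  max_size = 400_000_000
  when_full = "drop_newest"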

For this issue, I know it is a lot to ask, but if anyone is able to create a standalone reproduction, that would aid with debugging.