vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Vector stopped sending logs to Datadog until restarted #5140

Closed · mikehardenize · closed 2 years ago

mikehardenize commented 3 years ago

Vector Version

# rpm -qa|grep vector
vector-0.10.0-1.x86_64
# vector --version
vector 0.10.0 (g0f0311a x86_64-unknown-linux-gnu 2020-07-22)

Vector Configuration File

[sources.netscan_source]
  type         = "file"
  include      = ["/opt/nomad/data/alloc/*/alloc/logs/netscan.stdout.*"]
  oldest_first = true
  ignore_older = 86400
  host_key     = "hostname"

  [sources.netscan_source.multiline]
    start_pattern     = "^[^\\s]"
    condition_pattern = "^\\s"
    mode              = "continue_through"
    timeout_ms        = 1000

[transforms.netscan_transform]
  type            = "add_fields"
  inputs          = ["netscan_source"]
  fields.hostname = "netscan"
  fields.service  = "netscan"
  fields.ddsource = "vector"
  fields.ddtags   = "netscan,location:us-east-1a"

[sinks.datadog_sink]
  type           = "datadog_logs"
  inputs         = ["netscan_transform"]
  api_key        = "*OBFUSCATED FOR GITHUB ISSUE*"
  encoding.codec = "json"

  [sinks.datadog_sink.tls]
    enabled = true
    ca_path = "/etc/ssl/certs/ca-bundle.crt"
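For readers unfamiliar with the multiline settings above, the following is a minimal sketch of how `mode = "continue_through"` groups lines with these two patterns, based on the documented semantics: a line matching `start_pattern` opens a message, and consecutive lines matching `condition_pattern` are appended to it. It is a simplified model, not Vector's implementation: the function name `group_continue_through` is hypothetical, `timeout_ms` is not modelled (the open message is flushed at end of input instead), and emitting lines that match neither pattern on their own is an assumption.

```python
import re
from typing import Iterable, List

# Patterns from the config; the TOML strings "^[^\\s]" and "^\\s" decode to these regexes.
START = re.compile(r"^[^\s]")    # start_pattern: line begins with non-whitespace
CONDITION = re.compile(r"^\s")   # condition_pattern: line begins with whitespace


def group_continue_through(lines: Iterable[str]) -> List[str]:
    """Hypothetical model of multiline mode = "continue_through" (ignores timeout_ms)."""
    messages: List[str] = []
    current: List[str] = []
    for line in lines:
        if current and CONDITION.search(line):
            # Continuation line: append it to the open message.
            current.append(line)
            continue
        # Any other line closes the open message, if there is one.
        if current:
            messages.append("\n".join(current))
            current = []
        if START.search(line):
            # A line matching start_pattern opens a new message.
            current = [line]
        else:
            # Assumption: a line matching neither pattern is emitted on its own.
            messages.append(line)
    if current:
        # Vector would flush this after timeout_ms; here it is flushed at end of input.
        messages.append("\n".join(current))
    return messages


if __name__ == "__main__":
    sample = ["ERROR boom", "  at frame1", "  at frame2", "INFO ok"]
    print(group_continue_through(sample))
```

With this model, the indented stack-trace lines are folded into the preceding `ERROR` line, and `INFO ok` becomes a separate message.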

Expected Behavior

Vector should continuously send logs to Datadog.

Actual Behavior

Vector was sending logs to Datadog, then stopped for several days until I restarted it.

Example Data

In today's Vector logs (20 November), the last two entries are:

Nov 16 12:45:09.866  WARN source{name=netscan_source type=file}:file_server: file_source::file_watcher: Found line that exceeds max_line_bytes; discarding. rate_limit_secs=30
Nov 17 00:20:14.402 ERROR connection{host=intake.logs.datadoghq.com port=10516}: vector::internal_events::tcp: connection disconnected. error=Connection reset by peer (os error 104)

Restarting Vector took a long time (about a minute, until the shutdown deadline expired), and the following logs were added:

Nov 20 10:04:09.993  INFO vector: Shutting down.
Nov 20 10:04:14.996  INFO vector::topology: Shutting down... Waiting on: netscan_transform, datadog_sink, netscan_source. 54 seconds left
Nov 20 10:04:19.996  INFO vector::topology: Shutting down... Waiting on: netscan_transform, datadog_sink, netscan_source. 49 seconds left
Nov 20 10:04:24.995  INFO vector::topology: Shutting down... Waiting on: netscan_transform, datadog_sink, netscan_source. 44 seconds left
Nov 20 10:04:29.995  INFO vector::topology: Shutting down... Waiting on: netscan_transform, datadog_sink, netscan_source. 39 seconds left
Nov 20 10:04:34.995  INFO vector::topology: Shutting down... Waiting on: netscan_transform, datadog_sink, netscan_source. 34 seconds left
Nov 20 10:04:39.995  INFO vector::topology: Shutting down... Waiting on: netscan_transform, datadog_sink, netscan_source. 29 seconds left
Nov 20 10:04:44.995  INFO vector::topology: Shutting down... Waiting on: netscan_transform, datadog_sink, netscan_source. 24 seconds left
Nov 20 10:04:49.995  INFO vector::topology: Shutting down... Waiting on: netscan_transform, datadog_sink, netscan_source. 19 seconds left
Nov 20 10:04:54.995  INFO vector::topology: Shutting down... Waiting on: netscan_transform, datadog_sink, netscan_source. 14 seconds left
Nov 20 10:04:59.995  INFO vector::topology: Shutting down... Waiting on: netscan_transform, datadog_sink, netscan_source. 9 seconds left
Nov 20 10:05:04.995  INFO vector::topology: Shutting down... Waiting on: netscan_transform, datadog_sink, netscan_source. 4 seconds left
Nov 20 10:05:09.995 ERROR vector::shutdown: Source 'netscan_source' failed to shutdown before deadline. Forcing shutdown.
Nov 20 10:05:09.996 ERROR vector::topology: Failed to gracefully shut down in time. Killing: netscan_transform, datadog_sink, netscan_source

After the restart, Vector read the last 24 hours of logs, sent them to Datadog immediately, and then continued working normally.

ktff commented 3 years ago

@mikehardenize this sounds a lot like issue #3928, whose fix will ship in v0.11. Could you check with the latest nightly whether the issue is still present?

mikehardenize commented 3 years ago

I can't really upgrade this system to a nightly build, though that issue does look highly relevant. I'm happy to assume this is the same issue and switch to v0.11 when it is released.

jszwedko commented 2 years ago

Closing as duplicate of https://github.com/vectordotdev/vector/issues/3928