vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

`vector.component_errors_total` metric should be consistently non-monotonic (or vice versa) #18845

Open marcus-crane opened 1 year ago

marcus-crane commented 1 year ago


Problem

Hi there,

I'm currently in the process of migrating our production logging pipeline over to Vector and as part of that, I've been setting up some dashboards.

While doing so, I noticed that the vector.component_errors_total metric appears to be inconsistent: it behaves monotonically (or not) depending on how you filter it.

To be clear, within Datadog (which we use), the metric comes through as a count, and count metrics are not required to be monotonic.

As an example, here is the aforementioned metric in Datadog, showing errors for the docker_logs source:

[Screenshot: Datadog graph of vector.component_errors_total for the docker_logs source (2023-10-16)]

The metric, filtered in this way, acts monotonically.

This is in contrast to the socket sink which we are currently using:

[Screenshot: Datadog graph of vector.component_errors_total for the socket sink (2023-10-16)]

Here, the metric does not act monotonically.

This actually tripped me up initially: I had set up some graphs for the socket sink first, so my mental model was that these are not monotonic metrics, and I then read the docker_logs metric as saying the Docker daemon was on fire.

It is possible to get a more accurate view by applying the monotonic_diff function to a monotonic metric (like the docker_logs errors), but that function is not usable in most of the handy non-timeseries views, such as Query Value.

Ideally, I think I would prefer that all metrics were consistently non-monotonic, but mainly I'm just curious whether this has come up before, as I haven't found anything through searching.
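
For reference, the monotonic_diff workaround I mean looks roughly like this as a Datadog query (the exact form is from memory and the tag filter is just an example; component_id is one of the tags Vector attaches to its internal metrics):

monotonic_diff(sum:vector.component_errors_total{component_id:docker_logs})

That works in a timeseries widget, but as noted above it isn't available in views like Query Value.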

Configuration

# Global options

# Using @timestamp has some special properties in Logstash, where deserialising JSON will automatically use
# it as a field. There is also an issue with the `date` filter for Logstash where it will truncate timestamps
# beyond a certain level of resolution because it uses JodaTime under the hood
# See https://github.com/elastic/logstash/issues/10822#issuecomment-954149331 for more information.
[log_schema]
timestamp_key = "@timestamp"

# Pipeline options

[sources.docker_logs]
type = "docker_logs"

[sources.internal_metrics]
type = "internal_metrics"
scrape_interval_secs = 10 # Same as our Datadog agents

[transforms.ec2_metadata]
type = "aws_ec2_metadata"
inputs = [ "docker_logs" ]
namespace = "ec2"
fields = [ "availability-zone" ]
refresh_interval_secs = 86400 # 1 day -- The data we care about here will never change post-launch of an instance

[transforms.rename_fields]
type = "remap"
inputs = [ "ec2_metadata" ]
source = """
# We just want to do the bare minimum here to get some nice benefits while deferring the choice of field names
# to Logstash so that we can do bulk changes without redeploying Vector with a new config on all hosts.
# It's worth noting that if `transforms.ec2_metadata` fails, this pipeline will continue without issue.

# Docker Metadata
."docker.id" = del(.container_id)
."docker.image" = del(.image)
."docker.launched_at" = del(.container_created_at)
."docker.name" = del(.container_name)

# EC2 Metadata -- We rename here because, unlike labels, this transform can't dynamically add more fields without a redeployment
# It's possible the metadata transform fails so we check for fields existing
if exists(.ec2) {
  ."ec2.availability_zone" = del(.ec2."availability-zone")
}
."ec2.host" = del(.host)

if exists(.label) {
  ."docker.built_at" = del(.label."org.opencontainers.image.created")
  ."ecs.cluster" = del(.label."com.amazonaws.ecs.cluster")
  ."ecs.family" = del(.label."com.amazonaws.ecs.task-definition-family")
  ."ecs.task.arn" = del(.label."com.amazonaws.ecs.task-arn")
  ."ecs.task.name" = del(.label."com.amazonaws.ecs.container-name")
  ."ecs.task.version" = del(.label."com.amazonaws.ecs.task-definition-version")
}

if exists(.label."org.opencontainers.image.url") && exists(.label."org.opencontainers.image.version") {
  commit, err = join([.label."org.opencontainers.image.url", "/commit/", .label."org.opencontainers.image.version"])
  if err == null {
    ."git.commit" = commit
  }
}

."log.stream" = del(.stream)
."source.type" = del(.source_type)

del(.ec2)
del(.label)
"""

[sinks.log_forwarder]
type = "socket"
inputs = [ "rename_fields" ]
address = "<load balancer>"
mode = "tcp"

[sinks.log_forwarder.buffer]
type = "disk"
max_size = 10000000000 # 10 GB | Lowest that any host's free disk space dropped for the month of Sept 2023 was 14GB

[sinks.log_forwarder.healthcheck]
enabled = false

[sinks.log_forwarder.keepalive]
time_secs = 10

[sinks.log_forwarder.tls]
enabled = true
ca_file = "/etc/ssl/ca/ca-cert.pem"

[sinks.log_forwarder.encoding]
codec = "json"

# Here we collect metrics and report them to Datadog but we need to ensure that we remove some high cardinality tags
# as this configuration will be rolled out to all ECS hosts.
# To get an example metric, you can run `vector tap | grep internal_metrics` to see the format
# Note that we don't use quoted strings for field references here because, internally, metrics are actual JSON blobs
# and not the flat (i.e. `blah.bleh`) style of event that we use for our logs
[transforms.trim_metrics]
type = "remap"
inputs = [ "internal_metrics" ]
source = """
# Emitted by the following metrics:
# - vector.component_errors_total
del(.tags.container_id)

# Emitted by the following metrics:
# - vector.component_received_event_bytes_total
# - vector.component_received_events_total
del(.tags.container_name)
"""

[sinks.datadog_metrics]
type = "datadog_metrics"
inputs = [ "trim_metrics" ]
default_api_key = "<key>"
default_namespace = "vector"

[sinks.datadog_metrics.batch]
timeout_secs = 20

[sinks.datadog_metrics.healthcheck]
enabled = false

[api]
enabled = true

Version

vector 0.33.0 (x86_64-unknown-linux-gnu 89605fb 2023-09-27 14:18:24.180809939)

Debug Output

N/A

Example Data

N/A

Additional Context

We run Vector on Ubuntu AWS EC2 hosts where it receives logs from a Docker Daemon configured to run AWS ECS workloads.

References

No response

neuronull commented 1 year ago

Hi and thanks for the bug report,

Hmm, counter metrics emitted from Vector should always be monotonic... The only reason I'm aware of for them to go backwards would be if the process is restarted and the counters are thus reset to zero. Can you confirm that, in those cases with the socket sink, Vector was running continuously and didn't restart?
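
One quick way to check, assuming a systemd install like the one described here, is to grep the journal for Vector's startup log line (quoting the message from memory); if it only appears once over the window in question, the process didn't restart:

$ journalctl -u vector | grep "Vector has started"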

nabokihms commented 1 year ago

It can also happen because of metric expiration, but that doesn't seem to be the case for this issue either: https://github.com/vectordotdev/vector/discussions/16322#discussioncomment-4936957
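
(For reference, that expiration is controlled by the global expire_metrics_secs option, which as far as I know is unset by default, so it shouldn't apply unless it was configured explicitly. A rough sketch of enabling it, with an arbitrary value:

# Global options -- illustrative only
expire_metrics_secs = 300 # forget internal metrics that haven't been updated for 5 minutes

A counter that expires and is later re-emitted would also look like it reset to zero.)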

marcus-crane commented 1 year ago

> Hi and thanks for the bug report,
>
> Hmm, counter metrics emitted from Vector should always be monotonic... The only reason I'm aware of for them to go backwards would be if the process is restarted and the counters are thus reset to zero. Can you confirm that, in those cases with the socket sink, Vector was running continuously and didn't restart?

Hmm, I had a look at a random host and there doesn't seem to be any sign of Vector being restarted or of the host being killed.

[Screenshot: host monitoring showing Vector running continuously with no restarts (2023-10-23)]

I also had a brief look across our Vector instances on other hosts, first at source errors and then at sink errors. Nothing scientific (I'll have a closer look tomorrow, as it's a public holiday here), but I wonder whether, more broadly, sources are monotonic while sinks aren't.

That said, we don't currently have any other sources or sinks that emit errors, so I don't have any evidence for that, but I'll see if I can generate some errors locally tomorrow and test whether it holds up. See the rough repro sketch below for what I have in mind.
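
Something like this minimal config is roughly what I mean (component names and the dead address are just placeholders; the socket sink points at a port with nothing listening so sends should error, and internal metrics get printed to the console for inspection):

[sources.demo]
type = "demo_logs"
format = "json"

[sources.internal_metrics]
type = "internal_metrics"

[sinks.broken_socket]
type = "socket"
inputs = [ "demo" ]
address = "127.0.0.1:9999" # nothing listening here
mode = "tcp"
encoding.codec = "json"

[sinks.console_metrics]
type = "console"
inputs = [ "internal_metrics" ]
encoding.codec = "json"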

dsmith3197 commented 1 year ago

Hi @marcus-crane,

Could you share the corresponding error logs coming from the Docker Logs source? This will help determine what is going wrong here.

marcus-crane commented 1 year ago

Hey @dsmith3197,

Sure, here are some example graphs from earlier today:

[Screenshots: Datadog graphs of the error metrics from earlier today (2023-11-08)]

and here are some logs:

$ journalctl -u vector | grep sink | less
Nov 08 01:01:01 ecs-08d1a609999f93abf vector[14201]: 2023-11-08T01:01:01.319567Z ERROR sink{component_kind="sink" component_id=log_forwarder component_type=socket component_name=log_forwarder}: vector::internal_events::socket: Error sending data. error=Connection reset by peer (os error 104) error_code="socket_send" error_type="writer_failed" stage="sending" mode=tcp internal_log_rate_limit=true
Nov 08 01:01:01 ecs-08d1a609999f93abf vector[14201]: 2023-11-08T01:01:01.319605Z ERROR sink{component_kind="sink" component_id=log_forwarder component_type=socket component_name=log_forwarder}: vector_common::internal_event::component_events_dropped: Events dropped intentional=false count=1 reason="Error sending data." internal_log_rate_limit=true
Nov 08 01:01:01 ecs-08d1a609999f93abf vector[14201]: 2023-11-08T01:01:01.325213Z ERROR sink{component_kind="sink" component_id=log_forwarder component_type=socket component_name=log_forwarder}: vector::internal_events::common: Internal log [Unable to connect.] has been suppressed 2 times.
Nov 08 01:01:01 ecs-08d1a609999f93abf vector[14201]: 2023-11-08T01:01:01.325230Z ERROR sink{component_kind="sink" component_id=log_forwarder component_type=socket component_name=log_forwarder}: vector::internal_events::common: Unable to connect. error=Connect error: Connection refused (os error 111) error_code="failed_connecting" error_type="connection_failed" stage="sending" internal_log_rate_limit=true
Nov 08 01:01:01 ecs-08d1a609999f93abf vector[14201]: 2023-11-08T01:01:01.887022Z ERROR sink{component_kind="sink" component_id=log_forwarder component_type=socket component_name=log_forwarder}: vector::internal_events::socket: Internal log [Error sending data.] is being suppressed to avoid flooding.
Nov 08 01:01:01 ecs-08d1a609999f93abf vector[14201]: 2023-11-08T01:01:01.887052Z ERROR sink{component_kind="sink" component_id=log_forwarder component_type=socket component_name=log_forwarder}: vector_common::internal_event::component_events_dropped: Internal log [Events dropped] is being suppressed to avoid flooding.
Nov 08 01:01:01 ecs-08d1a609999f93abf vector[14201]: 2023-11-08T01:01:01.889805Z ERROR sink{component_kind="sink" component_id=log_forwarder component_type=socket component_name=log_forwarder}: vector::internal_events::common: Internal log [Unable to connect.] is being suppressed to avoid flooding.

and just to confirm, Vector has been running the whole time as far as systemd is concerned:

$ systemctl status vector
● vector.service - Vector
     Loaded: loaded (/lib/systemd/system/vector.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2023-10-18 21:17:47 UTC; 2 weeks 6 days ago

As for the connection errors themselves, they're legitimate and come about when an instance behind the load balancer becomes unhealthy at certain times.

Now that I think about it, I suppose it's just the case that when the socket connection is reset, that causes the metric buffer to be reset, even though Vector itself hasn't restarted. It's just a bit misleading, because the only errors we get from that sink are ones that come about from a connection reset, rather than "handled" errors that would presumably cause the metric to increase monotonically.

dsmith3197 commented 1 year ago

Thanks for the additional information @marcus-crane! Could you share the error logs for the Docker Logs source as well? I would like to confirm what errors the Docker Logs source is encountering. Thank you!