vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Vector agent loses/does not send all metrics when being offline #21410

Open freak12techno opened 3 weeks ago

freak12techno commented 3 weeks ago

Problem

We are planning to integrate Vector into one of our projects. Our idea is an architecture where multiple servers each send data to a central server-side Vector, which forwards it to Prometheus remote write. The problem is that if the machine a Vector agent is running on is offline for some period of time (and that's often the case for us), Vector loses some metrics. This happens almost every time a machine running a Vector agent loses its internet access. Example:

[image: screenshot showing the gap in metrics]

Here I disabled WiFi on the laptop running my agent from 18:17 to 12:55, and there's a gap in all metrics between ~19:57 and ~12:57, so it effectively lost almost all of the metrics.

I also tried writing to Prometheus remote write directly from the agent instead of going through the server-side Vector, and it yielded the same result, so the server Vector doesn't seem to be the issue. It looks like the Vector agent either stops collecting metrics once it's offline, or sends them in a way that prevents Prometheus from recording them.

This is critical for us, and we'd like to know whether we've misconfigured something or whether there's a bug in the Vector agent that causes this.

Configuration

timezone = "UTC"

[sources.node_metrics]
type = "host_metrics"
namespace = "node"
scrape_interval_secs = 5

[sources.internal_metrics]
type = "internal_metrics"

[transforms.add_serial_metrics]
type = "remap"
inputs = ["node_metrics", "internal_metrics"]
source = """
.tags.host = get_env_var!("SERIAL")
.tags.hardware = "sierra"
"""

[sinks.vector_metrics]
type = "vector"
healthcheck.enabled = false
inputs = [ "add_serial_metrics" ]
address = "x.y.z.a:bbb"
compression = true
buffer.type = "disk"
buffer.max_size = 268435488
request.retry_max_duration_secs = 60

# Also tried this, to write to Prometheus directly, same result.
#[sinks.vector_metrics2]
#type = "prometheus_remote_write"
#healthcheck.enabled = false
#inputs = [ "add_serial_metrics" ]
#endpoint = "https://x.y.z.a:bbb/api/v1/write"
#auth = { strategy = "bearer", token = "xxx" }
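One thing worth double-checking in a setup like this is what the disk buffer does when it fills up during a long offline period. This is a sketch only, not part of the original report: Vector sinks accept a `buffer.when_full` option, and making it explicit rules out silent drops once the buffer reaches `max_size` (if I recall correctly, 268435488 bytes is also the minimum allowed disk buffer size).

```toml
# Hypothetical variant of the sink's buffer settings from the report above.
[sinks.vector_metrics.buffer]
type = "disk"
max_size = 268435488    # bytes; size this to cover the expected offline window
when_full = "block"     # back-pressure the pipeline instead of dropping events
```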

Version

FROM timberio/vector:0.41.1-alpine

Debug Output

Once the internet is out, here's what happens in the logs (many repeated messages like these):

2024-10-02T15:50:14.719369Z  WARN sink{component_kind="sink" component_id=vector_metrics component_type=vector}:request{request_id=489}: vector::sinks::util::retries: Retrying after error. error=Request failed: status: Unavailable, message: "error trying to connect: tcp connect error: Connection refused (os error 111)", details: [], metadata: MetadataMap { headers: {} } internal_log_rate_limit=true
2024-10-02T15:50:14.719382Z DEBUG sink{component_kind="sink" component_id=vector_metrics component_type=vector}:request{request_id=489}: vector::sinks::util::retries: Retrying request. delay_ms=40657
2024-10-02T15:50:15.365873Z DEBUG transform{component_kind="transform" component_id=add_serial_metrics component_type=remap}: vector::utilization: utilization=0.0003001317441953306
2024-10-02T15:50:24.831105Z DEBUG sink{component_kind="sink" component_id=vector_metrics component_type=vector}:request{request_id=398}: hyper::client::connect::http: connecting to x.y.z.a:bbb
2024-10-02T15:50:24.833180Z  WARN sink{component_kind="sink" component_id=vector_metrics component_type=vector}:request{request_id=398}: vector::sinks::util::retries: Retrying after error. error=Request failed: status: Unavailable, message: "error trying to connect: tcp connect error: Connection refused (os error 111)", details: [], metadata: MetadataMap { headers: {} } internal_log_rate_limit=true
2024-10-02T15:50:24.833240Z DEBUG sink{component_kind="sink" component_id=vector_metrics component_type=vector}:request{request_id=398}: vector::sinks::util::retries: Retrying request. delay_ms=7551
2024-10-02T15:50:25.361765Z DEBUG transform{component_kind="transform" component_id=add_serial_metrics component_type=remap}: vector::utilization: utilization=0.00026859163336423795
2024-10-02T15:50:30.367008Z DEBUG transform{component_kind="transform" component_id=add_serial_metrics component_type=remap}: vector::utilization: utilization=0.0002844311638970257
2024-10-02T15:50:32.390641Z DEBUG sink{component_kind="sink" component_id=vector_metrics component_type=vector}:request{request_id=398}: hyper::client::connect::http: connecting to x.y.z.a:bbb
2024-10-02T15:50:32.392662Z  WARN sink{component_kind="sink" component_id=vector_metrics component_type=vector}:request{request_id=398}: vector::sinks::util::retries: Internal log [Retrying after error.] is being suppressed to avoid flooding.

Then, once the machine is back online, there are a lot of repeated messages like these: https://gist.github.com/freak12techno/a79d04e226d7e33819162a6da76cb144
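For debugging behavior like this while the machine is offline, one option (an assumption on my part, not something the report uses) is to enable Vector's local API so that `vector top` can show per-component throughput and buffer activity live:

```toml
# Sketch only: enables the GraphQL API that `vector top` connects to.
[api]
enabled = true
address = "127.0.0.1:8686"
```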

Example Data

No response

Additional Context

No response

References

No response

jszwedko commented 3 weeks ago

Hi @freak12techno ,

Thanks for filing this issue. I'm a little confused, though: when Vector is offline, it won't be able to collect metrics via the host_metrics and internal_metrics sources, as both of those sources are "realtime", so I think what you are seeing is expected behavior. Am I missing something? 🤔