Open · marcus-crane opened this issue 1 year ago
Hi and thanks for the bug report,
Hmm, counter metrics emitted from Vector should always be monotonic... The only reason I'm aware of for them to go backwards would be if the process is restarted and thus reset to zero.
Can you confirm that, in those cases with the socket source, Vector was running continuously and didn't restart?
It can also happen because of metric expiration, but that doesn't seem to apply to this issue either: https://github.com/vectordotdev/vector/discussions/16322#discussioncomment-4936957
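For reference, a minimal sketch of the metric expiration setting in question, assuming the global expire_metrics_secs option is what is meant here (the value is only illustrative; by default this is unset and internal metrics do not expire):
# Global Vector option (TOML): internal metrics that receive no updates for
# this many seconds are dropped; if they start updating again later, their
# counters restart from zero.
expire_metrics_secs = 300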
Hmm, I had a look at a random host and it doesn't seem that there is any instance of Vector being restarted and/or the host being killed.
I also had a brief look over our Vector instances across hosts, first by source errors and then by sink errors. Nothing scientific (I'll have a closer look tomorrow as it's a public holiday here) but I wonder if it's more broadly that sources are monotonic while sinks aren't.
In saying that, we don't currently have any other sources or sinks that emit errors, so I don't have any evidence of that, but I'll see if I can generate some errors locally tomorrow and test whether that holds up.
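For what it's worth, a minimal config along these lines should be enough to reproduce the socket send errors locally; the component names and address below are made up for illustration, and the assumption is that nothing is listening on that port:
[sources.demo]
type = "demo_logs"
format = "json"

[sources.vector_metrics]
type = "internal_metrics"

[sinks.log_forwarder]
# TCP socket sink pointed at a port with no listener, so every send fails
type = "socket"
inputs = ["demo"]
mode = "tcp"
address = "127.0.0.1:9999"
encoding.codec = "json"

[sinks.metrics_out]
# Print the internal metrics so component_errors_total can be inspected
type = "console"
inputs = ["vector_metrics"]
encoding.codec = "json"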
Hi @marcus-crane,
Could you share the corresponding error logs coming from the Docker Logs source? This will help determine what is going wrong here.
Hey @dsmith3197,
Sure, here are some example graphs from earlier today:
and here are some logs:
$ journalctl -u vector | grep sink | less
Nov 08 01:01:01 ecs-08d1a609999f93abf vector[14201]: 2023-11-08T01:01:01.319567Z ERROR sink{component_kind="sink" component_id=log_forwarder component_type=socket component_name=log_forwarder}: vector::internal_events::socket: Error sending data. error=Connection reset by peer (os error 104) error_code="socket_send" error_type="writer_failed" stage="sending" mode=tcp internal_log_rate_limit=true
Nov 08 01:01:01 ecs-08d1a609999f93abf vector[14201]: 2023-11-08T01:01:01.319605Z ERROR sink{component_kind="sink" component_id=log_forwarder component_type=socket component_name=log_forwarder}: vector_common::internal_event::component_events_dropped: Events dropped intentional=false count=1 reason="Error sending data." internal_log_rate_limit=true
Nov 08 01:01:01 ecs-08d1a609999f93abf vector[14201]: 2023-11-08T01:01:01.325213Z ERROR sink{component_kind="sink" component_id=log_forwarder component_type=socket component_name=log_forwarder}: vector::internal_events::common: Internal log [Unable to connect.] has been suppressed 2 times.
Nov 08 01:01:01 ecs-08d1a609999f93abf vector[14201]: 2023-11-08T01:01:01.325230Z ERROR sink{component_kind="sink" component_id=log_forwarder component_type=socket component_name=log_forwarder}: vector::internal_events::common: Unable to connect. error=Connect error: Connection refused (os error 111) error_code="failed_connecting" error_type="connection_failed" stage="sending" internal_log_rate_limit=true
Nov 08 01:01:01 ecs-08d1a609999f93abf vector[14201]: 2023-11-08T01:01:01.887022Z ERROR sink{component_kind="sink" component_id=log_forwarder component_type=socket component_name=log_forwarder}: vector::internal_events::socket: Internal log [Error sending data.] is being suppressed to avoid flooding.
Nov 08 01:01:01 ecs-08d1a609999f93abf vector[14201]: 2023-11-08T01:01:01.887052Z ERROR sink{component_kind="sink" component_id=log_forwarder component_type=socket component_name=log_forwarder}: vector_common::internal_event::component_events_dropped: Internal log [Events dropped] is being suppressed to avoid flooding.
Nov 08 01:01:01 ecs-08d1a609999f93abf vector[14201]: 2023-11-08T01:01:01.889805Z ERROR sink{component_kind="sink" component_id=log_forwarder component_type=socket component_name=log_forwarder}: vector::internal_events::common: Internal log [Unable to connect.] is being suppressed to avoid flooding.
and just to confirm, Vector has been running the whole time as far as systemd is concerned:
$ systemctl status vector
● vector.service - Vector
Loaded: loaded (/lib/systemd/system/vector.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2023-10-18 21:17:47 UTC; 2 weeks 6 days ago
As for the connection errors themselves, they're valid and come about due to an instance behind a load balancer becoming unhealthy at certain times.
Now that I think about it, I suppose it's just that when the socket connection is reset, the metric buffer gets reset as well, even though Vector itself hasn't restarted. It's just a bit misleading because the only errors we get from that sink are ones caused by a connection reset, rather than "handled" errors that would presumably cause the metric to increase monotonically?
Thanks for the additional information @marcus-crane! Could you share the error logs for the Docker Logs source as well? I would like to confirm what errors the Docker Logs source is encountering. Thank you!
Problem
Hi there,
I'm currently in the process of migrating our production logging pipeline over to Vector and as part of that, I've been setting up some dashboards.
During this, I noticed that the vector.component_errors_total metric appears to be inconsistent, acting monotonically (or not) depending on how you filter it. To be clear, the metric is a count within Datadog (which is what we use), and count metrics are not required to be monotonic.
As an example, here is the aforementioned metric in Datadog, showing errors for the docker_logs source:
The metric, filtered in this way, acts monotonically.
This is in contrast to the socket sink which we are currently using:
Here, the metric does not act monotonically.
This actually tripped me up initially: I had set up some metrics for the socket sink and then, with my mental model being that these are not monotonic metrics, I interpreted the docker_logs metric as saying the Docker daemon was on fire.
It is possible to get a more accurate view by applying the monotonic_diff function to a monotonic metric (like the docker_logs errors), but that function is not usable in most handy non-timeseries views such as Query Value.
Ideally, I think I would prefer that all metrics were consistently non-monotonic, but mainly I'm just curious to know whether this has come up before, as I haven't found anything through searching.
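For concreteness, the kind of Datadog query meant here would look roughly like the following; the tag name is an assumption based on the tags Vector attaches to its internal telemetry (component_id, component_type, etc.), and the exact filter depends on how the metrics are shipped:
monotonic_diff(sum:vector.component_errors_total{component_type:docker_logs})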
Configuration
Version
vector 0.33.0 (x86_64-unknown-linux-gnu 89605fb 2023-09-27 14:18:24.180809939)
Debug Output
Example Data
N/A
Additional Context
We run Vector on Ubuntu AWS EC2 hosts where it receives logs from a Docker Daemon configured to run AWS ECS workloads.
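For illustration, the shape of the pipeline described above is roughly as follows. This is a sketch only: the sink component_id (log_forwarder) and the component types are taken from the logs in this issue, while the source name, the address, and the path by which metrics reach Datadog (shown here as the datadog_metrics sink) are assumptions:
[sources.docker]
type = "docker_logs"

[sources.vector_metrics]
type = "internal_metrics"

[sinks.log_forwarder]
# Forwards container logs over TCP to an endpoint behind a load balancer
type = "socket"
inputs = ["docker"]
mode = "tcp"
address = "logs.internal.example:9000"
encoding.codec = "json"

[sinks.datadog]
# Assumed path for getting component_errors_total into Datadog
type = "datadog_metrics"
inputs = ["vector_metrics"]
default_api_key = "${DATADOG_API_KEY}"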
References
No response