diegolbx opened 2 years ago
We have written a script that probes the HTTP source port on the Vector instance every minute and restarts Vector if an empty JSON {} payload does not get a 200 response.
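For context, a minimal sketch of that kind of check is below; the listen address, the systemd unit name, and the exact curl flags are assumptions for illustration, not our exact script.
#!/bin/sh
# Minimal health-check sketch, intended to run from cron every minute.
# Assumptions (not our exact setup): the http source listens on 127.0.0.1:8080
# and Vector runs under systemd as the "vector" unit.
STATUS=$(curl -s -o /dev/null -w '%{http_code}' -m 10 \
  -H 'Content-Type: application/json' -d '{}' http://127.0.0.1:8080/)
if [ "$STATUS" != "200" ]; then
  # curl reports 000 when it gets no HTTP response at all, which matches the log lines below.
  echo "$(date -u) vector http died: $STATUS restart vector"
  systemctl restart vector
fi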
Notice the frequency at which the Vector HTTP source locks up. Based on our logs, the lockups are much more frequent than previously thought, averaging about one per hour.
@jszwedko We're not sure what additional information we can provide so the Vector team can fix this fairly serious bug. We are willing to compile custom debug builds if required; just let us know.
For reference, the Vector HTTP source handles about 15 GB of uncompressed JSON per day. Again, the non-HTTP sources (syslog and others) are not affected; they keep working even when the HTTP source locks up.
EDIT: We just updated to the 0.21 release and are holding onto a thread of hope that this version may fix our issue.
Tue Apr 19 01:11:01 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 01:12:01 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 02:51:01 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 02:52:01 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 04:26:01 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 05:52:01 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 07:43:01 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 07:44:01 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 08:02:01 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 08:03:01 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 08:04:02 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 08:28:01 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 08:29:01 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 09:05:01 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 09:06:01 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 09:06:01 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 03:10:02 PM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 03:11:01 PM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 05:01:01 PM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 05:02:01 PM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 08:32:01 PM UTC 2022 vector http died: 000 restart vector
Wed Apr 20 12:19:01 AM UTC 2022 vector http died: 000 restart vector
Wed Apr 20 12:20:01 AM UTC 2022 vector http died: 000 restart vector
Wed Apr 20 02:16:01 AM UTC 2022 vector http died: 000 restart vector
Wed Apr 20 02:32:01 AM UTC 2022 vector http died: 000 restart vector
Wed Apr 20 02:33:01 AM UTC 2022 vector http died: 000 restart vector
Wed Apr 20 03:21:01 AM UTC 2022 vector http died: 000 restart vector
Wed Apr 20 03:22:01 AM UTC 2022 vector http died: 000 restart vector
Wed Apr 20 04:11:01 AM UTC 2022 vector http died: 000 restart vector
Wed Apr 20 04:12:01 AM UTC 2022 vector http died: 000 restart vector
Wed Apr 20 08:05:01 AM UTC 2022 vector http died: 000 restart vector
Wed Apr 20 08:06:01 AM UTC 2022 vector http died: 000 restart vector
Wed Apr 20 08:07:01 AM UTC 2022 vector http died: 000 restart vector
Wed Apr 20 05:48:01 PM UTC 2022 vector http died: 000 restart vector
Wed Apr 20 05:49:01 PM UTC 2022 vector http died: 000 restart vector
Thu Apr 21 12:41:01 AM UTC 2022 vector http died: 000 restart vector
After updating to 0.21 and observing for a few hours, we can confirm no change: the HTTP source port hanging issue still occurs at random intervals.
Hi @diegomontoya!
Thanks for this report. We haven't yet been able to dig into it, but I appreciate your willingness to run a custom build for debugging. If we aren't able to reproduce, that would definitely be helpful.
I believe I am running into this or a very similar bug as well, but with a kafka source. I'm running v0.23.3 with a few bugfix patches that I backported to the 0.23.x branch.
strace shows the same output:
strace: Process 18730 attached
futex(0x7f52b7a79248, FUTEX_WAIT_PRIVATE, 1, NULL
One vector thread is stuck at 100% CPU, the source is not polling kafka to consume new messages, and the client eventually times out. So it looks like the source future stops getting polled for some reason.
I have a similar remap transform as shown in the OP, and a kafka sink. Might there be a mutex somewhere along the topology that is somehow not getting released? Very happy to try patches or investigate a fix if someone can point me at where to look in the code!
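In case it helps narrow things down, here is a minimal debugging sketch (an assumption on my part that gdb is installed and the binary has usable symbols; not something the Vector team has asked for) to dump backtraces of every thread in the hung process, so the futex-waiting thread and the 100%-CPU thread both show up in one capture.
#!/bin/sh
# Sketch only: grab a backtrace of every thread in the hung Vector process.
# Assumes gdb is installed and a single vector process is running.
PID=$(pidof vector)
gdb --batch -p "$PID" \
    -ex 'set pagination off' \
    -ex 'thread apply all bt' \
    -ex 'detach' -ex 'quit' > "vector-threads-$PID.txt"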
The problem described here seems similar to the behavior I see with a vector sink as well: https://github.com/vectordotdev/vector/issues/14932.
Excluding the latest comment, there appear to have been two distinct reported occurrences of this issue.
The common denominator between them is the remap transform.
Same problem here, with a journald source and a datadog sink; after a while we experience a "deadlock":
# strace -p 2451662
strace: Process 2451662 attached
futex(0x7fdaa90d84e0, FUTEX_WAIT_PRIVATE, 1, NULL
Restarting the service doesn't help.
I was able to resolve my problem by reducing the batch size in the sink configuration. Here's the updated configuration that worked for me:
[sinks.datadog]
type = "datadog_logs"
...
batch.max_bytes = 42500
batch.max_events = 100
It's worth noting that I have other machines with similar configurations and more logs, but they are working perfectly fine without the batch configuration. It seems that the issue is specific to this particular machine's setup :man_shrugging:
I also ran into this issue with 0.28.1. It appears that the default batch settings for the datadog sink can lead to Vector hanging. I set mine to batch.max_events = 10000 and it appears to have fixed the issue.
@arouene as far as why only certain machines are experiencing it: from my testing, it requires a spike in events. By default, a batch is triggered when either 2 seconds have passed or 100000 events have been collected. I'm guessing the machines that weren't working had spikes of events that triggered whatever batching bug we're hitting.
An attempt to summarize the above: the http source causes a Vector hang (deadlock?) after receiving big spikes of events over a period of 12+ hours.
Problem
We are running Vector, latest 0.20 version, on an Ubuntu 22.04 LXD container. Vector is configured with 3 different types of sources (syslog + nginx + http) and 2 sinks (newrelic + humio).
After 12-24 hours of continuous operation, with an http log ingestion rate of about 6 GB/day, the Vector http source stops responding. The process is up and the syslog + nginx sources/sinks keep running perfectly. We have about 20 clients connecting to the Vector http source concurrently at any moment; factoring in logging concurrency, there may be 100+ connections to the http source at any time.
We did a simple telnet test against the http source port and got disconnected immediately. Running strace on Vector while this telnet test is performed shows no activity within Vector; likely a mutex deadlock?
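For anyone trying to reproduce, here is a rough sketch of the kind of concurrent ndjson traffic involved; the address, payload shape, and client count below are illustrative, not our production values.
#!/bin/sh
# Rough reproduction sketch: simulate a spike of ~100 concurrent clients POSTing
# newline-delimited JSON to the http source. Values are made up for illustration.
for i in $(seq 1 100); do
  (
    while true; do
      printf '{"client":%s,"msg":"test event","ts":"%s"}\n' "$i" "$(date -u +%s)" |
        curl -s -o /dev/null -X POST --data-binary @- http://127.0.0.1:8080/
    done
  ) &
done
wait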
Configuration
We are running several sources: 1) syslog (UDP), 2) nginx, 3) http (ndjson).
When the http source stops responding, both the syslog and nginx sources keep working along with the relevant sinks. The only issue is that the http source's TCP acceptor appears to be deadlocked(?) and not responding to new requests.
Below is our http config; we stripped out the working syslog/nginx configs.
Version
0.20
Debug Output
No response
Example Data
No response
Additional Context
No response
References
No response