vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Vector stops processing http source requests after 12-24 hours of production run #12068

Open diegolbx opened 2 years ago

diegolbx commented 2 years ago

Problem

We are running Vector, the latest 0.20 version, on an Ubuntu 22.04 LXD container. Vector is configured with three different types of sources (syslog + nginx + http) and two sinks (New Relic + Humio).

After 12-24 hours of continuous operation at an http log ingestion rate of about 6 GB/day, the Vector http source stops responding. The process stays up and the syslog and nginx sources/sinks keep running perfectly. About 20 clients connect to the Vector http source concurrently at any moment; factoring in logging concurrency, there may be more than 100 connections to the http source at any time.

We did a simple telnet test against the http source port and got disconnected immediately. Running strace on Vector while this telnet test is performed shows no activity within Vector, which looks like a mutex deadlock:

strace: Process 1172 attached
futex(0x7f9a4ffdf040, FUTEX_WAIT_PRIVATE, 1, NULL
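
For anyone wanting to dig deeper than strace, dumping a backtrace of every thread should show what is actually sitting behind that futex wait. A minimal sketch of how that could be captured, assuming gdb is installed on the host and using the PID 1172 from above:

# Attach non-interactively, print a backtrace for every thread,
# then detach (Vector keeps running afterwards).
gdb -p 1172 -batch -ex 'set pagination off' -ex 'thread apply all bt'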

Configuration

We are running several sources: 1) syslog UDP, 2) nginx, 3) http (ndjson).

When the http source stops responding, both the syslog and nginx sources keep working, along with their sinks. The only issue is that the http source's TCP acceptor appears to be deadlocked and does not respond to new requests.

Below is our http config; we stripped out the working syslog/nginx configs.

[api]
enabled = true

[sources.http]
type = "http"
address = "0.0.0.0:8080"
encoding = "ndjson"

[transforms.http_parse_json]
type = "remap"
inputs = [ "http" ]
source = '''
del(.path)
del(.source_type)
del(.app)
'''

[sinks.new_relic]
  type = "new_relic_logs"
  inputs = ["http_parse_json"]
  license_key = "xxx"

[sinks.humio]
inputs = ["http_parse_json"]
type = "humio_logs"
encoding.codec = "json"
host = "https://cloud.community.humio.com"
token = "xxx"

Version

0.20

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

Qubitium commented 2 years ago

We have written a script that monitors the HTTP source port on the Vector instance every minute and restarts Vector if a 200 status code is not returned for an empty JSON {} packet.
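
The check is roughly along these lines (a simplified sketch, not our exact script; the service name, localhost port, and request path are assumptions based on the config in the original report):

#!/bin/sh
# Post an empty JSON object to the http source and capture the status code.
# curl reports 000 when the connection is refused or never completes.
code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
  -H 'Content-Type: application/json' -d '{}' http://127.0.0.1:8080/)
if [ "$code" != "200" ]; then
  echo "$(date) vector http died: $code restart vector"
  systemctl restart vector
fi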

Notice the frequency at which the Vector HTTP source locks up. Based on our logs, the lockups are much more frequent than previously thought, at a rate of about one per hour on average.

@jszwedko We don't know what more info we can provide so that the Vector team can fix this pretty serious bug. We are willing to compile custom debug builds if required; just let us know.

For reference, the Vector HTTP source is handling about 15 GB of uncompressed JSON daily. Again, the non-HTTP sources (syslog and others) are not affected; they keep working even when the HTTP source locks up.

EDIT: We just updated to the 0.21 release and are holding on to a thread of hope that this version may fix our issue.

Tue Apr 19 01:11:01 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 01:12:01 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 02:51:01 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 02:52:01 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 04:26:01 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 05:52:01 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 07:43:01 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 07:44:01 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 08:02:01 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 08:03:01 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 08:04:02 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 08:28:01 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 08:29:01 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 09:05:01 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 09:06:01 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 09:06:01 AM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 03:10:02 PM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 03:11:01 PM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 05:01:01 PM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 05:02:01 PM UTC 2022 vector http died: 000 restart vector
Tue Apr 19 08:32:01 PM UTC 2022 vector http died: 000 restart vector
Wed Apr 20 12:19:01 AM UTC 2022 vector http died: 000 restart vector
Wed Apr 20 12:20:01 AM UTC 2022 vector http died: 000 restart vector
Wed Apr 20 02:16:01 AM UTC 2022 vector http died: 000 restart vector
Wed Apr 20 02:32:01 AM UTC 2022 vector http died: 000 restart vector
Wed Apr 20 02:33:01 AM UTC 2022 vector http died: 000 restart vector
Wed Apr 20 03:21:01 AM UTC 2022 vector http died: 000 restart vector
Wed Apr 20 03:22:01 AM UTC 2022 vector http died: 000 restart vector
Wed Apr 20 04:11:01 AM UTC 2022 vector http died: 000 restart vector
Wed Apr 20 04:12:01 AM UTC 2022 vector http died: 000 restart vector
Wed Apr 20 08:05:01 AM UTC 2022 vector http died: 000 restart vector
Wed Apr 20 08:06:01 AM UTC 2022 vector http died: 000 restart vector
Wed Apr 20 08:07:01 AM UTC 2022 vector http died: 000 restart vector
Wed Apr 20 05:48:01 PM UTC 2022 vector http died: 000 restart vector
Wed Apr 20 05:49:01 PM UTC 2022 vector http died: 000 restart vector
Thu Apr 21 12:41:01 AM UTC 2022 vector http died: 000 restart vector

Qubitium commented 2 years ago

After updating to 0.21 and observing for a few hours, we can verify there is no change: the HTTP source port hanging issue still persists at random intervals.

jszwedko commented 2 years ago

Hi @diegomontoya !

Thanks for this report. We haven't yet been able to dig into it, but I appreciate your willingness to run a custom build for debugging. If we aren't able to reproduce, that would definitely be helpful.

jches commented 2 years ago

I believe I am running into this or a very similar bug as well, but with a kafka source. I'm running v0.23.3 with a few bugfix patches that I backported to the 0.23.x branch.

strace shows the same output:

strace: Process 18730 attached
futex(0x7f52b7a79248, FUTEX_WAIT_PRIVATE, 1, NULL

One Vector thread is stuck at 100% CPU, the source is not polling Kafka to consume new messages, and the client eventually times out. So it looks like the source future stops getting polled for some reason.
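
One way to see where that spinning thread is burning its CPU (a sketch, assuming perf is available on the host and the process is named vector):

# List per-thread CPU usage for the vector process to find the busy thread ID...
top -H -p "$(pgrep -x vector)"
# ...then profile just that thread (replace <TID> with the 100% CPU thread id).
perf top -t <TID>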

I have a similar remap transform as shown in the OP, and a kafka sink. Might there be a mutex somewhere along the topology that is somehow not getting released? Very happy to try patches or investigate a fix if someone can point me in a direction to look in the code!

scirner22 commented 2 years ago

The problem described here seems similar to the behavior I also see with a vector sink: https://github.com/vectordotdev/vector/issues/14932.

neuronull commented 1 year ago

Excluding the latest comment, there appear to have been two distinct reported occurrences of this issue.

The common denominator between them is the remap transform.

arouene commented 1 year ago

Same problem here, with a journald source and a Datadog sink. After a while we experience a "deadlock":

# strace -p 2451662
strace: Process 2451662 attached
futex(0x7fdaa90d84e0, FUTEX_WAIT_PRIVATE, 1, NULL

Restarting the service doesn't help.

arouene commented 1 year ago

I was able to resolve my problem by reducing the batch size in the sink configuration. Here's the updated configuration that worked for me:

[sinks.datadog]
type = "datadog_logs"
...
batch.max_bytes = 42500
batch.max_events = 100

It's worth noting that I have other machines with similar configurations and more logs, but they are working perfectly fine without the batch configuration. It seems that the issue is specific to this particular machine's setup :man_shrugging:

kinghrothgar commented 1 year ago

I also ran into this issue with 0.28.1. It appears that the default batch settings for the Datadog sink can lead to Vector hanging. I set mine to batch.max_events = 10000 and it appears to have fixed the issue.

@arouene As for why only certain machines are experiencing it: from my testing, it requires a spike in events. By default, a batch is triggered when either 2 seconds have passed or 100000 events have been collected. I'm guessing the machines that weren't working have spikes of events that trigger whatever batching bug we're hitting.

pront commented 1 year ago

An attempt to summarize the above: the http source causes a Vector hang (deadlock?) after receiving big spikes of events over a period of 12+ hours.