To add on to this, what we're seeing here may not be isolated to this configuration but simply easy to reproduce with it. We have started to see soak runs fail because of NaN exceptions. If you pull the underlying capture data, what's evident is that Vector will accept load briefly and then soft-lock. For instance, this soak run fails, but when you pull the underlying capture files for fluent-elasticsearch you see:
So, Vector ran under load the whole time. On the other hand, you can find samples that look like this:
This is http-to-http-noacks from the same linked run.
All of this is to say that there seem to be related problems with other configs, but it's not clear whether the problem identified above is unique to this config or a symptom of a more general problem.
As part of this issue, we should also re-enable the `http-to-http-noacks` soak disabled by https://github.com/vectordotdev/vector/pull/11712. I'm hopeful that the underlying issue is the same. If it isn't, we can split off that soak into its own issue.
Ah, yes, thank you for calling that out. Totally spaced on doing so.
To share more details on this ongoing investigation, we've arrived at `LimitedSender`/`LimitedReceiver`, wherein semaphore permits are not freed for some reason, preventing any further progress. This was tested on 9aabd3632, a recent revision of `master`. The only modification, beyond debug statements, is arbitrarily changing `CHUNK_SIZE` to 777 to confirm a suspicion. By default, `CHUNK_SIZE = 1000`, and lock up should still occur.
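To make the symptom concrete, here is a minimal sketch of the failure mode in isolation. This is not Vector's buffer code; it only models a send path that consumes semaphore permits which the receive side never returns, which is enough to reproduce the "no further progress" behaviour described above (capacity and item counts here are arbitrary):

```rust
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Semaphore;
use tokio::time::timeout;

#[tokio::main]
async fn main() {
    // Hypothetical capacity standing in for the buffer's configured limit.
    let semaphore = Arc::new(Semaphore::new(4));

    // "Send" path: each send consumes a permit for the item it pushes.
    for i in 0..4 {
        let permit = semaphore.clone().acquire_owned().await.unwrap();
        // Model the suspected bug: the receive side never gives this permit
        // back (no add_permits call, and the permit itself is forgotten).
        permit.forget();
        println!("sent item {i}, permits left: {}", semaphore.available_permits());
    }

    // With every permit consumed and none ever freed, the next sender waits
    // forever -- the soft lock described above.
    match timeout(Duration::from_secs(1), semaphore.acquire()).await {
        Ok(_) => println!("acquired a permit (not expected here)"),
        Err(_) => println!("send path is stuck: no permit was ever returned"),
    }
}
```

In the real code the question is why the receive side stops returning permits, not whether this pattern can stall; the model just shows how little it takes.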
Run the following commands for `vector` and `lading`:
VECTOR_THREADS=8 firejail --noprofile --cpu=2,3,4,5,6,7,8,9 target/release/vector -c .tmp/pipeline-lock-up/http-to-blackhole.toml
firejail --noprofile --cpu=1 target/release/http_gen --config-path .tmp/pipelines-lock-up/http-to-http-noack.yaml
With lading generating very large payloads in this setup, Vector locks up almost immediately (evident in the internal metrics, or in the info logs showing events received by the blackhole).
We've likely ruled out a few potential issues:

- `pipeline`: lock up still occurs with `pipeline` entirely removed. While there might be other issues with `pipeline` included, it doesn't seem involved in this particular lock up situation.
- The `http` source itself: a request sent to the `http` source will be processed up to the point that `out.send_batch(events)` is called. Then, things stick, and you're left waiting for a response.

So far, lock up seems like a race condition/deadlock problem. The current candidate for investigation is `LimitedSender`/`LimitedReceiver`, which leverage semaphores to coordinate work, though it's still unclear why permits might not be freed.
If you compile the project with `tokio-console` bits included:

RUSTFLAGS="-Cforce-frame-pointers=yes --cfg tokio_unstable" CARGO_PROFILE_RELEASE_DEBUG=2 cargo build --release --features "tokio-console"

and run until a lock up occurs, you'll see lines in `console` like the one below, describing semaphore resources that appear to idle forever on acquiring permits.
tokio::sync::semaphore Semaphore ✅ /home/ubuntu/dd/vector/lib/vector-buffers/src/topology/channel/limited_queue.rs:293:46 permits=15545
One hypothesis from @blt is that the underlying `ArrayQueue` structure's `push`/`pop` functions might not be appropriately synchronized when reading from the internal `head` of the queue, leading to a race condition. Or, in general, there might be an issue in the mechanism behind the writer acquiring permits, pushing data into the internal queue, and awakening the reader to pop data from the queue, freeing the permits.
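To make the hypothesized cycle easier to talk about, here is a rough model of a semaphore-plus-`ArrayQueue` channel, written from the description above rather than from `limited_queue.rs` itself, so the names and details are assumptions. It walks through the steps the hypothesis is about: the writer acquires permits and pushes, the reader is woken, and the reader pops and frees the permits; a hole anywhere in that chain leaves permits held forever.

```rust
use std::sync::Arc;
use crossbeam_queue::ArrayQueue;
use tokio::sync::{Notify, Semaphore};

// A rough model of the writer/reader cycle described above; not Vector's code.
struct Shared<T> {
    queue: ArrayQueue<T>,
    capacity: Semaphore,
    wake_reader: Notify,
}

async fn send<T>(shared: &Shared<T>, item: T) {
    // 1. Writer acquires a permit; blocks here once capacity is exhausted.
    shared.capacity.acquire().await.unwrap().forget();
    // 2. Writer pushes into the shared queue...
    let _ = shared.queue.push(item);
    // 3. ...and wakes the reader.
    shared.wake_reader.notify_one();
}

async fn recv<T>(shared: &Shared<T>) -> T {
    loop {
        // 4. Reader pops. If push/pop disagree about the queue's head, or the
        //    wakeup never lands, this is where progress would stop.
        if let Some(item) = shared.queue.pop() {
            // 5. Reader frees the permit, unblocking a waiting writer. If this
            //    step is ever skipped, writers stall exactly as observed.
            shared.capacity.add_permits(1);
            return item;
        }
        shared.wake_reader.notified().await;
    }
}

#[tokio::main]
async fn main() {
    let shared = Arc::new(Shared {
        queue: ArrayQueue::new(4),
        capacity: Semaphore::new(4),
        wake_reader: Notify::new(),
    });

    let reader = {
        let shared = Arc::clone(&shared);
        tokio::spawn(async move {
            for _ in 0..8 {
                println!("received {}", recv::<u32>(&shared).await);
            }
        })
    };

    for n in 0..8u32 {
        send(&shared, n).await;
    }
    reader.await.unwrap();
}
```

This model runs to completion with tokio's `Semaphore` and `Notify`; the open question in the issue is whether the real implementation's queue access or wakeup handling leaves a window where step 4 or 5 never happens.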
The change to the `lading` config to generate larger payload sizes (i.e. with more events) was recent and seems to consistently and quickly cause lock up. Perhaps there's a clue there.
I went for a walk and thought about this some more. I think it's possible that the waker notification will arrive before data is available in the underlying queue, but I'm not sure that this would cost anything more than additional polls. So, more latency but not a full lock up.
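For what that scenario would look like at the poll level, here is a hypothetical receiver poll body (again, not Vector's `LimitedReceiver`): an early wakeup just means `pop` returns `None` and the task reports `Pending` again, so the cost is an extra poll rather than a stall, provided a later wakeup still arrives.

```rust
use std::task::{Context, Poll};
use crossbeam_queue::ArrayQueue;

// Hypothetical poll body for the reader side; not Vector's LimitedReceiver.
fn poll_recv<T>(queue: &ArrayQueue<T>, cx: &mut Context<'_>) -> Poll<T> {
    match queue.pop() {
        // The pushed item is visible: hand it out (real code would also free
        // the corresponding semaphore permits here).
        Some(item) => Poll::Ready(item),
        None => {
            // Woken before the push became visible: re-register interest and
            // report Pending. That is extra latency, not a lock up -- unless
            // the writer's *next* wakeup is lost, in which case this task is
            // never polled again.
            cx.waker().wake_by_ref(); // crude self-wake; a real impl would stash the waker for the writer
            Poll::Pending
        }
    }
}
```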
Oh, it might also be interesting to see if you can replicate the problem with #11732.
Problem
As a part of my work on #10144 I've recently started to notice that vector -- see config below -- reliably locks up under load. When I say "lock" I mean that load generation from lading halts as Vector signals back-off, and the blackhole sink no longer receives from upstream in the topology. Something important somewhere in Vector fills up and the program halts work.
Using the configuration files below you may reproduce this like so:
From the lading root:
Vector's blackhole will print and then taper off.
Configuration
The referenced bootstrap log file is: