vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev

vector only sinks a tiny amount of data coming from a dnstap source, except if you sink to blackhole! #20739

james-stevens opened this issue 2 months ago

james-stevens commented 2 months ago


Problem

We are feeding dnstap data to vector from a number of DNS servers. Depending on the sink type used we see different amounts of data recorded, but never anywhere near what we would expect, even where there is ample spare capacity on the servers to record all the data.

Our plan had been to run vector on all DNS servers and aggregate the data to a central location running another vector, using the vector-to-vector source & sink. This setup would only record up to about 25K events/sec. As a test we reconfigured the DNS servers' vector to first record the data to local files, then to just blackhole it.
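For reference, the edge-to-aggregator wiring is the standard vector source & sink; a minimal sketch (the component names and addresses are illustrative, not our real ones):

# on each DNS server
sinks:
  to_aggregator:
    type: vector
    inputs: [ dnstap ]
    address: "aggregator.example.net:6000"

# on the central aggregator
sources:
  from_edges:
    type: vector
    address: "0.0.0.0:6000"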

No errors or messages are ever logged in vector or the DNS software (dnsdist). All that happens is that the dnstap data fails to be recorded; where it gets dropped is less clear.

With the sink set to blackhole in the local vector, it will successfully sink 400K events per second. With the sink set to file (on the local machine), it will successfully sink just under 80K events per second. With the sink set to vector, the remote vector will successfully record about 25K events per second.

In all cases the source was dnstap, as per the config. We only tested blackhole up to 400K/s; it may well go higher.
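For clarity, the only thing changed between these tests was the sink block; the blackhole variant was along these lines (sink name illustrative):

sinks:
  dnstap_to_blackhole:
    type: blackhole
    inputs: [ dnstap ]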

Running vector on the DNS server and saving to local files, with gzip compression enabled, we're seeing well under 5MB/s of disk writes and about 6% CPU utilisation (we have 64 cores), but we are still unable to get vector to successfully record more than 80K events per second.

Reconfiguring the local vector to send to a remote vector, using the vector source & sink, we can't successfully record more than 25K events per second. In this situation the remote machine will be saving to local files with gzip compression.

When connecting dnsdist to vector using a TCP socket, we saw a maximum of 2.6Mb in either the transmit or receive queues of the TCP connection, with a [rw]mem max & default setting of 25Mb.

Apart from changing the end sink, nothing seemed to have a noticeable effect on the amount of data we successfully recorded. At all times all servers seem to have ample spare capacity, each having 64 cores, 128GB RAM, a 4-disk SSD RAID-10 and 25Gb/s ethernet.

With log-level set to debug, we saw lines like this logged

Jun 25 11:31:40 vector[741405]: 2024-06-25T11:31:40.266817Z DEBUG sink{component_kind="sink" component_id=dnstap_to_file component_type=file}: vector::utilization: utilization=0.0005924399138748692

Under load, utilization would climb and peak at 0.98.

We are somewhat hampered in debugging this issue by the fact the dnstap source does not seem to record any metrics for events or bytes on the input.

We have tried everything we can think of to get more performance out of vector, but have failed; the vector tuning documentation simply says there are no tuning options.

With ample spare capacity on all servers, I would expect vector to be able to perform at a much higher level, up to the point we can see it hit one of the limitations of the hardware itself.

Configuration

api:
  enabled: true
  address: 127.0.0.1:8686

expire_metrics_secs: 300

sources:
  vector_metrics:
    type: internal_metrics

  services_metrics:
    type: prometheus_scrape
    scrape_interval_secs: 15
    scrape_timeout_secs: 2
    endpoints:
      - "http://127.0.0.1:9100/metrics"
      - "http://127.0.0.1:9153/metrics"

  dnstap:
    type: dnstap
    mode: tcp
    address: 127.0.0.1:9053
    multithreaded: true

sinks:
  output_my_prom:
    type: prometheus_exporter
    address: [some-ip-address]:9100
    inputs:
      - vector_metrics
      - services_metrics

  dnstap_to_file:
    inputs: [ dnstap ]
    type: file
    path: "/var/lib/vector/dnstap/%Y/%m/%d/%H/dnstap_%Y%m%d-%H%M.log.gz"
    compression: gzip
    buffer:
      max_events: 1000000
    healthcheck:
      enabled: true
    framing:
      method: "newline_delimited"
    encoding:
      codec: json
      metric_tag_values: full

Version

0.37.0

Debug Output

No response

Example Data

No response

Additional Context

We are running vector on bare-metal servers on RHEL9 using the RPM provided.

References

No response

jszwedko commented 2 months ago

Thanks for this issue @james-stevens ! A couple of quick thoughts:

james-stevens commented 2 months ago

thanks for your reply - what you say matches the conclusion I came to, which was basically that the back-pressure was causing the packets to get dropped at the dnstap interface (dns->vector). The DNS server was performing fine & responding to all queries but, if it couldn't record the query & response, then it's going to be using non-blocking i/o and just dropping the packets at the dnstap interface.

My question is why is the back-pressure occurring when all servers have HEAPS of spare hardware resources?

The file i/o is topping out at writing 5MB/s, which is prob less than 5% of what I can get out of that disk array. I bet most USB sticks could out-perform that - high-end USB-3 sticks top out at 150MB/s, tho 80MB/s is more common.

Screenshot 2024-06-26 at 09 20 56

Run a perf test on your laptop hard disk and tell me it's not at least 10x faster than that 😉

they can't write data as fast as the source is collecting it

right, but the question is "why not?" - cos it's not a limitation of the hardware it's running on, so it MUST be a limitation within vector.

When using the vector sink to the remote machine, we also had disk buffering on the DNS server's vector, and would see it catch up for some time after.

Interesting, the remote/central vector seemed to have a single through-put limit for all incoming data - so when one DNS server's vector had finished flushing its data, the others would speed up, maintaining a constant total input of just under 50K/s (at the central vector).
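For reference, the disk buffering mentioned here sits on the vector sink on the DNS servers; a minimal sketch (sink name, address and buffer size are illustrative):

sinks:
  to_aggregator:
    type: vector
    inputs: [ dnstap ]
    address: "aggregator.example.net:6000"
    buffer:
      type: disk
      max_size: 10737418240   # 10 GiB on-disk buffer, illustrative
      when_full: block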

So how do we improve the performance of the file sink?

BTW: the reason we started using the file sink was because we had been using the S3 sink, but were seeing very similar amounts of lost data. At 400K queries/s we lost over 95% of the data, prob more like 99%.

I think the next test we will do is set the central vector to sink to blackhole and see what happens.

Vector is meant to be a high-performance data collection tool, but it's not working like that for us. It's only collecting relatively small amounts of data, often losing over 90%, and working massively under the capabilities of the hardware it's running on, but the "Tuning" documentation says there's nothing we can do to improve that situation.

what gives?

james-stevens commented 2 months ago

Screenshot 2024-06-26 at 11 50 11

With this test, we reverted to sending the dnstap data from vector on the DNS servers to the vector aggregator using the vector-to-vector config, but this time also set request.adaptive_concurrency.max_concurrency_limit to 1000 on the DNS servers and set the vector aggregator to blackhole the data.
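In config terms, the only change on the DNS servers' vector sink was the request block, roughly as below (sink name and address illustrative):

sinks:
  to_aggregator:
    type: vector
    inputs: [ dnstap ]
    address: "aggregator.example.net:6000"
    request:
      adaptive_concurrency:
        max_concurrency_limit: 1000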

We sent 50K queries/s to the two DNS servers, which should have received 25K q/s each, creating 50K dnstap events/s each (query & response).

As you can see, the vector sink (on the DNS servers) processed a pretty steady 30K event/s, not the expected 50K, and the vector aggregator seemed to see most of that data, although not very smoothly !

When the test stopped, the DNS servers flushed their buffers to the aggregator for a short amount of time.

The CPU on the vector aggregator was about 97% idle and each DNS server was about 92% idle, with the buffering to disk causing a disk i/o of 42MB/s, proving vector can indeed get more than 5MB/s disk i/o on these servers.

We then changed vector on the DNS servers to once again save to local files, then to save to a RAM disk - literally ZERO difference. The only things that have made any difference so far are:

  1. blackholing the data on the DNS servers
  2. disabling "SentinelOne" (a Linux security tool), which gained about 30%, from 30K events/s to 40K events/s

Screenshot 2024-06-26 at 14 30 14

Here are the two DNS servers (with S1 disabled) running three different tests, from left to right:

  1. Send the data to a separate vector server (vector-to-vector), where it is blackholed
  2. Save to local files on the DNS server's SSD RAID
  3. Save to local files in a RAM Disk on the DNS Servers

As you can see, within the margin of error, despite three EXTREMELY different sinks, the amount of data reported going through the sink is identical.

This result is simply not possible without some kind of deliberate throttle within vector, IMHO.

But there's no documentation of such a throttle, or any options to tune it.

Remember: when sinking to a local blackhole, we saw 200K events/s on each of these two DNS Servers.

Something VERY odd is happening in vector for which I have failed to find any visibility (metrics etc) or explanation.

At all times we are collecting metrics from the operating system, the DNS s/w & vector using Prometheus.

james-stevens commented 2 months ago

By switching the file format to NOT compressed & extending buffer.max_events to 3 million, we've been able to get 72K events/s recorded locally to disk on each of the two DNS servers.
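In config terms, the only changes from the original file sink were the compression and buffer settings, roughly as below (we assume the .gz extension is dropped from the path when writing uncompressed):

sinks:
  dnstap_to_file:
    inputs: [ dnstap ]
    type: file
    path: "/var/lib/vector/dnstap/%Y/%m/%d/%H/dnstap_%Y%m%d-%H%M.log"
    compression: none        # was gzip
    buffer:
      max_events: 3000000    # was 1000000
    framing:
      method: "newline_delimited"
    encoding:
      codec: json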

But any higher & the DNS s/w starts reporting dropped packets & vector starts buffering a back-log.

Also, that's about 20GB of data for a five-minute test, so saving uncompressed isn't really an option, but if we re-enable compression, we're back to the 40K events/s seen above.

james-stevens commented 2 months ago

Screenshot 2024-06-27 at 12 21 25

Here's vector top and htop running on both DNS servers (130 & 131). We're sending 80K queries/s and writing the data uncompressed to local text files on the SSD RAID, so we'd expect about 80K events/s on both servers.

As you can see, both machines (but especially 131) are not able to write out the data (events/out) as fast as the dnstap source is producing events - highlighted with the red ring. Eventually this results in vector failing to read the dnstap data and the DNS s/w (dnsdist) reporting dropped frames on the dnstap.

However, you can also see nearly all the vector threads are in (S)leep mode, none of them are over 20% busy, and very few of the cores are more than 10% utilised. Also: "Tasks: ...., 0 running" on both servers.

Also NOTE: on RHEL9, in vector top, events/in for dnstap is N/A; this is NOT what you see with timberio/vector:0.39.0-debian, which shows a value instead .... maybe the build process for vector on RHEL9 has problems?

james-stevens commented 2 months ago

CONCLUSION

After extensive testing, the conclusion we have come to is that each vector sink has an internal hard-coded events/s limit, which is neither documented nor tuneable, and which vector will impose regardless of the capabilities of the underlying hardware it is running on, or sinking to.

While it may be that well-known public services have published limits on the rate at which they will accept events, making it useful for vector to be able to impose such limits, not documenting this, or making it visible to the user, still seems unnecessarily protective.

This assessment may not be correct, but this is definitely how it behaves.

sink                 limit (events/s)
blackhole            200K
file (uncompressed)  70K
file (compressed)    40K

The problem this causes us is that the back-pressure from this artificial limit is eventually fed back to the input, where the data is simply dropped. This is the normal behaviour for dnstap when the destination will not read the input fast enough. That would be fine if vector were working as hard/fast as the hardware allowed but, more often than not, it is operating NOWHERE NEAR the limits of our hardware (see above).

RECOMMENDATION

These limits should be both documented & tuneable.

jszwedko commented 2 months ago

Thanks for this investigation @james-stevens

has an internal hard-coded event/s limit, which is neither documented nor tuneable and is a limit vector will impose regardless of the capabilities of the underlying hardware it is running on, or sinking to

Note that there is no explicit rate limiting / throttling happening in sinks. I think what you are stumbling upon is that the sinks you are trying, in particular the file sink, are not as efficiently implemented as they could be. The blackhole sink seems like it should be able to achieve much higher throughput than you are seeing too: there are likely some inefficiencies in that sink as well.

Our general aim with Vector is to have most configurations run at at least 10 MB/s/core. It certainly seems like your configuration is running significantly below that. I think we would need to do some benchmarking and performance analysis to improve it. It is possible there is some bottleneck.

jszwedko commented 2 months ago

You could try playing with --threads. By default it is set to the number of available CPUs, but a higher or lower number may result in improved performance for you.

jszwedko commented 2 months ago

Something I noticed is that the file sink writes one event to the file at a time 😨 It seems like throughput would greatly benefit from batching the writes.

https://github.com/vectordotdev/vector/blob/def8b7f251ec60acc4ec673f5c6acc4065e99e18/src/sinks/file/mod.rs#L378-L399

jszwedko commented 2 months ago

Opened this issue to track adding batching to the file sink: https://github.com/vectordotdev/vector/issues/20784

james-stevens commented 2 months ago

Thank you for all your replies - they are really appreciated.

You could try playing with --threads. By default it is set to the number of available CPUs, but maybe a higher or lower number may result in improved performance for you.

We tried both higher & lower - higher definitely hurt performance, but not by a lot; lower made no difference.

It doesn't feel like inefficiencies in the code, cos we're not hitting any limits of our hardware, by a VERY LONG WAY.

It's also odd that the numbers are so consistent - no matter what we try - including splitting the system into two separate servers!!!

dnsdist->vector->blackhole (all on one server) gives pretty much the exact same through-put as dnsdist->vector->network->vector->blackhole - 2 servers, with the second server dedicated to just running vector, same spec as described above.

As you described, it's almost certainly back-pressure from the final sink just getting propagated back through the stack.

What we're trying now is splitting the original queries into 5 instances of dnsdist which will pass their data to their own separate instance of vector.

johnhtodd commented 2 months ago

@james-stevens Sorry if you've already covered this, but what happens if you have multiple dnsdist servers sending traffic at high volumes? What happens if you do an aggregation on input? Both of those things may give you some insight.

I'd suggest trying an aggregation first, since that would potentially invalidate the necessity of doing a multi-dnsdist input test. If aggregation can absorb 400kqps, then this points more towards backpressure from the sink as the root cause, rather than some dnsdist ingestion problem.

james-stevens commented 2 months ago

Sorry if you've already covered this, but what happens if you have multiple dnsdist servers sending traffic at high volumes? What happens if you do an aggregation on input?

It's a good ask. Interestingly, the thru-put on any one sink is the same - so the thru-put is limited by the number of sinks & the type of sink, i.e. 100K events/s for blackhole, or 200K/s for 2 x blackholes (in 2 separate vectors).

I am absolutely convinced the issue we are seeing is back-pressure in the sink propagating all the way back to the dnstap, where packets get dropped.

We also tried dnstap from PowerDNS Recursor with the same results, although Recursor has the ability to buffer on the dnstap output, so it can absorb some of the back-pressure for a while, but obv then drops packets once the buffer is full.

So what we did yesterday was run a two-layer dnsdist, with one dnsdist receiving the client queries, then immediately spreading them across 5 child dnsdist, each with their own individual vector. This takes the number of sinks from 1 to 5 & resolves the issue.
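Conceptually, each child vector is a separate process with its own config, listening on its own dnstap socket; a minimal sketch (ports and paths are illustrative):

# vector instance for child dnsdist N (separate process & config file)
sources:
  dnstap:
    type: dnstap
    mode: tcp
    address: 127.0.0.1:9053   # each child uses its own port, e.g. 9053..9057
    multithreaded: true

sinks:
  dnstap_to_file:
    type: file
    inputs: [ dnstap ]
    path: "/var/lib/vector/dnstap/N/%Y%m%d-%H%M.log"   # per-instance path, illustrative
    encoding:
      codec: json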

We got it recording uncompressed files to disk at 275K q/s (2 events per query - query & response), at about 275MB/s disk writes & ~35% CPU utilisation. Uncompressed files to disk got us ~70K/s before, so this is not quite 5 x 70, but close - obv there's some additional load from the new architecture.

There are also various tuning options we didn't look into, like the buffer sizes between the dnsdist instances & the number of threads for the child dnsdist & child vector - set to 2 & 10 respectively.

I suspect if we up the number of child dnsdist, we'll get more thru-put & I see no disadvantage to doing this. I also think changing the parent dnsdist load balancing policy from default to round-robin will spread the load across the child dnsdist better.

More General Assessment of the Issue

If vector is doing synchronous i/o (or calling sync()) that could be the cause of what we are seeing.

A normal write is in two parts - your code passes the data to the kernel & (when it feels like it) the kernel schedules time to write it out to disk. With a synchronous write, your write() doesn't return until the kernel has actually flushed the data to disk.

The effect is that your code writes slowly, but there is no obvious hardware limit being hit, as all that is happening is that you are waiting (in real time) for the kernel to get around to writing your data.

Sync I/O is commonly used to try & not lose data, but obv has drawbacks.

This seems to tie into what we are seeing - thru-put is limited, but the server is almost idle.

james-stevens commented 2 months ago

With 9 dnstap & vector sub-processes on each of two servers, saving locally to uncompressed files, we got up to 600K events/s (300K queries/s) before we saw dnstap frames reported dropped in the dnsdist log.

This is a small improvement over the previous limit of 275K we saw with 5 sets of sub-processes, but not as big a move as I would have hoped for.

We kept the 9 sub-processes and then enabled gzip and then zstd. With both we hit the same limit of 300K q/s (600K events/s) before seeing a very small number of dropped frames - a few hundred over the entire 5 min test, so ~0.001% - slightly fewer under zstd.

With gzip the CPU usage was extremely similar to uncompressed, but with zstd the CPU usage was about 10% lower. We saw 16:1 compression for gzip and 18.8:1 for zstd.

With zstd the total CPU usage was about 37%, RAM was 8GB used, disk i/o was about 7MB/s & context switches were just under 1.2M/s - we've seen higher values in all these metrics, so it's hard to know why it started dropping frames at this level.

It's also really odd that the limit of 300K q/s (600K events/s) seems to be the same for uncompressed, gzip & zstd files, and that switching from 5 to 9 sub-process pairs made so little difference.

For gzip at 300K q/s we saw some buffering in vector (we set a limit of 1M events in each sub-process), but for zstd there was almost no buffering.
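For these runs the only difference between uncompressed, gzip and zstd was the file sink's compression setting, e.g. (path extension illustrative):

sinks:
  dnstap_to_file:
    type: file
    inputs: [ dnstap ]
    path: "/var/lib/vector/dnstap/%Y/%m/%d/%H/dnstap_%Y%m%d-%H%M.log.zst"
    compression: zstd        # or gzip / none
    encoding:
      codec: json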

So once again, we've reached the point where we seem to have spare hardware capacity, but vector isn't reading the dnstap frames fast enough, so we're dropping frames and it's not clear why. But 300K q/s is within spec for what we are trying to achieve, so we're not going to put any more time into investigating this, except to test saving to S3 instead of local files.

For those unfamiliar: dnsdist is a DNS proxy for load balancing, failover & anti-D/DoS - kinda like haproxy, but for DNS!

johnhtodd commented 2 months ago

The limit of 600K events/s seems to indicate you've hit some other bottleneck on the system that is outside of Vector. However, the tests you've done do seem to imply that Vector is somehow the bottleneck, until you hit that system-level limit. What do you get with this new test model sending to blackhole instead of to a file? You got 2x last time - do you get 5x with blackhole? Does it go farther with more instances of dnsdist (which really means "with more sockets sending data")? In other words: what is the maximum ingestion rate that you think is possible to achieve, versus the maximum sink rate? If you can/can not get past the 600K events/s rate with blackhole, then maybe that gives more insight as to where at least one of the bottlenecks is. This probably doesn't address the root of the problems (sink backpressure) but it's probably worth doing.

More concerning is that even with your experiment with 2x blackhole sinks, you only reached 200K events/s. (or was that 100K number a hypothetical? Was that the actual volume you reached?) There should be no backpressure on a blackhole sink, or at least that would be my assumption (which could be false.)

james-stevens commented 2 months ago

which really means "with more sockets sending data")?

It's NOT the number of senders - with 1 dnstap sender to 1 blackhole we get the exact same rate at the sink as with 2 senders to 1 blackhole. The issue is a limit on getting the data out at the sink causing back-pressure, which eventually propagates back to the dnstap, where frames get dropped .... which I suspect is caused by sync i/o at the sink, but that's a wild guess - it just looks a lot like a similar issue I had with a high data-volume application.

So it's not more instances of dnsdist that is the solution as such, but more instances of vector - but, if you're using dnsdist as the dnstap source, you can only achieve more vector destinations by also having more instances of dnsdist, as dnsdist only makes a SINGLE dnstap socket connection shared by all its threads.

or was that 100K number a hypothetical

all numbers quoted are measured using Prometheus+Grafana :)

With blackhole we achieved 100K per blackhole - we've not tried blackhole with the multi-sub-process set-up, as the business priority is to get to the target rate then move on to the next issue. For this application 300K q/s (600K events/s) with zstd was the target & we've achieved it, so any further experimentation is prob not going to happen. The next step is saving to S3 rather than local SSD/RAID. First I'm going to try going direct to S3; dumping to file, then getting vector to scrape the files & put them in S3, could be a viable fall-back position.

There should be no backpressure on a blackhole sink

In theory, but @jszwedko does say it's known that "unlimited" doesn't really mean that :)

I remember a vague issue about it not swallowing data as fast as it potentially could

Sadly, our problems are somewhat of our own making in that, had we been running dnsdist+vector in a VM or container, we'd just have run more of them and this would have been a huge non-issue - but the strategic decision was to run the applications directly in the o/s on bare metal, severely restricting any flexibility. But all that is WAY above my pay grade ;)

I'm also not going to be on this project for much longer as I'm changing jobs in a couple of weeks to go & work for ICANN.

james-stevens commented 2 months ago

This is kinda cute - these are the tests running 9 dnstap+vector pairs on each of two servers, up to 300K q/s - so 600K events/s. The numbers are stacked, so we can easily see the total.

Screenshot 2024-07-05 at 11 19 40

From left to right - 300K q/s saving to uncompressed files on the SSD/RAID, then 100K, 200K & 300K q/s saving to gzip local files, then 300K & 275K saving to zstd local files. The second 100K/gzip test was meant to have been 200K, so got aborted!

Top graph is events reported by vector's sink - vector_buffer_sent_events_total; bottom graph is written frames reported by dnsdist, using the metrics we added in Lua.

The saw-tooth in the vector graph is real - it saves the data to disk in big chunks every 30 seconds. We tried reducing idle_timeout_secs, but only cos its default was 30 secs - it really didn't look like it was the issue there.
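For anyone wanting to try the same, that was just the file sink's idle_timeout_secs option (the value here is illustrative):

sinks:
  dnstap_to_file:
    type: file
    inputs: [ dnstap ]
    path: "/var/lib/vector/dnstap/%Y/%m/%d/%H/dnstap_%Y%m%d-%H%M.log.gz"
    compression: gzip
    idle_timeout_secs: 10   # default is 30
    encoding:
      codec: json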

When you look at the disk i/o on the server, you can really see the 30s spikiness.

Screenshot 2024-07-05 at 11 31 28

If it flushed to disk more often / more evenly, that would prob help with thru-put - tho that could also be an O/S-level issue on RHEL9. RHEL has always seemed to perform like this, i.e. you save a ton to disk & it seems to happen immediately, then a few seconds later your entire system locks up as it gets flushed out.

Then network & CPU during those tests:

Screenshot 2024-07-05 at 11 35 10

NOTE: the load balancer doesn't always distribute the load completely evenly - this is probably cos the test data is coming from a small(er) number of sources than we'd get in the real world.

james-stevens commented 1 month ago

A good example of the oddity of this issue was when we sent the data dnsdist->vector->AWS-S3, with both dnsdist & vector running on the same server: with a single dnsdist sending dnstap to a single vector we achieved 300K events/s, dropping <0.01% of the dnstap frames. The S3 sink def seems to be able to sink at a pretty decent rate.

However, when we sent the data dnsdist->vector->vector->AWS-S3, we couldn't save more than about 60K events/s before we got back-pressure & saw frames dropped at the dnstap interface. In this set-up dnsdist & its vector were running on one server (with 64 cores) and the other vector was running on its own server (another 64 cores) with a 25Gb/s link - so in total, double the hardware resources of the first scenario!

As previously, in the second scenario there was no hardware limitation - CPU was <3% used on both servers.

Basically the buffers on the dnsdist server filled pretty quickly, until it was full (set to 3M events), then it just dropped frames at the dnstap interface.

dnsdist -> vector uses the dnstap source; vector -> vector uses the vector sink & source.
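For clarity, the direct-to-S3 variant just swaps the file sink for the aws_s3 sink, along these lines (bucket, prefix and region are illustrative):

sinks:
  dnstap_to_s3:
    type: aws_s3
    inputs: [ dnstap ]
    bucket: "example-dnstap-archive"
    key_prefix: "dnstap/%Y/%m/%d/%H/"
    region: "eu-west-1"
    compression: gzip
    framing:
      method: "newline_delimited"
    encoding:
      codec: json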

jszwedko commented 1 month ago

Interesting. Did you try tweaking the concurrency of the vector sink (request.concurrency)?
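i.e. something like this on the vector sink (value illustrative - it takes an integer or "adaptive"):

sinks:
  to_aggregator:
    type: vector
    inputs: [ dnstap ]
    address: "aggregator.example.net:6000"   # illustrative
    request:
      concurrency: 16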

james-stevens commented 1 month ago

Interesting. Did you try tweaking the concurrency of the vector sink (request.concurrency)?

Yes - and --threads, but it didn't change the behaviour at all.

We actually tried tuning quite a wide range of things, but the only tuning action we found that improved things was increasing the kernel options rmem_default, rmem_max, wmem_default & wmem_max, and even they generally made percentage differences (say a 15% to 20% increase) rather than orders of magnitude.

Apart from simply using a different sink, the only thing that really increased throughput was to split the traffic across multiple instances of vector, all running on the same server, as described above.

BTW: I now no longer work at the company experiencing this issue, so I will not be doing any more work on it :) - but from 1st Aug I will be working at ICANN, and they record all the DNS traffic from their ROOT server, so maybe I will end up revisiting this, but wearing a different hat - you never know :)