johnhtodd opened this issue 5 months ago.
Hey John, what's the dnstap source? error=bytes remaining on stream sounds to me like vector thought there were errors in the frame format.
The source is dnsdist/dnstap (latest release). It may be the case that dnsdist/dnstap sends an error of some sort, but that should be only for one event. And vector goes many days without falling into this error condition, with the same load all the time. It would be highly unexpected for dnsdist/dnstap to suddenly shift to an "all-errors" modality, since there is no backwards data sent from vector to dnsdist - it should be just "ack" packets. Also, the fact that the stats don't seem to change while only the error counters rise seems unusual and points away from dnstap being the source of the issue. Lastly, there is no increase in packet or byte count between the DNSTAP origin and this vector host as indicated by my node graphs, so I don't think this is a dnsdist-side issue, or at least not the cause of the ongoing large jump in error counters.
Update: In my last event instance today, "systemctl restart vector" did clear the error condition.
Looking at the code, I think @james-stevens is right - it comes from the frame reader. The odd thing to me here is that there is no other effect than higher error count, since this should pretty much stop any events from flowing - error frames are just skipped.
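For illustration only, here is a minimal, hypothetical Rust sketch of the general pattern a length-prefixed frame reader can follow; the names read_frames and decode_dnstap are made up and this is not Vector's actual dnstap source code. It just shows why "error frames are just skipped" would normally imply missing events downstream: a frame that fails to decode only bumps an error counter and produces nothing.

```rust
// Hypothetical sketch -- NOT Vector's dnstap source. A frame that fails to
// decode bumps an error counter and is skipped, so a sustained error rate
// would normally mean frames (and therefore events) are being discarded.
use std::io::{self, Read};

fn read_frames<R: Read>(mut stream: R, mut on_event: impl FnMut(Vec<u8>)) -> io::Result<u64> {
    let mut errors = 0u64;
    loop {
        // Each frame is prefixed with a 32-bit big-endian length
        // (control frames and the frame-streams handshake are omitted here).
        let mut len_buf = [0u8; 4];
        if stream.read_exact(&mut len_buf).is_err() {
            break; // peer closed the stream
        }
        let len = u32::from_be_bytes(len_buf) as usize;

        let mut frame = vec![0u8; len];
        stream.read_exact(&mut frame)?;

        match decode_dnstap(frame) {
            Some(event) => on_event(event),
            None => errors += 1, // error frames are just skipped
        }
    }
    Ok(errors)
}

// Stand-in for the real protobuf decode step.
fn decode_dnstap(frame: Vec<u8>) -> Option<Vec<u8>> {
    if frame.is_empty() { None } else { Some(frame) }
}
```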
We've been running dnsdist -> vector using the dnstap source and had no framing issues up to 200K events/s - the problem we have is we've not been able to find any vector sink that can keep up with the data at anywhere near that rate, except blackhole!!! https://github.com/vectordotdev/vector/issues/20739
So what we get is back-pressure at the sink, which causes dropped frames at the dnsdist end of the dnstap socket. dnsdist will deliberately drop frames instead of blocking, as that would disrupt the DNS service. But it doesn't cause framing issues.
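To make the "drop instead of block" point concrete, here is a tiny illustrative Rust sketch (dnsdist itself is C++, so this is not its code, just the pattern): the producer uses a bounded queue with a non-blocking send, so when the consumer falls behind, frames are counted as dropped rather than stalling the producer, and the frames that do get through are intact - which is why back-pressure on its own shouldn't produce framing errors. The queue size and loop counts below are arbitrary.

```rust
// Illustrative sketch of "drop instead of block" under back-pressure.
use std::sync::mpsc::{sync_channel, TrySendError};

fn main() {
    // Small bounded queue standing in for the dnstap send buffer.
    let (tx, rx) = sync_channel::<Vec<u8>>(4);
    let mut dropped: u64 = 0;

    for i in 0..10u8 {
        let frame = vec![i];
        match tx.try_send(frame) {
            Ok(()) => {}
            // Queue full: the producer never blocks, it just drops the frame.
            Err(TrySendError::Full(_)) => dropped += 1,
            Err(TrySendError::Disconnected(_)) => break,
        }
    }

    // A slow consumer drains whatever made it into the queue intact;
    // frames are never corrupted by this, only missing.
    let delivered = rx.try_iter().count();
    println!("delivered={delivered} dropped={dropped}");
}
```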
dnsdist only makes a single dnstap connection to vector, for all its threads - whereas PowerDNS Recursor makes one dnstap connection per thread - obviously the latter makes scaling much easier!!
Our servers have plenty of spare capacity & none of the vector threads are over 20% CPU, so it's really not clear where the bottleneck is - but that's a different problem!!!
Thanks @james-stevens. Do you have a way of testing dnstap in unix mode? Just to help pinpoint the issue. If it only occurs in tcp mode then this and #20739 might be related.
Another point I haven't mentioned: the number of errors, once it starts, is remarkably "flat" despite the number of events being sent to the DNSTAP source fluctuating quite a bit, at least according to what is recorded by the prometheus metrics in Vector. We shift between 16k and 26k events per second being sent by dnsdist to the DNSTAP consumer, but the error rate shown in the graphs above stays almost exactly at 18.2k errors per second - the exact values matter less than the fact that there is no variation in the error rate despite the variation in the dnstap message transmission rate. Just as weird, the rate of events per second downstream from the dnstap ingestion remap fluctuates normally with volume, so the rest of the system does not "see" the errors that are showing up in the vector_component_errors_total graph.
what metric(s) are you looking at in that graph?
We're not getting the metric vector_component_errors_total, which I suspect is because we have expire_metrics_secs set to five mins - so if we're seeing no errors that metric could just get expired and removed.
What I check is that vector_buffer_sent_events_total matches what I expect for the appropriate sink.
Yes, the metric is "vector_component_errors_total". Our expire_metrics_secs is set to 22265 (yes, very long because we have metric aggregations that go for >6 hours in some cases.)
I can only see errors related to scraping metrics when under load - we use vector to aggregate all the metrics on the server so Prometheus doesn't need to be reconfigured every time we change things about.
We also had a few "disk full" incidents when running 300Kq/s tests for five mins. 350 MB/s of disk writes adds up fast.
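(For scale: 350 MB/s sustained over a five-minute test is roughly 350 × 300 ≈ 105 GB written per run, so it doesn't take many runs to fill a disk.)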
The way we look for dropped frames is to get dnsdist to log - it's one of the items it logs.
We then wrote some Lua to parse the log line & convert it into three new dnsdist metrics - dnstap frames sent, dropped & failed. No idea what the difference is between those last two, but we don't see any failed, just dropped when the load gets too high.
A note for the community
Problem
After a period of time, a heavily-loaded DNSTAP ingestion system shows socket errors on the connected sockets that are transmitting DNSTAP data to it. I'm uncertain whether this is a reporting artifact or a real error.
Configuration
Version
vector 0.39.0 (x86_64-unknown-linux-gnu)
Debug Output
Example Data
No response
Additional Context
This particular system I am testing on has two high-volume (>14kqps each) ingestion streams and two low-volume streams (~30qps each) connected to two different dnsdist instances and feeding two different contexts. After some period of time, errors will jump from zero to around 180,000 per second (which doesn't make sense? How can there be more errors than ingestion elements? I'm graphing with "irate(vector_component_errors_total[5m])" in Prometheus/Grafana) on the high-volume context that is importing from the dnstap source. I suspect this is one of the two server sockets showing signs of the problem. Then, after a seemingly random number of hours (often measured in days), the number of errors will jump to around double that amount. (See graph, which shows one stream being "bad" for several days, then a spike when the other large stream starts showing errors.)
Strangely, I see no evidence of this increased error rate on any of the downstream components, neither in their graph data (I graph pretty much everything coming out of Vector) nor in the actual output generated at the end of the pipelines. Are these errors real? The debug messages certainly seem to indicate that there is a problem.
Other items of note: Reloading vector does not cure the problem. Even more unusually, "systemctl restart vector" also does not cure the problem. Only "systemctl stop vector;systemctl start vector" causes the error graph to drop to zero and the error messages to stop being generated.
This machine is fairly heavily loaded, performing only Vector tasks (>85% utilization in htop across all cores during some parts of the day).
There are other DNSTAP parsing errors seen sporadically, but they seem to be related to malformed DNS packets or elements in the DNSTAP message that are not yet parsed fully. I did not include those log lines.
I have other protobuf sockets operating on this system that are fairly busy (four at ~3kqps each) but which are not DNSTAP. I also have many v6 kafka streams as sinks.
References
No response