vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev

DNSTAP socket shows errors after operating for some time #20744

Open johnhtodd opened 5 months ago

johnhtodd commented 5 months ago


Problem

After a period of time, a heavily-loaded DNSTAP ingestion system shows socket errors on connected sockets that are transmitting DNSTAP data to it. Uncertain if this is a reporting error or a real error.

Configuration

In this config, I use environment variables for several values:
VECTORIPV4=10.10.3.100
VECTORDNSTAPPORT=59001

  main_dnstap_1:
    type: "dnstap"
    mode: tcp
    address: "${VECTORIPV4:?err}:${VECTORDNSTAPPORT:?err}"
    permit_origin: ["${VECTORIPV4NETWORK:?err}", "127.0.0.1/32"]
    lowercase_hostnames: true
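
For readers unfamiliar with Vector's ${VAR:?err} interpolation, the source above resolves at runtime to roughly the following. VECTORIPV4NETWORK is not shown in the report, so the CIDR below is a hypothetical placeholder rather than the actual value:

  main_dnstap_1:
    type: "dnstap"
    mode: tcp
    address: "10.10.3.100:59001"                      # VECTORIPV4:VECTORDNSTAPPORT
    permit_origin: ["10.10.3.0/24", "127.0.0.1/32"]   # first CIDR is a placeholder, not from the report
    lowercase_hostnames: true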

Version

vector 0.39.0 (x86_64-unknown-linux-gnu)

Debug Output

This is not a crash, so no backtrace is included, though I can attempt one if desired.

2024-06-24T15:18:42.687252Z ERROR source{component_kind="source" component_id=main_dnstap_1 component_type=dnstap}:connection{peer_addr=10.10.3.232:35402}:connection: vector::internal_events::tcp: TCP socket error. error=bytes remaining on stream peer_addr=10.10.3.232:35402 error_type="connection_failed" stage="processing" internal_log_rate_limit=true
2024-06-24T15:18:42.687482Z ERROR source{component_kind="source" component_id=main_dnstap_1 component_type=dnstap}:connection{peer_addr=10.10.3.232:35402}:connection: vector::internal_events::tcp: Internal log [TCP socket error.] is being suppressed to avoid flooding.

Example Data

No response

Additional Context

This particular system I am testing on has two high-volume (>14k qps each) ingestion streams and two low-volume streams (~30 qps each), connected to two different dnsdist instances and feeding two different contexts. After some period of time, errors on the high-volume context that is importing from the dnstap source jump from zero to around 180,000 per second, which doesn't make sense: how can there be more errors than ingested events? (I'm graphing with "irate(vector_component_errors_total[5m])" in Prometheus/Grafana.) I suspect this is one of the two server sockets showing signs of the problem. Then, after a random number of hours (often measured in days), the number of errors jumps to around double that amount. (See the graph below, which shows one stream as "bad" for several days, then a spike when the other large stream starts showing errors.)

Strangely, I see no evidence of this increased error rate in any of the downstream components, either in their graph data (I graph pretty much everything coming out of Vector) or in the actual output generated at the end of the pipelines. Are these errors real? The debug messages certainly seem to indicate that there is a problem.
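
For anyone reproducing the graphs above, a minimal sketch of exposing Vector's internal counters to Prometheus is shown below; the reporter's actual metrics pipeline is not shown in the issue, so the component names and listen address here are assumptions:

  # under sources:
  internal_metrics_source:
    type: internal_metrics

  # under sinks:
  prometheus_exporter_sink:
    type: prometheus_exporter
    inputs: ["internal_metrics_source"]
    address: "0.0.0.0:9598"

With something like this in place, the irate(vector_component_errors_total[5m]) query can also be broken down by the same fields that appear in the error log above (component_id, error_type, stage).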

Other items of note: Reloading vector does not cure the problem. Even more unusually, "systemctl restart vector" also does not cure the problem. Only "systemctl stop vector;systemctl start vector" causes the error graph to drop to zero and the error messages to stop being generated.

This machine is fairly heavily loaded, performing only Vector tasks (>85% utilization in htop across all cores at some parts of the day).

There are other DNSTAP parsing errors seen sporadically, but they seem to be related to malformed DNS packets or elements in the DNSTAP message that are not yet parsed fully. I did not include those log lines.

I have other protobuf sockets operating on this system that are fairly busy (four at ~3kqps each) but which are not DNSTAP. I also have many v6 kafka streams as sinks.

[Screenshot: error-rate graph referenced above, 2024-06-24 8:38 AM]

References

No response

james-stevens commented 5 months ago

Hey John, what's the dnstap source?

error=bytes remaining on stream sounds to me like vector thought there were errors in the frame format.

johnhtodd commented 5 months ago

> Hey John, what's the dnstap source?
>
> error=bytes remaining on stream sounds to me like vector thought there were errors in the frame format.

The source is dnsdist/dnstap (latest release). It may be the case that dnsdist/dnstap sends an error of some sort, but that should be only for one event. And vector goes many days without falling into this error condition, with the same load all the time. It would be highly unexpected for dnsdist/dnstap to suddenly shift to an "all-errors" modality, since there is no backwards data sent from vector to dnsdist - it should be just "ack" packets. Also, the fact that the stats don't seem to change, just the error counters rise, seems unusual and points away from dnstap being the source of the issue. Lastly, there is no increase in packet or byte count between the DNSTAP origin and this vector host, as indicated by my node graphs, so I don't think this is a dnsdist-side issue, or at least not the cause of the ongoing large jump in error counters.

johnhtodd commented 5 months ago

Update: In my last event instance today, "systemctl restart vector" did clear the error condition.

esensar commented 5 months ago

Looking at the code, I think @james-stevens is right - it comes from the frame reader. The odd thing to me here is that there is no effect other than a higher error count, since this should pretty much stop any events from flowing - error frames are just skipped.

james-stevens commented 5 months ago

We've been running dnsdist->vector using the dnstap source and had no framing issues up to 200K events/s - the problem we have is that we've not been able to find any vector sink that can consume the data at anywhere near that rate, except blackhole!!! https://github.com/vectordotdev/vector/issues/20739
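
For reference, a blackhole sink like the one mentioned above takes only a few lines - this is a sketch, not the actual test config from #20739, and the input name is borrowed from the source defined earlier in this issue:

  # under sinks:
  test_blackhole:
    type: blackhole
    inputs: ["main_dnstap_1"]
    print_interval_secs: 10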

So what we get is back-pressure at the sink, which causes dropped frames at the dnsdist end of the dnstap socket. dnsdist will deliberately drop frames instead of blocking as that would disrupt the DNS service. But it doesn't cause framing issues.

dnsdist only makes a single dnstap connection to vector for all its threads - whereas PowerDNS Recursor makes one dnstap connection per thread - obviously the latter makes scaling much easier!!

Our servers have plenty of spare capacity and none of the vector threads are over 20% CPU, so it's really not clear where the bottleneck is - but that's a different problem!!!

esensar commented 5 months ago

Thanks @james-stevens. Do you have a way of testing dnstap in unix mode? Just to help pinpoint the issue. If it only occurs in tcp mode then this and #20739 might be related.
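
For comparison, a unix-mode variant of the dnstap source would look roughly like this (a sketch; the socket path is an assumption, not taken from either deployment):

  main_dnstap_unix:
    type: "dnstap"
    mode: unix
    socket_path: "/run/vector/dnstap.sock"
    lowercase_hostnames: true

dnsdist would then need its dnstap logger pointed at that socket instead of the TCP address.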

johnhtodd commented 5 months ago

Another point I haven't mentioned: the number of errors, once started, seems to be remarkably "flat" despite the number of events being sent to the DNSTAP source fluctuating quite a bit, at least according to what is being recorded by the Prometheus metrics in Vector. We shift from 16k to 26k events per second being sent by dnsdist to the DNSTAP consumer, but the error rate shown in the graphs above stays almost exactly at 18.2k errors per second - the values are not as important as the fact that there is no variation in the error rate despite the variation in the dnstap message transmission rate. Just as weird, the rate of events per second downstream from the dnstap ingestion remap fluctuates normally with volume, so the rest of the system does not "see" the errors that are showing up in vector_component_errors_total.

james-stevens commented 5 months ago

What metric(s) are you looking at in that graph?

We're not getting the metric vector_component_errors_total, which I suspect is because we have expire_metrics_secs set to five minutes - so if we're seeing no errors, that metric could just get removed.

What I check is that vector_buffer_sent_events_total for the appropriate sink matches what I expect.

johnhtodd commented 5 months ago

Yes, the metric is "vector_component_errors_total". Our expire_metrics_secs is set to 22265 (yes, very long, because we have metric aggregations that run for >6 hours in some cases).
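
For context, expire_metrics_secs is a top-level (global) option, so the two setups in this exchange differ only in that one value - a sketch using the numbers quoted in this thread:

# top level of the Vector config
expire_metrics_secs: 300     # roughly the five minutes mentioned by james-stevens
# expire_metrics_secs: 22265 # johnhtodd's setting, kept long for >6-hour metric aggregations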

james-stevens commented 5 months ago

I can only see errors related to scraping metrics when under load - we use vector to aggregate all the metrics on the server so Prometheus doesn't need to be reconfigured every time we change things about.

We also had a few "disk full" incidents when running 300K q/s tests for five mins. 350 MB/s of disk writes adds up fast.

The way we look for dropped frames is to get dnsdist to log - it's one of the items it logs.

We then wrote some Lua to parse the log line and convert it into three new dnsdist metrics - dnstap frames sent, dropped and failed. No idea what the difference is between those last two, but we don't see any failed, just dropped when the load gets too high.