scottmuc / infrastructure

Documentation / Automation for personal third-party infrastructure

Collect metrics for DNS #76

Closed scottmuc closed 1 month ago

scottmuc commented 1 month ago

I'd like to have a bit more visibility into the performance of my DNS setup. I'd use this information to tweak configuration (e.g. cache size).

Preliminary research has pointed me at:

scottmuc commented 1 month ago

Unbound Metrics

image

This is pretty cool so far, but I will look into whether extended statistics are worth enabling.

scottmuc commented 1 month ago

DNS Requests While I Was Sleeping

image

9000 queries while I slept is a bit disturbing. It shows how much background activity is going on just on the LAN!

scottmuc commented 1 month ago

dnsmasq Metrics

image

Seeing the number of DHCP leases change over time might be an interesting way to see how dynamic my LAN is. I am going to guess it's only sam, frodo, and sauron that slip between being online and offline.

image

I can also see DNS metrics here too. Anything under dnsmasq_misses will get forwarded to unbound. The other metrics can help me tune dnsmasq once I have more time to record data.

scottmuc commented 1 month ago

Grafana Dashboard Notes

Found a useful resource for an unbound dashboard. The dashboard I found originally isn't supported anymore and is based on a different exporter implementation.

I think this is a small enough context to attempt building my own dashboard. It's about time I learned how to make grafana dashboards.

While reading that repo, I was convinced that enabling the extended statistics will be useful for finding out which domains are requested most often.

scottmuc commented 1 month ago

Understanding Prometheus Counters

When dnsmasq or unbound restart, the related counter metrics reset to zero. It turns out I didn't quite understand how I should be using counter metrics.
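To make the reset behaviour concrete, here is a minimal Python sketch of a reset-aware increase over raw counter samples. The sample values are made up for illustration, and the function is a simplification: Prometheus's own `increase()` also extrapolates to the edges of the time window, which this sketch leaves out.

```python
from typing import Sequence


def counter_increase(samples: Sequence[float]) -> float:
    """Sum the growth of a counter across samples, treating any drop as a reset."""
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            # Normal case: the counter only grew between scrapes.
            total += curr - prev
        else:
            # The value went down, so the process restarted and the counter
            # started again from zero; everything up to curr is new growth.
            total += curr
    return total


# Hypothetical scrapes of a counter, with a restart after the third sample.
samples = [100, 150, 200, 10, 40]

naive = samples[-1] - samples[0]       # last minus first is misleading here
adjusted = counter_increase(samples)   # accounts for the restart

print(naive, adjusted)
```

Subtracting the first sample from the last goes negative across a restart, while the reset-aware sum keeps counting, which is why graphing the raw counter value is the wrong approach.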

scottmuc commented 1 month ago

Dnsmasq Monitored

Using the increase function (e.g.: increase(dnsmasq_hits{job="dnsmasq"}[$__range])), I can now specify the timespan in grafana and see the numbers match.

I'm not quite sure how to interpret the cache insertions and evictions data, but this post does try to explain it.

I'm also not sure what hit rate I'm aiming for and whether or not this is a product of my cache configuration (left at default 150).
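For reference, the hit rate itself is just hits divided by total lookups. The sketch below assumes the two inputs are the increases of `dnsmasq_hits` and `dnsmasq_misses` over the same time range; the numbers in the example are invented, not taken from my dashboard.

```python
def cache_hit_rate(hits: float, misses: float) -> float:
    """Return the fraction of lookups answered from the cache (0.0 to 1.0)."""
    total = hits + misses
    if total == 0:
        # No lookups in the window: report 0.0 rather than dividing by zero.
        return 0.0
    return hits / total


# Hypothetical increases over one dashboard time range.
rate = cache_hit_rate(hits=300, misses=700)
print(f"hit rate: {rate:.1%}")
```

Whether a given rate is "good" still depends on the traffic mix, so this only gives a number to compare before and after changing the cache size.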

image

scottmuc commented 1 month ago

Unbound Monitored

There's a richer set of metrics with unbound. It's interesting to see that unbound has to do an order of magnitude more queries because it has to perform the recursion itself.

I'm also curious why so many IPv6 resolutions are being performed. They seem to appear in spikes. Also, TIL about HTTPS records, which are defined in RFC 9460 (November 2023).

image

scottmuc commented 1 month ago

Definitely happy to call dnsmasq done:

image

scottmuc commented 1 month ago

Calling unbound done for now too:

image

scottmuc commented 1 month ago

Summary

This was a fun exercise and helped me better understand the tools for creating a dashboard. I've previously used off-the-shelf dashboards and never got too far into the guts of setting up the different types of visualizations. At the moment the dashboard is configured entirely by hand via the grafana UI. I'll need to at least save the JSON exports for safekeeping in case the USB stick the grafana DB is stored on dies.
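A rough sketch of how the JSON backup could be scripted against the Grafana HTTP API, using the search endpoint (`/api/search`) and the dashboard-by-UID endpoint (`/api/dashboards/uid/<uid>`). The base URL, the token, and the output directory are placeholders I made up, and I haven't run this against my instance, so treat it as a starting point rather than a finished backup job.

```python
import json
import urllib.request
from pathlib import Path

GRAFANA_URL = "http://localhost:3000"  # assumption: wherever grafana is listening
API_TOKEN = "REPLACE_ME"               # assumption: a service account token


def api_get(path: str, base_url: str = GRAFANA_URL, token: str = API_TOKEN):
    """GET a Grafana API path and return the decoded JSON body."""
    request = urllib.request.Request(
        base_url + path,
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def save_dashboard(payload: dict, out_dir: Path) -> Path:
    """Write the 'dashboard' part of an API response to <out_dir>/<uid>.json."""
    dashboard = payload["dashboard"]
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{dashboard['uid']}.json"
    # Sorted keys and fixed indentation keep diffs small if this goes into git.
    path.write_text(json.dumps(dashboard, indent=2, sort_keys=True) + "\n")
    return path


def export_all(out_dir: Path) -> list[Path]:
    """Export every dashboard the token can see."""
    hits = api_get("/api/search?type=dash-db")
    return [
        save_dashboard(api_get(f"/api/dashboards/uid/{hit['uid']}"), out_dir)
        for hit in hits
    ]


# Usage (needs a reachable grafana and a valid token):
# export_all(Path("dashboards"))
```

Writing one file per dashboard UID means the exports can live in this repo next to the rest of the configuration.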

Already, this has illuminated some details of the DNS traffic on my network.

I have a sneaking suspicion that my RIPE Atlas is responsible for many of these requests, but how many?