hashbang / admin-tools

Ansible playbooks and other admin tools used to administrate #! servers
MIT License

irc.hashbang.sh geoDNS setup is unreliable #70

Open KellerFuchs opened 7 years ago

KellerFuchs commented 7 years ago

We currently have an outage where lon1.irc.hashbang.sh fails all TLS handshakes. All users in Europe are only sent a record for lon1.

KellerFuchs commented 7 years ago

PS: That would be way easier if IRC had SRV support, but if wishes were fishes, ...

mayli commented 6 years ago

Could we respond with both irc servers sorted by geoDNS? Dunno if it's available or not.

KellerFuchs commented 6 years ago

@mayli Not sure what you mean by “sorted” in that case.

necrophcodr commented 6 years ago

You don't "sort" DNS anything. You just respond with what is needed. With DNS it's easy, because the geo-part is built in. Servers from EU only respond with European servers, US DNS only responds with US servers, simple as that. Using Route53 I'm sure allows for this?

What I mean is, why not just ask for irc.hashbang.sh, and let that entry be different depending on the geographical DNS locations. This way, a server from EU will respond faster in EU than a US server will in EU, hence only the EU entries are used by people located there.

RyanSquared commented 6 years ago

> What I mean is, why not just ask for irc.hashbang.sh, and let that entry be different depending on the geographical DNS locations. This way, a server from EU will respond faster in EU than a US server will in EU, hence only the EU entries are used by people located there.

That is what we currently do; however, we've come up with a few issues because of this, as was pointed out above. If a TLS certificate is invalid, the server must be removed from the DNS ~~queries~~ replies. If the server isn't actually alive, it also must be removed from the DNS ~~queries~~ replies.
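To make the removal rule concrete, here is a rough Python sketch of how a reply could be assembled. The region names, server names, addresses (taken from the 192.0.2.0/24 documentation range) and the `build_reply` helper are all made up for illustration; this is not how the Route53 setup is actually configured.

```python
# Hypothetical inventory: region -> {server name: address}. The addresses come
# from the documentation range and are not real hashbang servers.
SERVERS = {
    "eu": {"lon1": "192.0.2.10"},
    "us": {"nyc1": "192.0.2.20", "sfo1": "192.0.2.21"},
}


def build_reply(region, healthy, servers=SERVERS):
    """Return the addresses to hand out to a client located in `region`.

    `healthy` is the set of server names that passed the last health check
    (alive and presenting a valid TLS certificate).
    """
    local = [addr for name, addr in servers.get(region, {}).items()
             if name in healthy]
    if local:
        return sorted(local)
    # Every local server is down: fall back to any healthy server
    # instead of answering with a dead one.
    return sorted(addr
                  for group in servers.values()
                  for name, addr in group.items()
                  if name in healthy)
```

With `lon1` marked unhealthy, a European client would get the US addresses rather than a record for a broken server.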

mayli commented 6 years ago

@KellerFuchs By "sorted" I mean that the entries in a DNS response come in some order in the "answer" field. In most cases DNS servers are implemented to return them in arbitrary order, to get some kind of DNS-level load balancing.

On the client side, it will usually try to connect to each entry in the order given in the response. With those two combined, we could have DNS-level HA: the faster server is the primary and the slower server is the backup.
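Roughly what I have in mind for the client side, as a Python sketch. `connect_in_order` and the `dial` callback are hypothetical names, and real IRC clients may well behave differently:

```python
def connect_in_order(addresses, dial):
    """Try each address in the order given and return the first that works.

    `dial` is a caller-supplied function that returns a connection or raises
    OSError. Returns (address, connection), or raises the last error seen.
    """
    last_error = OSError("no addresses to try")
    for address in addresses:
        try:
            return address, dial(address)
        except OSError as exc:
            last_error = exc  # remember the failure and try the next entry
    raise last_error
```

This only gives a primary/backup behaviour if the order of the addresses actually reaches the client unchanged.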

KellerFuchs commented 6 years ago

> why not just ask for irc.hashbang.sh, and let that entry be different depending on the geographical DNS locations

That was exactly what was in place. The issue was that the healthcheck, which was there to avoid sending users to a broken server, only checked that a TCP connection could be established; of course, when TLS broke, irc.hashbang.sh was suddenly broken for all of Europe...
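For illustration, the difference between the two kinds of check can be sketched in Python with the standard `socket` and `ssl` modules. The function names are made up, and this is not the check Route53 runs:

```python
import socket
import ssl


def tcp_check(host, port, timeout=5.0):
    """Pass if a TCP connection can be established (the weak check)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def tls_check(host, port, timeout=5.0):
    """Pass only if a TLS handshake with a valid certificate completes."""
    context = ssl.create_default_context()  # verifies chain and hostname
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return True
    except OSError:  # also covers ssl.SSLError and timeouts on Python 3.7+
        return False
```

A server whose TLS setup is broken still passes `tcp_check` while failing `tls_check`, which is exactly the outage described above. Usage would look like `tls_check("lon1.irc.hashbang.sh", 6697)`, assuming the conventional IRC-over-TLS port.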

KellerFuchs commented 6 years ago

@mayli Except that resource records are not ordered, or rather, quoting RFC 1034, 3.6, “the order of RRs in a set is not significant, and need not be preserved by name servers, resolvers, or other parts of the DNS”. In practice, many DNS resolvers randomize the order in an RRset, to prevent broken clients (cough Windows cough) from always hitting the “first” server.

The correct way to implement that would be SRV records (RFC 2782), but of course that's not a thing for IRC...
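For reference, here is a simplified Python sketch of how a client would order SRV targets under RFC 2782: lower priority values first, and a weighted random draw within one priority. The `order_srv` name and the tuple layout are assumptions for illustration, and the RFC's exact treatment of zero weights is only approximated:

```python
import random


def order_srv(records, rng=random):
    """Order SRV records roughly as an RFC 2782 client would try them.

    `records` is a list of (priority, weight, port, target) tuples.
    Lower priority values are tried first; within one priority, targets are
    drawn at random in proportion to their weight.
    """
    ordered = []
    for priority in sorted({r[0] for r in records}):
        group = [r for r in records if r[0] == priority]
        while group:
            total = sum(r[1] for r in group)
            if total == 0:
                pick = rng.choice(group)  # all weights zero: pick uniformly
            else:
                threshold = rng.uniform(0, total)
                running = 0
                for pick in group:
                    running += pick[1]
                    if running >= threshold:
                        break
            group.remove(pick)
            ordered.append(pick)
    return ordered
```

Here the primary/backup relationship is explicit in the record data, instead of depending on the order of an RRset.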

RyanSquared commented 6 years ago

14:27 hey Habbie you work with PowerDNS right?
14:27 i do
14:27 damn that was fast
14:28 if I wanted to have a GeoIP-based domain with live health checks, what would be the best way to do that?
14:28 Could I pack in cqueues and use cqueues in a checking mechanism?
14:29 in the auth luabackend you mean?
14:29 well i'm honestly not sure how lua integrates into it, but i'd assume so yes
14:29 assuming this is auth, your options are
14:29 - luabackend
14:29 - pipebackend
14:29 - remotebackend
14:30 luabackend has actual Lua states inside powerdns, and absolutely nothing happens in them except when a query comes in
14:30 which is not where you want to do your health checks because somebody is waiting for an answer
14:30 pipebackend and remotebackend integrate over pipes/sockets using either a simple line-based protocol or JSON (inside HTTP depending on choices you make)
14:30 in which case your end can do whatever the hell it wants as long as it responds over the socket
14:30 hm. alrighty.
14:31 so PowerDNS kinda acts like a frontend and then I can use a backend to form a response in the form of a Lua server?
14:31 yes
14:31 and you have to follow a few very simple rules
14:31 and powerdns will get all the DNS pain exactly right for you
14:31 I can do a cqueues async loop where the healthcheck runs every minute and still be able to send data across the socket
14:31 awesome :+1:
14:31 yes, that sounds good
14:32 so is it possible to set up this backend for just one subdomain, or would it apply for all to go to this backend?
14:32 the best short answer is 'run a separate pdns_server for this and put dnsdist in front to route queries'
14:33 alrighty
14:33 thank you for your time
14:33 using multiple backends in a single pdns_server is thorny, behaviour tends to subtly change between versions, so we don't recommend it
14:33 no problem
14:33 if you have more questions further down the road, OFTC #powerdns is welcoming and is not just me :)
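To give an idea of what the pipebackend side could look like, here is a minimal Python sketch (the discussion above was about Lua and cqueues; Python is used here only to illustrate). It assumes the tab-separated line format of pipebackend ABI version 1 as I understand it (`HELO`, `Q`, `DATA`, `END`, `FAIL`), which should be checked against the PowerDNS documentation. The zone name, server names and addresses are hypothetical, and SOA/NS handling is left out entirely.

```python
import sys

# Hypothetical data: the name served and candidate addresses
# (documentation range, not real servers).
QNAME = "irc.example.org"
CANDIDATES = {"lon1": "192.0.2.10", "nyc1": "192.0.2.20"}


def handle_line(line, healthy):
    """Turn one tab-separated request line into a list of response lines."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] == "HELO":
        return ["OK\thealth-aware backend sketch"]
    if fields[0] != "Q" or len(fields) < 6:
        return ["FAIL"]
    _, qname, qclass, qtype, qid, _remote = fields[:6]
    replies = []
    if qname.lower() == QNAME and qtype in ("A", "ANY"):
        for name, addr in sorted(CANDIDATES.items()):
            if name in healthy:  # only hand out servers that passed the check
                replies.append("\t".join(
                    ["DATA", qname, qclass, "A", "60", qid, addr]))
    replies.append("END")
    return replies


def serve(healthy, stdin=sys.stdin, stdout=sys.stdout):
    """Answer requests until PowerDNS closes the pipe.

    `healthy` is a set that a separate health-check thread or loop would keep
    up to date; that part is left out here.
    """
    for line in stdin:
        for reply in handle_line(line, healthy):
            stdout.write(reply + "\n")
        stdout.flush()
```

The point of the split is the one made in the log: answering a query only reads the `healthy` set, so nobody waits on a health check while a query is in flight.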

So, currently the best solution is:

RyanSquared commented 6 years ago

This could also be relevant: https://gist.github.com/ahupowerdns/1e8bfbba95a277a4fac09cb3654eb2ac

KellerFuchs commented 6 years ago

FYI, using PowerDNS for GeoDNS means that we point everyone at our own DNS server, which isn't great for latency or reliability.

OTOH, AWS supports SSL healthchecks, which ought to be enough.

RyanSquared commented 6 years ago

> OTOH, AWS supports SSL healthchecks, which ought to be enough.

But it also means relying on AWS. I figured we were hopefully going for something more "independent"? Testing such things on a local system would be harder without a builtin DNS setup.

KellerFuchs commented 6 years ago

Never mind, it seems AWS supports SSL-based healthchecks for ELB but not for Route53. What the actual fuck. :O

KellerFuchs commented 6 years ago

@RyanSquared In principle, I would love us to run our own DNS infra. However, that basically means relying on 3rd-party services for replicas, for reliability & latency reasons (I don't happen to have an anycast DNS network in my back pocket... yet :P), and the standard ways of doing that don't support GeoDNS (because that's not something standardized).

As far as I can tell, we can pick 2 out of 3 from:

Frankly, I would be quite OK dropping GeoDNS in favor of the first two, esp. given how limited Route53's builtin healthchecks are, but that definitely would be a longer-term project. Also, it would need to be discussed with the other admins, and I don't feel that's a discussion that belongs in this issue.

mayli commented 6 years ago

@KellerFuchs How does Freenode solve this problem?

KellerFuchs commented 6 years ago

By not doing GeoDNS.

RyanSquared commented 6 years ago

@mayli Freenode, Esper, and many other networks just have a set of records that point to all their servers, independent of location. Users who have an issue are advised to point their client at a specific server (or to pick one from a list of servers) that works best for them.

Not all servers might be listed (at the same time, or even in general) on the public interface, though. However, for our setup, it should be fine to just list them all. Plus, nothing against Freenode, but until recently their network management has been a bit clunky.

mayli commented 6 years ago

So, can we return all records as well? This seems like the simple & stupid solution that works without too much effort. And we'd better spend our bandwidth on more important stuff, like userdb and other things.

RyanSquared commented 6 years ago

Yes. That is the "default" way most DNS servers return multiple results for one name.

KellerFuchs commented 6 years ago

Yes, endless discussion about a thing that is currently a non-issue is indeed consuming bandwidth...

RyanSquared commented 6 years ago

In order to close this issue - is the DNS setup in general still an issue? If we add a server are we going to have GeoIP enabled for it? If so, how should we remove this configuration?