memhamwan / memhamwan.github.io


Get Authoritative DNS Back Up and Running #37

Closed turnrye closed 3 years ago

turnrye commented 3 years ago

MemHamWAN never fully deployed the HamWAN portal software in the way the developers in the PSDR group intended. The team’s most recent pass at running DNS aimed to be “cloud native”, and it relied on managing DNS directly with PowerDNS. Unfortunately, the infrastructure hosting that application crashed, and there was no backup. As a result, we have to rebuild our DNS server.

I’m going to suggest a few things to hopefully make this more resilient this time:

  1. Let’s deploy it initially as a single host; we can follow up with clustering in the next task.
  2. Let’s deploy it at LEB, where physical access is easiest and there are the fewest hops between our edge and the server; that way, any hidden problems in the net are unlikely to impact us.
  3. Let’s keep core network services like this on VMs rather than going cloud native. We can investigate migrating in the future, but the engineering effort required is too high given the urgency of getting DNS back up for the net.
  4. In the past this was a combined recursor and authoritative server; PDNS no longer supports that. Let’s just focus on authoritative DNS for now.
  5. Let’s use a MariaDB backend this time rather than PostgreSQL, just to be a bit more friendly to newcomers.

turnrye commented 3 years ago

“dns.leb.memhamwan.net” is now an Ubuntu 20 VM at 44.34.128.177, hosted on esxi1.leb.memhamwan.net. Step one is complete. Note that I did not add the “hamwan” user; for now only my personal account is on there. We will need to follow up to determine what standard we want to use for users, since our old one is... 5 years old. Going to track that in a separate issue.

turnrye commented 3 years ago

PowerDNS is now up and running. MariaDB is configured to not allow root logins except from localhost. A separate user named pdns was created for PowerDNS, with its own database, also named pdns.
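
For anyone repeating this step, here is a rough sketch of the kind of SQL involved, generated from Python so the names are easy to change. It is not the exact set of commands that was run: the helper name `pdns_bootstrap_sql` and the `CHANGE_ME` password are placeholders made up for illustration, and restricting the account to `localhost` assumes PowerDNS and MariaDB live on the same VM.

```python
def pdns_bootstrap_sql(db="pdns", user="pdns", host="localhost", password="CHANGE_ME"):
    """Return SQL statements that create a database and a user limited to one host.

    Hypothetical helper: the defaults mirror the names mentioned in this thread,
    and the password is a placeholder that must be replaced before use.
    """
    return [
        # Dedicated database for the PowerDNS backend tables.
        f"CREATE DATABASE IF NOT EXISTS `{db}`;",
        # Account that can only connect from the given host.
        f"CREATE USER IF NOT EXISTS '{user}'@'{host}' IDENTIFIED BY '{password}';",
        # Privileges scoped to the one database, nothing global.
        f"GRANT ALL PRIVILEGES ON `{db}`.* TO '{user}'@'{host}';",
    ]
```

Printing the joined list and piping it into the `mysql` client as root on the VM is one way to apply it; the PowerDNS schema itself still has to be loaded separately.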

turnrye commented 3 years ago

OK, I confirmed that the DNS server is running properly. I have added SOA, NS, and A records for “ns1” at this point. For now, I’m going to update our DNS glue records to point to 44.34.128.177; getting it back on anycast 44.34.132.1 will come in a later step.
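
One way to double-check a step like this is to query the server directly and confirm it answers authoritatively. The sketch below uses only the Python standard library and hand-encodes a DNS query per RFC 1035. The function names are made up for illustration, and it assumes the zone being served is `memhamwan.net`, which this thread does not state explicitly.

```python
import socket
import struct

QTYPE = {"A": 1, "NS": 2, "SOA": 6}  # record type codes from RFC 1035


def build_query(name, qtype, query_id=0x1234):
    """Encode a single-question DNS query with the recursion-desired bit off."""
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    # Root label terminator, then QTYPE and QCLASS (1 = IN).
    return header + labels + b"\x00" + struct.pack("!HH", QTYPE[qtype], 1)


def is_authoritative_answer(response, query_id=0x1234):
    """True if the reply matches our ID, has AA set, RCODE 0, and an answer."""
    if len(response) < 12:
        return False
    rid, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", response[:12])
    return (
        rid == query_id
        and bool(flags & 0x0400)      # AA: authoritative answer
        and (flags & 0x000F) == 0     # RCODE 0: no error
        and ancount > 0
    )


def check(server, name, qtype, timeout=3.0):
    """Send one UDP query to `server` and report whether it answered with AA."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name, qtype), (server, 53))
        response, _ = sock.recvfrom(4096)
    return is_authoritative_answer(response)
```

Calling `check("44.34.128.177", "memhamwan.net", "SOA")` from a machine that can reach the server should return `True` if the zone is being served; the same call with `"NS"` covers the NS records.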

turnrye commented 3 years ago

Making continued progress with this, but it looks like someone started trying to brute-force the ESXi host via SSH (doh, I should've disabled that last night after I finished downloading the image!)

As a result, I cannot get into the ESXi host at this point, and it seems like the VM has hung. I've blocked all traffic to the ESXi host at the edge and will come back in an hour, hopefully after things have calmed down and ESXi's "fail2ban" stops preventing me from logging in.

turnrye commented 3 years ago

OK, got back in thanks to that and disabled SSH on the ESXi host.

Noticed some entries in the syslog complaining:

Oct 18 21:04:27 dns multipathd[708]: sda: add missing path
Oct 18 21:04:27 dns multipathd[708]: sda: failed to get udev uid: Invalid argument
Oct 18 21:04:27 dns multipathd[708]: sda: failed to get sysfs uid: Invalid argument
Oct 18 21:04:27 dns multipathd[708]: sda: failed to get sgio uid: No such file or directory

I read an article saying you just need to add this to the host in ESXi, but it's not working:

disk.EnableUUID = TRUE

turnrye commented 3 years ago

Just tracked down why the VM keeps kernel panicking. It's because of this VMware bug. Looks like we need to upgrade ESXi. For now I've put their workaround in place.

turnrye commented 3 years ago

Finished setting up the PowerDNS-Admin UI using this tutorial. Going to set up SSL for nginx now.

turnrye commented 3 years ago

Added certbot, so the admin interface is now available via both HTTP and HTTPS.

https://dns.leb.memhamwan.net or https://ns1.leb.memhamwan.net
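
Since certbot-issued certificates are short-lived, a small expiry check can be handy as a sanity test that renewal keeps working. This is only a sketch using the Python standard library; the function names are made up for illustration, and it assumes the host is reachable on port 443 from wherever the script runs.

```python
import socket
import ssl
import time


def days_remaining(not_after, now=None):
    """Days until a certificate's notAfter string, e.g. 'Jan  5 09:34:43 2018 GMT'."""
    expires = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires - now) / 86400


def fetch_not_after(host, port=443, timeout=5.0):
    """Connect over TLS, verify the certificate, and return its notAfter field."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

For example, `days_remaining(fetch_not_after("dns.leb.memhamwan.net"))` gives the number of days left on the certificate served there; a value that keeps shrinking toward zero suggests renewal is not running.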

turnrye commented 3 years ago

Follow up is #38