scottmuc / infrastructure

Documentation / Automation for personal third-party infrastructure
The Unlicense

Rebuild Pippin - Summer 2024 #79

Closed scottmuc closed 3 months ago

scottmuc commented 3 months ago

Yay for Repaving!

As much as possible is documented inline in this issue template. In case of problems you may find help by viewing all the previous repave issues. Have fun!

Things to do with the existing build

Post OS install steps on desktop

How Do I Know I Am Done?

scottmuc commented 3 months ago

Fail2Ban Issue

TASK [Ensure fail2ban is running] ************************************************************************************************************
fatal: [192.168.2.102]: FAILED! => {"changed": false, "msg": "Unable to start service fail2ban: Failed to start fail2ban.service: Connection timed out\nSee system logs and 'systemctl status fail2ban.service' for details.\n"}
ansible@pippin:~ $ sudo systemctl status fail2ban
× fail2ban.service - Fail2Ban Service
     Loaded: loaded (/lib/systemd/system/fail2ban.service; enabled; preset: enabled)
     Active: failed (Result: exit-code) since Sat 2024-08-24 06:21:24 BST; 3min 26s ago
   Duration: 686ms
       Docs: man:fail2ban(1)
    Process: 2003 ExecStart=/usr/bin/fail2ban-server -xf start (code=exited, status=255/EXCEPTION)
   Main PID: 2003 (code=exited, status=255/EXCEPTION)
        CPU: 664ms

Aug 24 06:21:23 pippin systemd[1]: Started fail2ban.service - Fail2Ban Service.
Aug 24 06:21:24 pippin fail2ban-server[2003]: 2024-08-24 06:21:24,283 fail2ban.configreader   [2003]: WARNING 'allowipv6' not defined in 'Def>
Aug 24 06:21:24 pippin fail2ban-server[2003]: 2024-08-24 06:21:24,372 fail2ban                [2003]: ERROR   Failed during configuration: Ha>
Aug 24 06:21:24 pippin fail2ban-server[2003]: 2024-08-24 06:21:24,385 fail2ban                [2003]: ERROR   Async configuration of server f>
Aug 24 06:21:24 pippin systemd[1]: fail2ban.service: Main process exited, code=exited, status=255/EXCEPTION
Aug 24 06:21:24 pippin systemd[1]: fail2ban.service: Failed with result 'exit-code'.
ansible@pippin:~ $ sudo systemctl status fail2ban
× fail2ban.service - Fail2Ban Service
     Loaded: loaded (/lib/systemd/system/fail2ban.service; enabled; preset: enabled)
     Active: failed (Result: exit-code) since Sat 2024-08-24 06:21:24 BST; 3min 49s ago
   Duration: 686ms
       Docs: man:fail2ban(1)
    Process: 2003 ExecStart=/usr/bin/fail2ban-server -xf start (code=exited, status=255/EXCEPTION)
   Main PID: 2003 (code=exited, status=255/EXCEPTION)
        CPU: 664ms

Aug 24 06:21:23 pippin systemd[1]: Started fail2ban.service - Fail2Ban Service.
Aug 24 06:21:24 pippin fail2ban-server[2003]: 2024-08-24 06:21:24,283 fail2ban.configreader   [2003]: WARNING 'allowipv6' not defined in 'Definition'. Using default one: 'auto'
Aug 24 06:21:24 pippin fail2ban-server[2003]: 2024-08-24 06:21:24,372 fail2ban                [2003]: ERROR   Failed during configuration: Have not found any log file for sshd jail
Aug 24 06:21:24 pippin fail2ban-server[2003]: 2024-08-24 06:21:24,385 fail2ban                [2003]: ERROR   Async configuration of server failed
Aug 24 06:21:24 pippin systemd[1]: fail2ban.service: Main process exited, code=exited, status=255/EXCEPTION
Aug 24 06:21:24 pippin systemd[1]: fail2ban.service: Failed with result 'exit-code'.

Reason for Failure

fail2ban failed during configuration because the sshd jail could not find a log file to watch ("Have not found any log file for sshd jail"). Given this is a fresh install, no logs exist yet.
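One possible way to avoid this on a fresh install is to make sure the log file exists before the service starts. This is a minimal sketch, not what the playbook does: the helper name `ensure_log_file` is hypothetical, and the path in the usage note is an assumption because the error message does not name the file.

```python
from pathlib import Path


def ensure_log_file(path):
    """Create an empty log file (and its parent directories) if it is missing.

    Returns True when the file had to be created, False when it already existed.
    """
    log = Path(path)
    if log.exists():
        return False
    log.parent.mkdir(parents=True, exist_ok=True)
    log.touch()
    return True
```

For example, `ensure_log_file("/var/log/auth.log")` (an assumed path) would be run as root before starting fail2ban.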

scottmuc commented 3 months ago

Unbound Issue

TASK [Ensure unbound is running] *************************************************************************************************************
**********************************************************************************************************************************************
fatal: [192.168.2.102]: FAILED! => {"changed": false, "msg": "Unable to start service unbound: Job for unbound.service failed because the control process exited with error code.\nSee \"systemctl status unbound.service\" and \"journalctl -xeu unbound.service\" for details.\n"}
2024-08-24T06:29:06.589123+01:00 pippin unbound[3447]: [1724477346] unbound[3447:0] error: can't bind socket: Cannot assign requested address for 192.168.2.10 port 53
2024-08-24T06:29:06.589592+01:00 pippin unbound[3447]: [1724477346] unbound[3447:0] fatal error: could not open ports
2024-08-24T06:29:06.595491+01:00 pippin systemd[1]: unbound.service: Main process exited, code=exited, status=1/FAILURE
2024-08-24T06:29:06.986871+01:00 pippin systemd[1]: unbound.service: Failed with result 'exit-code'.
2024-08-24T06:29:06.988055+01:00 pippin systemd[1]: Failed to start unbound.service - Unbound DNS server.
2024-08-24T06:29:06.989647+01:00 pippin systemd[1]: unbound.service: Consumed 1.410s CPU time.
2024-08-24T06:29:07.000277+01:00 pippin systemd[1]: unbound-resolvconf.service - Unbound asyncronous resolvconf update helper was skipped because of an unmet condition check (ConditionFileIsExecutable=/sbin/resolvconf).
2024-08-24T06:29:07.143311+01:00 pippin systemd[1]: unbound.service: Scheduled restart job, restart counter is at 1.
2024-08-24T06:29:07.147414+01:00 pippin systemd[1]: Stopped unbound.service - Unbound DNS server.
2024-08-24T06:29:07.147827+01:00 pippin systemd[1]: unbound.service: Consumed 1.410s CPU time.
2024-08-24T06:29:07.174238+01:00 pippin systemd[1]: Starting unbound.service - Unbound DNS server...
ansible@pippin:~ $ sudo netstat -lntp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:53              0.0.0.0:*               LISTEN      3096/dnsmasq
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      772/sshd: /usr/sbin
tcp6       0      0 :::53                   :::*                    LISTEN      3096/dnsmasq
tcp6       0      0 :::22                   :::*                    LISTEN      772/sshd: /usr/sbin

Reason for Failure

There are 2 reasons for this failure. One is that dnsmasq bound itself to port 53 on install. My custom configuration hasn't taken effect yet because the service needs to be restarted (usually happens at the end of a playbook run).

The second reason is that unbound is trying to bind to 192.168.2.10 and the device hasn't been set to that IP yet. At repave time, it's got a random IP assigned to it via DHCP.
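Both failure modes can be detected before the service is started by simply trying to bind the address. The sketch below is illustrative only; `can_bind` is a hypothetical helper, not part of the playbook. An address in use and an address not assigned to any interface both surface as an `OSError`, and the returned reason tells them apart.

```python
import socket


def can_bind(host, port, kind=socket.SOCK_DGRAM):
    """Try to bind host:port once and release it immediately.

    Returns (True, "") on success, otherwise (False, reason) where reason is
    the operating system's error text.
    """
    with socket.socket(socket.AF_INET, kind) as sock:
        try:
            sock.bind((host, port))
        except OSError as exc:
            return False, exc.strerror or str(exc)
    return True, ""
```

Run as root, `can_bind("192.168.2.10", 53)` would be expected to fail while the device still has its DHCP address, and to fail for a different reason while dnsmasq still holds port 53.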

scottmuc commented 3 months ago

Promtail Issue

TASK [Install packages] **********************************************************************************************************************
fatal: [192.168.2.102]: FAILED! => {"changed": false, "msg": "No package matching 'promtail' is available"}

This worked in https://github.com/scottmuc/infrastructure/commit/84ee8f90d0f150748d656caeb01a2f3f79b4c28e. My guess is that adding the Grafana apt repository made this package available to me, so I'll need to set that up earlier in the playbook.

scottmuc commented 3 months ago

Loki Issue

TASK [Install loki configuration] ************************************************************************************************************
fatal: [192.168.2.102]: FAILED! => {"changed": false, "checksum": "1dc5dc270f796d259c05b7889dfe29d4d507e0ef", "msg": "Destination directory /etc/loki does not exist"}

Also, despite the package being installed, there doesn't seem to be a service defined for it now:

root@pippin:/mnt/vcapstore/repos# systemctl start loki
Failed to start loki.service: Unit loki.service not found.

root@pippin:/mnt/vcapstore/repos# dpkg -l | grep loki
ii  loki                                 2.4.7.4-10                       arm64        MCMC linkage analysis on general pedigrees

scottmuc commented 3 months ago

Nginx Issue

Looks like some restructuring made it so that tasks/webserver.yml was never included!

scottmuc commented 3 months ago

DNS Issues

While trying to fix the unbound issue, I thought I would bind it to 0.0.0.0 and forward local names to 127.0.0.1. This seems problematic because there's a UDP port collision on 5353:

root@pippin:/var/log# netstat -lnp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 192.168.2.10:9153       0.0.0.0:*               LISTEN      430/dnsmasq_exporte
tcp        0      0 0.0.0.0:5353            0.0.0.0:*               LISTEN      670/dnsmasq
tcp        0      0 0.0.0.0:53              0.0.0.0:*               LISTEN      693/unbound
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      678/sshd: /usr/sbin
tcp6       0      0 :::9167                 :::*                    LISTEN      640/unbound_exporte
tcp6       0      0 :::4533                 :::*                    LISTEN      626/navidrome
tcp6       0      0 :::3000                 :::*                    LISTEN      756/grafana
tcp6       0      0 :::9100                 :::*                    LISTEN      629/node_exporter
tcp6       0      0 :::5353                 :::*                    LISTEN      670/dnsmasq
tcp6       0      0 :::22                   :::*                    LISTEN      678/sshd: /usr/sbin
udp        0      0 0.0.0.0:5353            0.0.0.0:*                           670/dnsmasq
udp        0      0 0.0.0.0:5353            0.0.0.0:*                           427/avahi-daemon: r
udp        0      0 0.0.0.0:54835           0.0.0.0:*                           427/avahi-daemon: r
udp        0      0 0.0.0.0:53              0.0.0.0:*                           693/unbound
udp        0      0 0.0.0.0:67              0.0.0.0:*                           670/dnsmasq
udp6       0      0 :::5353                 :::*                                670/dnsmasq
udp6       0      0 :::5353                 :::*                                427/avahi-daemon: r
udp6       0      0 :::45551                :::*                                427/avahi-daemon: r

The avahi-daemon looks to be related to Bonjour/mDNS, which uses the .local TLD. This is not a service I want running. It may have always been running before, but right now it is causing local name lookups to fail.
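Spotting this kind of overlap by eye gets harder as the listing grows. The following is a rough sketch, assuming the `netstat -lnp` column layout shown above; `find_collisions` is a hypothetical helper. It groups listeners by protocol and local address and reports any address claimed by more than one process. Sockets can share a UDP port when they opt in to address reuse, so a hit is a lead to investigate rather than proof of a fault.

```python
import re
from collections import defaultdict

# proto, recv-q, send-q, local address, foreign address, optional state, pid/program
LINE = re.compile(
    r"^(tcp6?|udp6?)\s+\d+\s+\d+\s+(\S+)\s+\S+\s+(?:LISTEN\s+)?(\d+)/(\S+)"
)


def find_collisions(netstat_output):
    """Map (proto, local_address) to the processes sharing it, if more than one."""
    owners = defaultdict(set)
    for line in netstat_output.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue  # header lines and anything unparseable are skipped
        proto, local, pid, program = match.groups()
        owners[(proto, local)].add((int(pid), program.rstrip(":")))
    return {key: sorted(procs) for key, procs in owners.items() if len(procs) > 1}
```

Fed the listing above, this would flag `0.0.0.0:5353` and `:::5353` as shared between dnsmasq and avahi-daemon.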

scottmuc commented 3 months ago

IPV6 Issues

The above comment shows that things are listening on IPv6 even though I explicitly disable it.

scottmuc commented 3 months ago

Git Issues

~/workspace/infrastructure/devices/pippin $ git push
Enumerating objects: 9, done.
Counting objects: 100% (9/9), done.
Delta compression using up to 12 threads
Compressing objects: 100% (5/5), done.
Writing objects: 100% (5/5), 1.12 KiB | 1.12 MiB/s, done.
Total 5 (delta 4), reused 0 (delta 0), pack-reused 0
error: remote unpack failed: unable to create temporary object directory
To git.scottmuc.com:infrastructure.git
 ! [remote rejected] main -> main (unpacker error)
error: failed to push some refs to 'git.scottmuc.com:infrastructure.git'

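This is because after a repave, the git user has a new UID. The files in the bare repository still carry the old UID, which now resolves to prometheus, and the old GID 988 no longer maps to any group: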

root@pippin:/mnt/vcapstore/repos/infrastructure.git# ls -la
total 40
drwxr-xr-x   7 git        git 4096 Aug  1 07:46 .
drwxr-xr-x   7 prometheus 988 4096 May 26 16:47 ..
drwxr-xr-x   2 prometheus 988 4096 May 26 16:47 branches
-rw-r--r--   1 prometheus 988   66 May 26 16:47 config
-rw-r--r--   1 prometheus 988   73 May 26 16:47 description
-rw-r--r--   1 prometheus 988   21 May 26 17:58 HEAD
drwxr-xr-x   2 prometheus 988 4096 May 26 16:47 hooks
drwxr-xr-x   2 prometheus 988 4096 May 26 16:47 info
drwxr-xr-x 253 prometheus 988 4096 Aug  1 07:46 objects
drwxr-xr-x   4 prometheus 988 4096 May 26 16:47 refs

scottmuc commented 3 months ago

Prometheus Issue

root@pippin:/mnt/vcapstore/repos/infrastructure.git# systemctl status prometheus
× prometheus.service - Prometheus
     Loaded: loaded (/etc/systemd/system/prometheus.service; enabled; preset: enabled)
     Active: failed (Result: exit-code) since Sat 2024-08-24 08:55:49 BST; 2h 15min ago
   Duration: 385ms
    Process: 7891 ExecStart=/opt/prometheus/live/prometheus --storage.tsdb.path=/mnt/vcapstore/prometheus --config.file=/opt/prometheus/prome>
   Main PID: 7891 (code=exited, status=2)
        CPU: 444ms

Aug 24 08:55:49 pippin systemd[1]: prometheus.service: Failed with result 'exit-code'.
Aug 24 08:55:49 pippin systemd[1]: prometheus.service: Scheduled restart job, restart counter is at 5.
Aug 24 08:55:49 pippin systemd[1]: Stopped prometheus.service - Prometheus.
Aug 24 08:55:49 pippin systemd[1]: prometheus.service: Start request repeated too quickly.
Aug 24 08:55:49 pippin systemd[1]: prometheus.service: Failed with result 'exit-code'.
Aug 24 08:55:49 pippin systemd[1]: Failed to start prometheus.service - Prometheus.

Sure enough, ownership on the persistent disk is incorrect:

root@pippin:/mnt/vcapstore# ls -la
total 60
drwxrwxrwx 12 root      root     4096 Jul 14 06:12 .
drwxr-xr-x  4 root      root     4096 Aug 24 06:44 ..
drwxr-xr-x  2 promtail  nogroup  4096 Aug 24 06:02 compactor
drwxr-xr-x  7 grafana   admin    4096 Aug 24 11:08 grafana
drwxr-xr-x  8 promtail  nogroup  4096 Jul  6 07:14 loki
drwx------  2 root      root    16384 Apr 20 16:20 lost+found
drwxr-xr-x  3 navidrome admin    4096 Aug 24 07:37 navidrome
drwxr-xr-x 26 git       admin    4096 Aug 24 06:02 prometheus
drwxr-xr-x  7 git       git      4096 May 26 16:47 repos
drwxr-xr-x  7 promtail  nogroup  4096 Jul 14 06:12 tsdb-shipper-active
drwxr-xr-x  2 promtail  nogroup  4096 Aug 19 16:12 tsdb-shipper-cache
drwxr-xr-x  3 promtail  nogroup  4096 Aug 24 06:02 wal

I think that, because of the issues during this repave, the users were created in a different order than normal and so received different UID/GIDs. As a result, ownership of some paths in /mnt/vcapstore has crossed wires: the prometheus directory, for example, is now owned by git.
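A quick audit after a repave can list everything under a directory that is not owned by the user it should belong to. This is a minimal sketch under stated assumptions: `find_foreign_owned` is a hypothetical helper, and it compares UIDs only (GIDs are ignored).

```python
import os


def find_foreign_owned(root, expected_uid):
    """Return the sorted paths under root (root included) not owned by expected_uid."""
    mismatched = []
    if os.lstat(root).st_uid != expected_uid:
        mismatched.append(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat so that symlinks are checked themselves, not their targets
            if os.lstat(path).st_uid != expected_uid:
                mismatched.append(path)
    return sorted(mismatched)
```

For instance, `find_foreign_owned("/mnt/vcapstore/prometheus", pwd.getpwnam("prometheus").pw_uid)` (after `import pwd`) would be expected to list the whole tree in the state shown above.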

scottmuc commented 3 months ago

Summary of Experience

The Fail2Ban Issue is an odd one, but could simply be a product of it being one of the first packages and services to be installed. It's working now, and I didn't really do anything except comment it out on the initial run and then add it back.

The DNS-related issues (one, two) were definitely due to my recent DNS work. I failed to take the repaving context into consideration. It's a tricky one because my LAN only has 1 DNS resolver. If I had 2, this setup wouldn't be so sensitive (I need functional DNS to perform the repave). I would like my unbound configuration to specify the static IP address, but the machine needs to have that IP... and in order for my machine to have the static IP, I need working DNS.

The Promtail Issue is one of those situations where I didn't realize that my code depended on actions performed in another tasks file. In this case, it was the addition of the Grafana apt repository.

The Loki Issue is a puzzle so far. I can't find a service definition. I'll need to cross-reference the installed version with the one that was running before (if I captured that information).

The Nginx Issue was again me failing to test things correctly after I made structural changes to the code. Including it back worked without any fuss.

Both the Git Issue and Prometheus Issue were a good reminder that I've had an unrecognized implicit dependency on the order in which I create users, because this determines their UID/GID assignments. Users were created in a different order this time around, so the UID/GIDs on the external USB disk (which doesn't get repaved since it has persistent data) no longer lined up with the UID/GIDs of the users that were recreated as part of the repave.

Stuff being bound to IPv6 isn't really an issue, but I don't quite understand why that's happening.

What To Do About This?

Splitting the device setup into a series of playbook runs might help reveal where dependencies really are. Currently, there are 3 playbooks:

The main-playbook.yml does a lot of things. All the handlers that restart services reveal this. If I need a service to be restarted for subsequent tasks to run, that seems like a good seam for a new playbook to be created.
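One way to find those seams is to write the implicit dependencies down and let a topological sort produce the run order. The sketch below is purely illustrative: every step name in `DEPENDS_ON` is hypothetical and only loosely mirrors the issues in this thread; none of them are real task or playbook names from the repository.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key lists the steps that must have finished before it can run.
# All names are hypothetical placeholders.
DEPENDS_ON = {
    "add_grafana_apt_repo": set(),
    "install_promtail": {"add_grafana_apt_repo"},
    "create_service_users": set(),
    "start_prometheus": {"create_service_users"},
    "restart_dnsmasq": set(),
    "assign_static_ip": set(),
    "start_unbound": {"restart_dnsmasq", "assign_static_ip"},
}


def run_order(depends_on):
    """Return the steps in an order that satisfies every declared dependency.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(depends_on).static_order())
```

A circular dependency, such as the static IP and DNS situation described earlier, would show up as a `CycleError` instead of failing halfway through a run.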

scottmuc commented 3 months ago

Repave Complete

Some of the issues have been manually resolved, but there is no implementation yet to prevent them in future repaves. Besides Loki, all the issues are self-made, due to the number of changes I've made in the last 3 months. A secondary DNS server would really simplify things because I wouldn't need to do a resolver dance when I switch from a DHCP-assigned network config to a static config.

Calling this complete. Will tinker away at the incomplete sub-tasks and reference this issue if/when they get resolved.

scottmuc commented 2 months ago

More Context on Loki Issues

I believe that, because I hadn't updated apt before attempting to install Loki, I ended up with a different package that is also named loki. A full reinstall got me the correct loki with a systemd unit and config file. Unfortunately, it appears that the package never created the loki user, and re-installation doesn't create it either. I created a loki user manually and the service is up and running again.