TASK [Ensure fail2ban is running] ************************************************************************************************************
fatal: [192.168.2.102]: FAILED! => {"changed": false, "msg": "Unable to start service fail2ban: Failed to start fail2ban.service: Connection timed out\nSee system logs and 'systemctl status fail2ban.service' for details.\n"}
ansible@pippin:~ $ sudo systemctl status fail2ban
× fail2ban.service - Fail2Ban Service
Loaded: loaded (/lib/systemd/system/fail2ban.service; enabled; preset: enabled)
Active: failed (Result: exit-code) since Sat 2024-08-24 06:21:24 BST; 3min 49s ago
Duration: 686ms
Docs: man:fail2ban(1)
Process: 2003 ExecStart=/usr/bin/fail2ban-server -xf start (code=exited, status=255/EXCEPTION)
Main PID: 2003 (code=exited, status=255/EXCEPTION)
CPU: 664ms
Aug 24 06:21:23 pippin systemd[1]: Started fail2ban.service - Fail2Ban Service.
Aug 24 06:21:24 pippin fail2ban-server[2003]: 2024-08-24 06:21:24,283 fail2ban.configreader [2003]: WARNING 'allowipv6' not defined in 'Definition'. Using default one: 'auto'
Aug 24 06:21:24 pippin fail2ban-server[2003]: 2024-08-24 06:21:24,372 fail2ban [2003]: ERROR Failed during configuration: Have not found any log file for sshd jail
Aug 24 06:21:24 pippin fail2ban-server[2003]: 2024-08-24 06:21:24,385 fail2ban [2003]: ERROR Async configuration of server failed
Aug 24 06:21:24 pippin systemd[1]: fail2ban.service: Main process exited, code=exited, status=255/EXCEPTION
Aug 24 06:21:24 pippin systemd[1]: fail2ban.service: Failed with result 'exit-code'.
The fail2ban process was expecting a specific log file path to exist. Given this is a fresh install, no logs exist yet.

[ ] Satisfy the fail2ban requirements or, continue gracefully
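The "continue gracefully" option could look something like this: a minimal sketch (the jail.d filename and settings are my guess at one way to do it, not what's in the playbook) that points the sshd jail at the systemd journal so it no longer needs a log file that doesn't exist yet on a fresh install.

```yaml
# Sketch only: drop a jail override so fail2ban reads sshd auth events from
# the systemd journal instead of a log file that may not exist yet.
- name: Point the sshd jail at the systemd journal
  ansible.builtin.copy:
    dest: /etc/fail2ban/jail.d/sshd.local   # hypothetical filename
    content: |
      [sshd]
      enabled = true
      backend = systemd
    mode: "0644"
```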
TASK [Ensure unbound is running] *************************************************************************************************************
fatal: [192.168.2.102]: FAILED! => {"changed": false, "msg": "Unable to start service unbound: Job for unbound.service failed because the control process exited with error code.\nSee \"systemctl status unbound.service\" and \"journalctl -xeu unbound.service\" for details.\n"}
2024-08-24T06:29:06.589123+01:00 pippin unbound[3447]: [1724477346] unbound[3447:0] error: can't bind socket: Cannot assign requested address for 192.168.2.10 port 53
2024-08-24T06:29:06.589592+01:00 pippin unbound[3447]: [1724477346] unbound[3447:0] fatal error: could not open ports
2024-08-24T06:29:06.595491+01:00 pippin systemd[1]: unbound.service: Main process exited, code=exited, status=1/FAILURE
2024-08-24T06:29:06.986871+01:00 pippin systemd[1]: unbound.service: Failed with result 'exit-code'.
2024-08-24T06:29:06.988055+01:00 pippin systemd[1]: Failed to start unbound.service - Unbound DNS server.
2024-08-24T06:29:06.989647+01:00 pippin systemd[1]: unbound.service: Consumed 1.410s CPU time.
2024-08-24T06:29:07.000277+01:00 pippin systemd[1]: unbound-resolvconf.service - Unbound asyncronous resolvconf update helper was skipped because of an unmet condition check (ConditionFileIsExecutable=/sbin/resolvconf).
2024-08-24T06:29:07.143311+01:00 pippin systemd[1]: unbound.service: Scheduled restart job, restart counter is at 1.
2024-08-24T06:29:07.147414+01:00 pippin systemd[1]: Stopped unbound.service - Unbound DNS server.
2024-08-24T06:29:07.147827+01:00 pippin systemd[1]: unbound.service: Consumed 1.410s CPU time.
2024-08-24T06:29:07.174238+01:00 pippin systemd[1]: Starting unbound.service - Unbound DNS server...
ansible@pippin:~ $ sudo netstat -lntp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:53 0.0.0.0:* LISTEN 3096/dnsmasq
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 772/sshd: /usr/sbin
tcp6 0 0 :::53 :::* LISTEN 3096/dnsmasq
tcp6 0 0 :::22 :::* LISTEN 772/sshd: /usr/sbin
There are 2 reasons for this failure. One is that dnsmasq bound itself to port 53 on install; my custom configuration hasn't taken effect yet because the service needs to be restarted (which usually happens at the end of a playbook run). The second reason is that unbound is trying to bind to 192.168.2.10 and the device hasn't been set to that IP yet. At repave time, it's got a random IP assigned to it via DHCP.
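The first problem could be handled by forcing pending handlers to run before unbound is started, roughly like this sketch (the task placement is my guess; the static IP problem still needs a separate answer):

```yaml
# Sketch only: run any queued handlers (e.g. a dnsmasq restart) now, so the
# config that moves dnsmasq off port 53 is in effect before unbound starts.
- name: Apply pending service restarts before starting unbound
  ansible.builtin.meta: flush_handlers
```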
TASK [Install packages] **********************************************************************************************************************
fatal: [192.168.2.102]: FAILED! => {"changed": false, "msg": "No package matching 'promtail' is available"}
This worked in https://github.com/scottmuc/infrastructure/commit/84ee8f90d0f150748d656caeb01a2f3f79b4c28e. My guess is that adding the grafana apt repository made this available to me. So I'll need to set that up earlier in the playbook.

[ ] Set up apt.grafana.com before attempting to install promtail using apt
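Setting the repository up earlier could look roughly like this (the key URL and repo line follow Grafana's published apt instructions as I remember them, so treat them as assumptions):

```yaml
# Sketch only: make sure the Grafana repository exists before any task
# tries to install promtail (or loki) from it.
- name: Ensure the apt keyrings directory exists
  ansible.builtin.file:
    path: /etc/apt/keyrings
    state: directory
    mode: "0755"

- name: Add the Grafana apt signing key
  ansible.builtin.get_url:
    url: https://apt.grafana.com/gpg.key
    dest: /etc/apt/keyrings/grafana.asc
    mode: "0644"

- name: Add the Grafana apt repository
  ansible.builtin.apt_repository:
    repo: "deb [signed-by=/etc/apt/keyrings/grafana.asc] https://apt.grafana.com stable main"
    state: present
    update_cache: true
```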
TASK [Install loki configuration] ************************************************************************************************************
fatal: [192.168.2.102]: FAILED! => {"changed": false, "checksum": "1dc5dc270f796d259c05b7889dfe29d4d507e0ef", "msg": "Destination directory /etc/loki does not exist"}
Also, despite the package being installed, there doesn't seem to be a service defined for it now:
root@pippin:/mnt/vcapstore/repos# systemctl start loki
Failed to start loki.service: Unit loki.service not found.
root@pippin:/mnt/vcapstore/repos# dpkg -l | grep loki
ii loki 2.4.7.4-10 arm64 MCMC linkage analysis on general pedigrees
Looks like some restructuring made it so that tasks/webserver.yml was never included!
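Wiring it back in should just be a matter of something like this (the surrounding play structure is assumed):

```yaml
# Sketch only: re-include the tasks file that was dropped during restructuring.
- name: Configure the web server
  ansible.builtin.include_tasks: tasks/webserver.yml
```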
While trying to fix the unbound issue, I thought I would bind it to 0.0.0.0 and forward local names to 127.0.0.1. This seems problematic because there's a UDP port collision on 5353:
root@pippin:/var/log# netstat -lnp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 192.168.2.10:9153 0.0.0.0:* LISTEN 430/dnsmasq_exporte
tcp 0 0 0.0.0.0:5353 0.0.0.0:* LISTEN 670/dnsmasq
tcp 0 0 0.0.0.0:53 0.0.0.0:* LISTEN 693/unbound
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 678/sshd: /usr/sbin
tcp6 0 0 :::9167 :::* LISTEN 640/unbound_exporte
tcp6 0 0 :::4533 :::* LISTEN 626/navidrome
tcp6 0 0 :::3000 :::* LISTEN 756/grafana
tcp6 0 0 :::9100 :::* LISTEN 629/node_exporter
tcp6 0 0 :::5353 :::* LISTEN 670/dnsmasq
tcp6 0 0 :::22 :::* LISTEN 678/sshd: /usr/sbin
udp 0 0 0.0.0.0:5353 0.0.0.0:* 670/dnsmasq
udp 0 0 0.0.0.0:5353 0.0.0.0:* 427/avahi-daemon: r
udp 0 0 0.0.0.0:54835 0.0.0.0:* 427/avahi-daemon: r
udp 0 0 0.0.0.0:53 0.0.0.0:* 693/unbound
udp 0 0 0.0.0.0:67 0.0.0.0:* 670/dnsmasq
udp6 0 0 :::5353 :::* 670/dnsmasq
udp6 0 0 :::5353 :::* 427/avahi-daemon: r
udp6 0 0 :::45551 :::* 427/avahi-daemon: r
The avahi-daemon looks to be related to Bonjour/mDNS, which uses the .local TLD. This is not a system I want to have. It may have always been running before. Right now this is causing local name lookups to fail.

[ ] Remove avahi
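Something like this sketch would stop it competing for UDP 5353 (assuming nothing else on the device needs mDNS; removing the avahi-daemon package entirely is the other option):

```yaml
# Sketch only: stop and disable both avahi units; the socket goes first so
# socket activation can't bring the service straight back.
- name: Stop and disable avahi-daemon
  ansible.builtin.systemd:
    name: "{{ item }}"
    state: stopped
    enabled: false
  loop:
    - avahi-daemon.socket
    - avahi-daemon.service
```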
The above comment shows that things are listening on IPv6 even though I explicitly disable it.
~/workspace/infrastructure/devices/pippin ? git push
Enumerating objects: 9, done.
Counting objects: 100% (9/9), done.
Delta compression using up to 12 threads
Compressing objects: 100% (5/5), done.
Writing objects: 100% (5/5), 1.12 KiB | 1.12 MiB/s, done.
Total 5 (delta 4), reused 0 (delta 0), pack-reused 0
error: remote unpack failed: unable to create temporary object directory
To git.scottmuc.com:infrastructure.git
! [remote rejected] main -> main (unpacker error)
error: failed to push some refs to 'git.scottmuc.com:infrastructure.git'
This is because after a repave, the git user has a new UID.
root@pippin:/mnt/vcapstore/repos/infrastructure.git# ls -la
total 40
drwxr-xr-x 7 git git 4096 Aug 1 07:46 .
drwxr-xr-x 7 prometheus 988 4096 May 26 16:47 ..
drwxr-xr-x 2 prometheus 988 4096 May 26 16:47 branches
-rw-r--r-- 1 prometheus 988 66 May 26 16:47 config
-rw-r--r-- 1 prometheus 988 73 May 26 16:47 description
-rw-r--r-- 1 prometheus 988 21 May 26 17:58 HEAD
drwxr-xr-x 2 prometheus 988 4096 May 26 16:47 hooks
drwxr-xr-x 2 prometheus 988 4096 May 26 16:47 info
drwxr-xr-x 253 prometheus 988 4096 Aug 1 07:46 objects
drwxr-xr-x 4 prometheus 988 4096 May 26 16:47 refs
[ ] Create the git user with a stable UID/GID or chown -R git:git /opt/vcapstore/repos
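The stable UID/GID option might look something like this (1050 is just a placeholder value, not one I've reserved anywhere):

```yaml
# Sketch only: pin the git user's UID/GID so ownership on the persistent
# disk still lines up after the user is recreated during a repave.
- name: Create the git group with a stable GID
  ansible.builtin.group:
    name: git
    gid: 1050   # placeholder value

- name: Create the git user with a stable UID
  ansible.builtin.user:
    name: git
    uid: 1050   # placeholder value
    group: git
```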
root@pippin:/mnt/vcapstore/repos/infrastructure.git# systemctl status prometheus
× prometheus.service - Prometheus
Loaded: loaded (/etc/systemd/system/prometheus.service; enabled; preset: enabled)
Active: failed (Result: exit-code) since Sat 2024-08-24 08:55:49 BST; 2h 15min ago
Duration: 385ms
Process: 7891 ExecStart=/opt/prometheus/live/prometheus --storage.tsdb.path=/mnt/vcapstore/prometheus --config.file=/opt/prometheus/prome>
Main PID: 7891 (code=exited, status=2)
CPU: 444ms
Aug 24 08:55:49 pippin systemd[1]: prometheus.service: Failed with result 'exit-code'.
Aug 24 08:55:49 pippin systemd[1]: prometheus.service: Scheduled restart job, restart counter is at 5.
Aug 24 08:55:49 pippin systemd[1]: Stopped prometheus.service - Prometheus.
Aug 24 08:55:49 pippin systemd[1]: prometheus.service: Start request repeated too quickly.
Aug 24 08:55:49 pippin systemd[1]: prometheus.service: Failed with result 'exit-code'.
Aug 24 08:55:49 pippin systemd[1]: Failed to start prometheus.service - Prometheus.
Sure enough, permissions on the persistent disk are incorrect:
root@pippin:/mnt/vcapstore# ls -la
total 60
drwxrwxrwx 12 root root 4096 Jul 14 06:12 .
drwxr-xr-x 4 root root 4096 Aug 24 06:44 ..
drwxr-xr-x 2 promtail nogroup 4096 Aug 24 06:02 compactor
drwxr-xr-x 7 grafana admin 4096 Aug 24 11:08 grafana
drwxr-xr-x 8 promtail nogroup 4096 Jul 6 07:14 loki
drwx------ 2 root root 16384 Apr 20 16:20 lost+found
drwxr-xr-x 3 navidrome admin 4096 Aug 24 07:37 navidrome
drwxr-xr-x 26 git admin 4096 Aug 24 06:02 prometheus
drwxr-xr-x 7 git git 4096 May 26 16:47 repos
drwxr-xr-x 7 promtail nogroup 4096 Jul 14 06:12 tsdb-shipper-active
drwxr-xr-x 2 promtail nogroup 4096 Aug 19 16:12 tsdb-shipper-cache
drwxr-xr-x 3 promtail nogroup 4096 Aug 24 06:02 wal
I think because of the issues during this repave, the UID/GIDs of the users that get created were assigned in a different order than normal. This resulted in the permissions of some paths in /mnt/vcapstore having some crossed wires.

[ ] Like the git issue, use a stable UID or require an instruction to bring permissions back
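The "bring permissions back" instruction could be as simple as a recursive ownership fix per directory, for example (the prometheus owner/group pairing is what I'd expect, not something the playbook currently enforces):

```yaml
# Sketch only: reassert ownership on the persistent disk after a repave.
- name: Fix ownership of the prometheus data directory
  ansible.builtin.file:
    path: /mnt/vcapstore/prometheus
    state: directory
    owner: prometheus
    group: prometheus
    recurse: true
```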
The Fail2Ban Issue is an odd one, but could simply be a product of being one of the first packages and services to install. It's working now, and I didn't really do anything except comment it out on the initial run and then add it back.
The DNS related issues (one, two) were definitely due to my recent DNS work. I failed to take the repaving context into consideration. It's a tricky one because my LAN only has 1 DNS resolver; if I had 2, this setup wouldn't be so sensitive (I need functional DNS to perform the repave). I would like my unbound configuration to specify the static IP address, but the machine needs to be at that IP... and in order for my machine to have the static IP, I need working DNS.
The Promtail Issue is one of those situations where I didn't realize that my code depended on actions performed in another tasks file. In this case, it was the addition of the Grafana apt repository.
The Loki Issue is a puzzle so far. I can't find a service definition. I'll need to cross-reference the version with the one that was running before (if I captured that information).
The Nginx Issue was again me failing to test things correctly after I made structural changes to the code. Including it back worked without any fuss.
Both the Git Issue and Prometheus Issue were a good reminder that I've had an unrealized implicit dependency on the order in which I create users, because this determines their UID/GID assignments. It happened in a different order this time around, which meant the UID/GIDs on the external USB disk (which doesn't get repaved since it holds persistent data) weren't aligned with the UID/GIDs of the users that were recreated as part of the repave.
Stuff being bound to IPv6 isn't really an issue, but I don't quite understand why that's happening.
Splitting the device setup into a series of playbook runs might help reveal where dependencies really are. Currently, there are 3 playbooks:
bootstrap-playbook.yml - Gets executed once per repave. Its main responsibility is to create the ansible user and fetch my public ssh keys from GitHub so I can ditch the bootstrap pi user that uses a password to authenticate.
main-playbook.yml - Does everything else!
update-keys-playbook.yml - Only needs to run if I've added a new key pair to GitHub.

The main-playbook.yml does a lot of things. All the handlers to restart services reveal this. If I need a service to be restarted for subsequent tasks to run, that seems like a good seam for a new playbook to be created.
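For example, the DNS pieces could be carved out into their own run, roughly like this (the playbook and tasks file names here are hypothetical, not what's in the repo):

```yaml
# dns-playbook.yml - hypothetical sketch of a seam: get DNS fully configured
# and restarted in its own run, so later plays can rely on name resolution.
- name: Configure DNS services
  hosts: pippin
  become: true
  tasks:
    - name: Configure dnsmasq and unbound
      ansible.builtin.include_tasks: tasks/dns.yml   # hypothetical file
```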
Some of the issues have been manually resolved but don't yet have an implementation to sort out future repaves. Besides Loki, all the issues are self-made, due to the number of changes I've made in the last 3 months. A secondary DNS server would really simplify things because I wouldn't need to do a resolver dance when I switch from a DHCP-assigned network config to a static config.
Calling this complete. Will tinker away at the incomplete sub-tasks and reference this issue if/when they get resolved.
I believe that because I hadn't updated apt when attempting to install Loki, I conflicted with another package named loki. A full reinstall was able to get me the correct loki with a systemd unit and config file. Unfortunately, it appears that it never created the loki user, and re-installation doesn't end up doing it. I created a loki user manually and the service is up and running again.
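To avoid the manual step on the next repave, something like this could create the user up front (the shell and home settings are guesses at what the package expects):

```yaml
# Sketch only: make sure the loki system user exists before the service starts.
- name: Ensure the loki system user exists
  ansible.builtin.user:
    name: loki
    system: true
    create_home: false
    shell: /usr/sbin/nologin
```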
Yay for Repaving!
As much as possible is documented inline in this issue template. In case of problems you may find help by viewing all the previous repave issues. Have fun!
Things to do with the existing build
[x] Enable DHCP on the router, remove port mapping and statically assign network to PC
This is very important if repaving from the Windows PC. It being bound to 192.168.2.12 is necessary for the automation to work. Changing the DNS should be sufficient.
[x] Shutdown PI
Make sure the USB drive has spun down before doing any work.
sudo shutdown -h now
[x] Create SD card with the latest Raspberry Pi OS
Using the SD card in the now powered down PI.
The new installer has options to enable SSH and create a user.
installer download
Note: check if the underlying Debian distribution is changing, as this might result in some issues in the playbook execution.
The Bookworm 64-bit lite image seems to work for now. Note: as of v1.8.4 of the Imager software, ensure not to select "no filtering" in the Raspberry Pi Device filter.

Post OS install steps on desktop
[x] Ensure a working ansible environment
This will exercise the asdf setup.
[x] Turn on the PI and note the IP obtained from the Router
[x] Clean up old host keys
The new instance will have new host keys so to ensure host key warning messages don't distract us from the repaving, run the following:
[x] Transfer local public ssh key to PI
In order to avoid the use of sshpass, copy the current session's public ssh key to the .ssh/authorized_keys of the pi user on the PI. This user is only necessary to run the bootstrap playbook (which creates an admin ansible user) and will be subsequently cleaned up.
ssh-copy-id pi@<pi ip>
[x] Bootstrap with Ansible
Run ./ansible.sh and select the bootstrap-playbook.yml
[x] Add the PI port forwarding
Needed for the certbot ACME challenge in the next step.
[x] Complete full configuration
Run ./ansible.sh and select the main-playbook.yml
[x] Reboot PI
[x] Re-add port mapping to the static IP
[x] Disable DHCP on the router
[x] Deploy goodenoughmoney.com
[x] Clean up host key for ephemeral IP
Remove host key reference to the temporary IP that was used to bootstrap the device. This cleanup will ensure that an error won't occur in the next refresh if the same IP is used again.
[x] Make this template slightly better
How Do I Know I Am Done?
[x] https://www.goodenoughmoney.com/ displays stuff
[x] https://home.scottmuc.com/music/ loads navidrome and the music is playable
[x] http://192.168.2.10:9090/ loads and has data
[x] http://192.168.2.10:3000/ loads and has data
[x] ipconfig /release and then ipconfig /renew works
[x] nslookup analytics.google.com is refused
[x] Print out newly repaved machine details
cat /etc/os-release && uname -a