jimklimov opened 10 months ago
Posted a nut-website update for this; the website re-rendition is pending...
Thanks to suggestions in offline parts of this discussion, several text resources of the NUT project should now (or soon) suggest that users/contributors "star" it on GitHub as a metric useful for sponsor consideration, including:
Updated DO URLs with a referral campaign ID which, if all goes well, gives bonus credits both to new DO users and to the NUT CI farm account.
Bookmarking https://www.digitalocean.com/blog/custom-images et al. on the subject. For the purposes of eventually writing an article about this setup, we may as well start here...
According to the fine print in the scary official docs, DigitalOcean VMs can only use "custom images" in one of a number of virtual-HDD formats, carrying an ext3/ext4 filesystem for DO add-ons to barge into for management.
In practice, uploading an OpenIndiana Hipster "cloud" image (also possible by providing a URL to an image file on the Internet; see above for some collections) sort of worked: the status remained "pending", but a VM could be made from it. However, a follow-up OmniOS image failed ("exceeded some limit") - I supposed that after finishing the setup with one custom image it could be nuked and another used in its place. UPDATE: you just have to wait a surprisingly long time, some 15-20 minutes, and additional images suddenly become "Uploaded".
The OI image could be loaded... but that's it: the logo is visible on the DO Rescue Console, as well as some early boot-loader lines ending with a list of supported consoles. I assume it went to the `ttya` console as present in the hardware, but the DO UI does not make it accessible, and I did not quickly find whether there is a REST API or SSH tunnel into serial ports. The console does not come up quickly enough after a VM (re-)boot for any interaction with the loader, if it offers any.
It probably booted, since I could later see an `rpool/swap` twice the size of the VM RAM, and the `rpool` occupied the whole VM disk (auto-sizing).
The VM can however be rebooted with a (DO-provided) Rescue ISO, based on Ubuntu 18.04 LTS with ZFS support - which is sufficient to send over the existing VM contents from the original OI VM on Fosshost.
The rescue live image allows installing APT packages, such as `mc` (file manager and editor) and `mbuffer` (to optimize zfs-send/recv). The menu walks through adding SSH public keys (it can import them from e.g. GitHub by username).
Note that if your client system uses `screen`, `tmux` or `byobu`, new SSH connections would get the menu again. To get a shell - interactive, or for scripting counterparts like `rsync` and `zfs recv` - `export TERM=vt220` in your `screen` session (the latter is useful to keep the long replication run independent of my laptop's connectivity to the Fosshost/DigitalOcean VMs).
SSH keys can be imported with a helper:
```
#rescue# ssh-import-id-gh jimklimov
2023-12-10 21:32:18,069 INFO Already authorized ['2048', 'SHA256:Q/ouGDQn0HUZKVEIkHnC3c+POG1r03EVeRr81yP/TEoQ', 'jimklimov@github/10826393', '[RSA]']
```
...
The imported keys land in `~/.ssh/authorized_keys`. Later (e.g. in the `screen` session on the original VM which would send a lot of data), you can add non-default (e.g. one-time) keys with:
```
#origin# eval `ssh-agent`
#origin# ssh-add ~/.ssh/id_rsa_custom_key
```
Make the rescue userland convenient:

```
#rescue# apt install mc mbuffer
```

Using `mbuffer` on at least one side (or both, to smooth out network latency) is recommended, so that something useful happens while at least one of the sides is in its streaming phase.

I can import the cloud-OI ZFS pool into the Linux Rescue CD session:
```
#rescue# zpool import
   pool: rpool
     id: 7186602345686254327
  state: ONLINE
 status: The pool was last accessed by another system.
 action: The pool can be imported using its name or numeric identifier and the `-f' flag.
    see: http://zfsonlinux.org/msg/ZFS-8000-EY
 config:

        rpool       ONLINE
          vda       ONLINE
```
```
#rescue# zpool import -R /a -N -f rpool
#rescue# zfs list
NAME                  USED  AVAIL  REFER  MOUNTPOINT
rpool                34.1G   276G   204K  /rpool
rpool/ROOT           1.13G   276G   184K  legacy
rpool/ROOT/c936500e  1.13G   276G  1.13G  legacy
rpool/export          384K   276G   200K  /export
rpool/export/home     184K   276G   184K  /export/home
rpool/swap           33.0G   309G   104K  -
```
A kernel core-dump area is missing, compared to the original VM... adding one per best practice:

```
#origin# zfs get -s local all rpool/dump
NAME        PROPERTY        VALUE  SOURCE
rpool/dump  volsize         1.46G  local
rpool/dump  checksum        off    local
rpool/dump  compression     off    local
rpool/dump  refreservation  none   local
rpool/dump  dedup           off    local

#rescue# zfs create -V 2G -o checksum=off -o compression=off -o refreservation=none -o dedup=off rpool/dump
```
To receive ZFS streams from the running OI into the freshly prepared cloud-OI image, the ZFS features had to be enabled on the pool (all are disabled by default), since some are used in the replication stream:
```
### What is there initially?
#rescue# zpool get all
NAME   PROPERTY                       VALUE                SOURCE
rpool  size                           320G                 -
rpool  capacity                       0%                   -
rpool  altroot                        -                    default
rpool  health                         ONLINE               -
rpool  guid                           7186602345686254327  -
rpool  version                        -                    default
rpool  bootfs                         rpool/ROOT/c936500e  local
rpool  delegation                     on                   default
rpool  autoreplace                    off                  default
rpool  cachefile                      -                    default
rpool  failmode                       wait                 default
rpool  listsnapshots                  off                  default
rpool  autoexpand                     off                  default
rpool  dedupditto                     0                    default
rpool  dedupratio                     1.00x                -
rpool  free                           318G                 -
rpool  allocated                      1.13G                -
rpool  readonly                       off                  -
rpool  ashift                         12                   local
rpool  comment                        -                    default
rpool  expandsize                     -                    -
rpool  freeing                        0                    -
rpool  fragmentation                  -                    -
rpool  leaked                         0                    -
rpool  multihost                      off                  default
rpool  feature@async_destroy          disabled             local
rpool  feature@empty_bpobj            disabled             local
rpool  feature@lz4_compress           disabled             local
rpool  feature@multi_vdev_crash_dump  disabled             local
rpool  feature@spacemap_histogram     disabled             local
rpool  feature@enabled_txg            disabled             local
rpool  feature@hole_birth             disabled             local
rpool  feature@extensible_dataset     disabled             local
rpool  feature@embedded_data          disabled             local
rpool  feature@bookmarks              disabled             local
rpool  feature@filesystem_limits      disabled             local
rpool  feature@large_blocks           disabled             local
rpool  feature@large_dnode            disabled             local
rpool  feature@sha512                 disabled             local
rpool  feature@skein                  disabled             local
rpool  feature@edonr                  disabled             local
rpool  feature@userobj_accounting     disabled             local

### Enable all features this pool knows about:
#rescue# zpool get all | grep feature@ | awk '{print $2}' | while read F ; do zpool set $F=enabled rpool ; done
```
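To double-check that the loop caught everything, the same `zpool get` output can be filtered for features still not enabled. A couple of sample lines stand in for the real output here:

```shell
# Sample lines standing in for real `zpool get all` output
printf '%s\n' \
  'rpool  feature@lz4_compress  disabled  local' \
  'rpool  feature@sha512        enabled   local' \
| awk '$2 ~ /^feature@/ && $3 != "enabled" { print $2 }'
# prints: feature@lz4_compress
```

On the live system, pipe `zpool get all rpool` into the same `awk`; an empty result means every feature is enabled.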
On the original VM, snapshot all datasets recursively so whole data trees can be easily sent over (note that we then remove some snapshots, e.g. for the swap/dump areas, which would otherwise waste a lot of space over time by holding back blocks of obsolete swap data):
```
#origin# zfs snapshot -r rpool@20231210-01
#origin# zfs destroy rpool/swap@20231210-01 &
#origin# zfs destroy rpool/dump@20231210-01 &
```
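With more swap/dump-like volumes around, the snapshots to drop can be picked out of `zfs list` output by pattern. The sample names below stand in for real `zfs list -H -o name -t snapshot` output, and `echo` keeps this a dry run:

```shell
# Dry run: print the destroy commands instead of executing them
printf '%s\n' rpool/swap@20231210-01 rpool/dump@20231210-01 rpool/ROOT@20231210-01 \
| grep -E '/(swap|dump)@' \
| while read -r S ; do echo zfs destroy "$S" ; done
# prints:
# zfs destroy rpool/swap@20231210-01
# zfs destroy rpool/dump@20231210-01
```

Drop the `echo` once the selection looks right.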
On the receiving VM, move the existing `rpool/ROOT` out of the way so the new one can land (for kicks, the cloud image's boot environment can be `zfs rename`'d back into the fold after replication is complete). Also prepare to maximally compress the received rootfs, so it does not occupy too much in its new home (this is not something we write often, so slower gzip-9 writes can be tolerated):
```
#rescue# zfs rename rpool/ROOT{,x} ; while ! zfs set compression=gzip-9 rpool/ROOT ; do sleep 0.2 || break ; done
```
Send over the data (from the prepared `screen` session on the origin server), e.g.:
```
### Do not let other work of the origin server preempt the replication
#origin# renice -n -20 $$

#origin# zfs send -Lce -R rpool/ROOT@20231210-01 | mbuffer | ssh root@rescue "mbuffer | zfs recv -vFnd rpool"
```
Remove the `-n` from `zfs recv` after initial experiments confirm that it receives what you want where you want it, and re-run. With sufficiently large machines and slow source hosting, expect the transfer to take some hours (I saw 4-8 Mb/s in the streaming phase for large increments, and quite a bit of quiet time for the enumeration of almost-empty regular snapshots - work with ZFS metadata has a cost).
Note that one of the benefits of ZFS (and of the non-automatic snapshots used here) is that it is easy to catch up later, sending over the data which the original server generated and wrote during the replication. You can keep it working until the last minutes of the migration.
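The catch-up pass could look like the sketch below (snapshot names and the `rescue` host are assumptions; `echo` keeps it a dry run printing the commands to execute):

```shell
# Dry-run sketch of an incremental catch-up replication:
# take a newer recursive snapshot, then send only the delta since the old one.
SNAP_OLD=20231210-01 ; SNAP_NEW=20231210-02
echo "zfs snapshot -r rpool@$SNAP_NEW"
echo "zfs send -Lce -R -I @$SNAP_OLD rpool/ROOT@$SNAP_NEW | mbuffer | ssh root@rescue 'mbuffer | zfs recv -vFd rpool'"
```

The `-I @old fs@new` form sends all intermediate snapshots between the two, so the receiver ends up with the same snapshot history.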
OI TODO (after the transfers complete):

* `zfs clone` and `zfs promote` the anticipated rootfs; mount it to adjust networking info, and perhaps temporarily disable auto-start of ZFS replication (do not confuse the backup hosts, if any) and auto-start of local zones etc. (so the extra Jenkins does not confuse the world too early on, and generally so that experimental reboots turn around safely quicker)
* re-apply the received `znapzend` replication settings locally (run `znapzendzetup list` on the original to get a list of datasets to check), e.g.:

```
:; zfs get -s received all rpool/{ROOT,export,export/home/abuild/.ccache,zones{,-nosnap}} \
   | grep znapzend | while read P K V S ; do zfs set $K="$V" $P & done
```

* check `rpool/boot/`, which is in the `rpool` dataset and has the boot-loader configs; update `menu.lst`
* `zpool set bootfs=...`
* `touch reconfigure` in the new rootfs (pick up changed hardware on boot)
* adjust `/etc/dladm/datalink.conf` (if using virtual links, etherstubs, etc.), `/etc/hostname*`, `/etc/defaultrouter` etc.
* set the console (`text` first here on DO) in `/boot/solaris/bootenv.rc` and/or `/boot/defaults/loader.conf`
WARNING: Per https://www.illumos.org/issues/14526 and personal and community practice, it seems that a "slow reboot" of illumos VMs on QEMU-6.x (and DigitalOcean) misbehaves and hangs: the virtual hardware is not power-cycled. A power-off/on cycle through the UI (and probably the REST API) does work. Other kernels seem not to be impacted.
Wondering if there are QEMU HW watchdogs on DO...
UPDATE: It took about 2 hours for the reboot... to actually take place. At least it would not be stuck for eternity in case of unattended crashes...
The metadata-agent seems buildable and installable; it logged the SSH keys on the console after the service manifest import.
As of this writing, the NUT CI Jenkins controller runs on DigitalOcean - and feels a lot snappier in browsing and SSH management. The older Fosshost VMs are alive and used as its build agents (just the container with the old production Jenkins controller no longer auto-boots); with the holidays upon us, it may take time to replicate them onto DO.
The Jenkins SSH Build Agent setups involved here were copied on the controller (as XML files) and updated to tap into the different "host" and "port" (so that the original definitions can in time be used for replicas on DO); due to trust settings, the `~jenkins/.ssh/known_hosts` on the new controller had to be updated with the "new" remote system fingerprints. Otherwise it went smoothly.
Similarly, existing Jenkins swarm agents from community PCs had to be taught the new DNS name (some had it in /etc/hosts) but otherwise connected OK.
Another limitation seen with "custom images" is that IPv6 is not offered to those VMs.
Generally all VMs get random (hopefully persistent) public IPv4 addresses from various subnets. It is also possible to request an interconnect VLAN for one's VMs co-located in the same data center and have it attached (with virtual IP addresses) to another `vioifX` interface on each of your VMs: it is supposed to be faster and free (regarding traffic quotas). For the Jenkins controller, which talks to the world (and enjoys an off-hosting backup at a maintainer's home server), having a substantial monthly traffic quota is important; for builders (hosted on DO) that would primarily talk to the controller over the common VLAN - not so much (just OS upgrades?).
Another note regards pricing: resources that "exist" are billed whether they run or not (e.g. turned-off VMs still reserve CPU/RAM so they can run on demand; dormant storage for custom images is consumed even if they are not active filesystems; etc.). The hourly prices apply to resources spawned and destroyed within a month; once the monthly-rate total price for an item is reached, that flat rate is applied instead.
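The cap works out as in the sketch below (the rates are hypothetical, not DO's actual price list):

```shell
# Hypothetical rates: $0.089/hour, $60/month cap, 720 hours in a month.
# Billing is the smaller of accumulated hourly charges and the monthly rate.
awk -v hours=720 -v hourly=0.089 -v cap=60 \
  'BEGIN { t = hours * hourly; printf "USD %.2f\n", (t < cap ? t : cap) }'
# prints: USD 60.00
```

A droplet destroyed after, say, 100 hours would instead be billed the hourly total, which stays under the cap.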
Spinning up the Debian-based Linux builder (with many containers for various Linux systems) with ZFS, to be consistent across the board, was an adventure.
* `zpool create bpool` (with dumbed-down options so GRUB can read the boot pool)
* `clear ; stty size` to check the current display size (was 128x48 for me) and `stty rows 45` to reduce it a bit. Running a full-screen program like `mc` helps gauge if you got it right.

One more potential caveat: while DigitalOcean provides VPC network segments for free intercomms of a group of droplets, it assigns the IP addresses to those itself and does not let the guest use any others. This causes some hassle when importing a set of VMs which originally used different IP addresses on the intercomm VLAN.
Added replicas of more existing VMs: FreeBSD 12 (needed to use a seed image; OI did not cut it - the ZFS options in its pool were too new, so the older build of the BSD loader was not too eager to find the pool) and OmniOS (relatively straightforward with the OI image). Also keep in mind that the (old version of the?) FreeBSD loader rejected a gzip-9 compressed `zroot/ROOT` location.
Added a replica of the OpenBSD 6.5 VM, as an example of a relatively dated system in the CI, which went decently well as a `dd` stream of the local VM's vHDD into a DO recovery console session:

```
tgt-recovery# mbuffer -4 -I 12340 > /dev/vda
src# dd if=/dev/rsd0c | time nc myHostingIP 12340
```
...followed by a reboot and subsequent adaptation of the `/etc/myname` and `/etc/hostname.vio*` files.
I did not check if the DO recovery OS can mount BSD UFS partitions, it sufficed to log into the pre-configured system.
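If one does want to verify such a raw copy, comparing checksums of the source disk and the target device works; below is a trivial local stand-in for the idea (temp files instead of the real disk devices, plain `cat` instead of the dd/nc/mbuffer plumbing):

```shell
# Stand-in files; in reality you would cksum the source disk and /dev/vda
printf 'pretend-vHDD-contents' > /tmp/src.img
cat /tmp/src.img > /tmp/dst.img   # stands in for the dd | nc | mbuffer pipe
[ "$(cksum < /tmp/src.img)" = "$(cksum < /tmp/dst.img)" ] && echo "images match"
# prints: images match
```

Reading via `<` keeps the filename out of the `cksum` output so the two results are directly comparable.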
One caveat was that it had been installed with X11, and the DO console passed through neither the mouse nor advanced keyboard shortcuts. So: `rcctl disable xenodm` (to reduce the attack surface and resource waste).
FWIW, an `openbsd-7.3-2023-04-22.qcow2` "custom image" did not seem to boot: at least, there was no activity on the display and the IP address did not come up.
Follow-up to #869 and #1729: since Dec 2022 NUT has been accepted into DO sponsorship, although at the lowest tier (aka "testing phase"), which did not allow for sufficiently "strong" machines to migrate all workloads from the Fosshost VMs until we took steps to promote the relationship in NUT media materials. This fell through the cracks a bit due to other project endeavours - but, as I was recently reminded, there is some follow-up to do on our side.
https://opensource.nyc3.cdn.digitaloceanspaces.com/attribution/index.html
https://www.digitalocean.com/open-source/credits-for-projects