
CI: Use VMs provided by DigitalOcean FOSS support effort, and document the lessons learned #2192

Open jimklimov opened 10 months ago

jimklimov commented 10 months ago

Follow-up to #869 and #1729: since Dec 2022 NUT has been accepted into DO sponsorship, although at the lowest tier (aka "testing phase"), which did not allow for sufficiently "strong" machines to migrate all workloads from the Fosshost VMs until we took steps to promote the relationship in NUT media materials. This fell through the cracks a bit due to other project endeavours, but as I was recently reminded, there is some follow-up to do on our side.

In order to keep the open source sponsorship program running, it means a lot when you are able to do your best in sharing how DigitalOcean has supported your community. This can come in the form of social media mentions or, better yet, a technical walkthrough of how the infrastructure has been set up to satisfy your specific needs, so others may learn and do the same. Anything of this sort would be greatly appreciated to keep this going. For git provider mentions, such as adding our logo to your ReadMe file, please see the Sponsorship Attribution Guide, which includes different versions of the DigitalOcean logo, link suggestions, and ready-to-use code snippets.

... the testing phase in order to understand that project maintainers are not just looking to get support, but are also willing to work with us as a community by fulfilling the deliverables. Even though we genuinely want to support the open source ecosystem, our humble ask in return for a credit grant should be worked on, and that's why we rolled out the testing phase. At least starting with social media mentions, adding the logo to the GitHub repository as well as the project's website if any, shows that the support means a lot to you.

https://opensource.nyc3.cdn.digitaloceanspaces.com/attribution/index.html

https://www.digitalocean.com/open-source/credits-for-projects

jimklimov commented 10 months ago

Posted a nut-website update for:

Website re-rendition pending...

jimklimov commented 10 months ago

Thanks to suggestions in offline parts of this discussion, several text resources of the NUT project should now (or soon) suggest that users/contributors "star" it on GitHub as a metric useful for sponsor consideration, including:

jimklimov commented 9 months ago

Updated DO URLs with a referral campaign ID which gives bonus credits both to new DO users and to the NUT CI farm account, if all goes well.

jimklimov commented 9 months ago

Bookmarking https://www.digitalocean.com/blog/custom-images et al. on the subject:

jimklimov commented 9 months ago

For the purposes of eventually making an article on this setup, I can as well start here...

According to the fine print in the scary official docs, DigitalOcean VMs can only use "custom images" in one of a number of virtual HDD formats, which carry an ext3/ext4 filesystem for DO add-ons to barge into for management.

In practice, uploading an OpenIndiana Hipster "cloud" image, also by providing a URL to an image file on the Internet (see above for some collections), sort of worked: the status remained "pending", but a VM could be made with it. However, a follow-up upload of an OmniOS image failed (exceeded some limit); I supposed that after ending the setups with one custom image, it could be nuked and another used in its place. UPDATE: you have to wait a surprisingly long time, some 15-20 minutes, and then additional images suddenly become "Uploaded".
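For reference, uploading a custom image by URL can also be done through the DO REST API, roughly like this (a sketch; the image name, source URL and region are placeholders, and a valid API token is assumed):

#client# curl -X POST "https://api.digitalocean.com/v2/images" \
    -H "Authorization: Bearer $DO_API_TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"name": "oi-hipster-cloud", "url": "https://example.org/path/to/image.qcow2", "distribution": "Unknown", "region": "ams3"}'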

The OI image could be loaded... but that's it: the logo is visible on the DO Rescue Console, as well as some early boot-loader lines ending with a list of supported consoles. I assume the output went to the ttya serial console as present in the virtual hardware, but the DO UI does not make it accessible, and I did not quickly find whether there is a REST API or SSH tunnel into serial ports. The console does not come up quickly enough after a VM (re-)boot for any interaction with the boot loader, if it offers any.

It probably booted, since I could later see an rpool/swap twice the size of the VM RAM, and the rpool occupied the whole VM disk (auto-sizing).

The VM can however be rebooted with a (DO-provided) Rescue ISO, based on Ubuntu 18.04 LTS with ZFS support, which is sufficient to send over the existing VM contents from the original OI VM on Fosshost.

The rescue live image allows installing APT packages, such as mc (file manager and editor) and mbuffer (to optimize zfs send/recv). Its menu walks through adding SSH public keys (they can be imported from e.g. GitHub by username).

Note that if your client system uses screen, tmux or byobu, new SSH connections will get that menu again. To get a shell, whether interactive or for scripting counterparts like rsync and zfs recv, export TERM=vt220 from your screen session (screen itself is useful here to keep the long replication run independent of my laptop's connectivity to the Fosshost/DigitalOcean VMs).
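For example, from inside the screen session (a sketch; the rescue droplet address is a placeholder):

#client# export TERM=vt220
#client# ssh root@rescue-droplet-ip   ### lands in a shell instead of the menu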

SSH keys can be imported with a helper:

#rescue# ssh-import-id-gh jimklimov
2023-12-10 21:32:18,069 INFO Already authorized ['2048', 'SHA256:Q/ouGDQn0HUZKVEIkHnC3c+POG1r03EVeRr81yP/TEoQ', 'jimklimov@github/10826393', '[RSA]']
...

Make the rescue userland convenient:

#rescue# apt install mc mbuffer

I can import the cloud-OI ZFS pool into the Linux Rescue CD session:

#rescue# zpool import
   pool: rpool
     id: 7186602345686254327
  state: ONLINE
 status: The pool was last accessed by another system.
 action: The pool can be imported using its name or numeric identifier and the `-f' flag.
    see: http://zfsonlinux.org/msg/ZFS-8000-EY
 config:

        rpool   ONLINE
          vda   ONLINE

#rescue# zpool import -R /a -N -f rpool

#rescue# zfs list
NAME                  USED  AVAIL  REFER  MOUNTPOINT
rpool                34.1G   276G   204K  /rpool
rpool/ROOT           1.13G   276G   184K  legacy
rpool/ROOT/c936500e  1.13G   276G  1.13G  legacy
rpool/export          384K   276G   200K  /export
rpool/export/home     184K   276G   184K  /export/home
rpool/swap           33.0G   309G   104K  -

A kernel core-dump area is missing compared to the original VM... adding one per best practice:

#origin# zfs get -s local all rpool/dump
NAME        PROPERTY                        VALUE                           SOURCE
rpool/dump  volsize                         1.46G                           local
rpool/dump  checksum                        off                             local
rpool/dump  compression                     off                             local
rpool/dump  refreservation                  none                            local
rpool/dump  dedup                           off                             local

#rescue# zfs create -V 2G -o checksum=off -o compression=off -o refreservation=none -o dedup=off rpool/dump

To receive ZFS streams from the running OI into the freshly prepared cloud-OI image, the pool's ZFS features had to be enabled (all are disabled by default), since some are used in the replication stream:

### What is there initially?
#rescue# zpool get all
NAME   PROPERTY                       VALUE                          SOURCE
rpool  size                           320G                           -
rpool  capacity                       0%                             -
rpool  altroot                        -                              default
rpool  health                         ONLINE                         -
rpool  guid                           7186602345686254327            -
rpool  version                        -                              default
rpool  bootfs                         rpool/ROOT/c936500e            local
rpool  delegation                     on                             default
rpool  autoreplace                    off                            default
rpool  cachefile                      -                              default
rpool  failmode                       wait                           default
rpool  listsnapshots                  off                            default
rpool  autoexpand                     off                            default
rpool  dedupditto                     0                              default
rpool  dedupratio                     1.00x                          -
rpool  free                           318G                           -
rpool  allocated                      1.13G                          -
rpool  readonly                       off                            -
rpool  ashift                         12                             local
rpool  comment                        -                              default
rpool  expandsize                     -                              -
rpool  freeing                        0                              -
rpool  fragmentation                  -                              -
rpool  leaked                         0                              -
rpool  multihost                      off                            default
rpool  feature@async_destroy          disabled                       local
rpool  feature@empty_bpobj            disabled                       local
rpool  feature@lz4_compress           disabled                       local
rpool  feature@multi_vdev_crash_dump  disabled                       local
rpool  feature@spacemap_histogram     disabled                       local
rpool  feature@enabled_txg            disabled                       local
rpool  feature@hole_birth             disabled                       local
rpool  feature@extensible_dataset     disabled                       local
rpool  feature@embedded_data          disabled                       local
rpool  feature@bookmarks              disabled                       local
rpool  feature@filesystem_limits      disabled                       local
rpool  feature@large_blocks           disabled                       local
rpool  feature@large_dnode            disabled                       local
rpool  feature@sha512                 disabled                       local
rpool  feature@skein                  disabled                       local
rpool  feature@edonr                  disabled                       local
rpool  feature@userobj_accounting     disabled                       local

### Enable all features this pool knows about:
#rescue# zpool get all | grep feature@ | awk '{print $2}' | while read F ; do zpool set $F=enabled rpool ; done

On the original VM, snapshot all datasets recursively, so whole data trees can be easily sent over (note that we then remove some snapshots, like those for the swap/dump areas, which would otherwise waste a lot of space over time by holding back blocks of obsolete swap data):

#origin# zfs snapshot -r rpool@20231210-01
#origin# zfs destroy rpool/swap@20231210-01&
#origin# zfs destroy rpool/dump@20231210-01&

On the receiving VM, move the existing rpool/ROOT out of the way so the new one can land (for kicks, the cloud image's boot environment can be zfs renamed back into the fold after the replication is complete). Also prepare to maximally compress the received rootfs, so it does not occupy too much in its new home (this is not something we write too often, so slower gzip-9 writes can be tolerated):

### The loop below waits until "zfs recv" re-creates rpool/ROOT, then sets the compression on it:
#rescue# zfs rename rpool/ROOT{,x} ; while ! zfs set compression=gzip-9 rpool/ROOT ; do sleep 0.2 || break ; done

Send over the data (from the prepared screen session on the origin server), e.g.:

### Do not let other work of the origin server preempt the replication
#origin# renice -n -20 $$
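### Note: the "-n" below makes "zfs recv" a dry-run to verify the stream; drop it for the actual transfer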
#origin# zfs send -Lce -R rpool/ROOT@20231210-01 | mbuffer | ssh root@rescue "mbuffer | zfs recv -vFnd rpool"

With sufficiently large machines and slow source hosting, expect some hours for the transfer (I saw 4-8 Mb/s in the streaming phase for large increments, and quite a bit of quiet time for the enumeration of almost-empty regular snapshots: work with ZFS metadata has a cost).

Note that one of the benefits of ZFS (and of the non-automatic snapshots used here) is that it is easy to catch up later, sending over the data which the original server generates and writes during the replication. You can keep it working until the last minutes of the migration.
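A sketch of such a catch-up increment, assuming a later recursive snapshot named 20231210-02 taken just before the cut-over:

#origin# zfs snapshot -r rpool@20231210-02
#origin# zfs destroy rpool/swap@20231210-02 ; zfs destroy rpool/dump@20231210-02
#origin# zfs send -Lce -R -I rpool/ROOT@20231210-01 rpool/ROOT@20231210-02 | mbuffer | ssh root@rescue "mbuffer | zfs recv -vFd rpool"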

jimklimov commented 9 months ago

OI TODO (after the transfers complete):

jimklimov commented 9 months ago

WARNING: per https://www.illumos.org/issues/14526 and personal and community practice, it seems that a "slow reboot" of illumos VMs on QEMU 6.x (and DigitalOcean) misbehaves and hangs: the virtual hardware is not power-cycled. A power-off/on cycle through the UI (and probably the REST API) does work. Other kernels do not seem to be impacted.
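For reference, such a power cycle can be requested through the DO REST API roughly like this (a sketch; the droplet ID is a placeholder and a valid API token is assumed):

#client# curl -X POST "https://api.digitalocean.com/v2/droplets/123456789/actions" \
    -H "Authorization: Bearer $DO_API_TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"type": "power_cycle"}'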

Wondering if there are QEMU HW watchdogs on DO...

UPDATE: it took about 2 hours for the rebooting... to actually take place. At least it would not be stuck for eternity in case of unattended crashes...

jimklimov commented 9 months ago

The metadata-agent seems buildable and installable; it logged the SSH keys on the console after its service manifest was imported.
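Roughly, on the SMF side (a sketch; the manifest path and service name are assumptions from a typical setup, not verified against the agent's actual packaging):

#oi# svccfg import /lib/svc/manifest/system/metadata-agent.xml
#oi# svcadm enable metadata
#oi# svcs -p metadata    ### confirm the service and its process are online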

jimklimov commented 9 months ago

As of this writing, the NUT CI Jenkins controller runs on DigitalOcean, and feels a lot snappier in browsing and SSH management. The older Fosshost VMs are alive and used as its build agents (just the container with the old production Jenkins controller is not auto-booting anymore); with holidays abound, it may take some time until they are replicated onto DO.

The Jenkins SSH Build Agent setups involved here were copied on the controller (as XML files) and updated to tap into the different "host" and "port" (so that the original definitions can in time be used for replicas on DO); due to trust settings, the ~jenkins/.ssh/known_hosts file on the new controller also had to be updated with the "new" remote system fingerprints. Otherwise it went smoothly.
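For example, something along these lines on the new controller (the agent host name and SSH port are placeholders):

#controller# sudo -u jenkins ssh-keyscan -p 2222 old-agent.example.org >> ~jenkins/.ssh/known_hosts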

Similarly, existing Jenkins swarm agents from community PCs had to be taught the new DNS name (some had it in /etc/hosts), but otherwise they connected OK.

jimklimov commented 9 months ago

Another limitation seen with "custom images" is that IPv6 is not offered to those VMs.

Generally, all VMs get random (hopefully persistent) public IPv4 addresses from various subnets. It is possible to also request an interconnect VLAN for one's VMs co-located in the same data center, and have it attached (with virtual IP addresses) to another vioifX interface on each of your VMs; it is supposed to be faster and free (regarding traffic quotas). For the Jenkins controller, which talks to the world (and enjoys an off-hosting backup at a maintainer's home server), having a substantial monthly traffic quota is important. For builders (hosted on DO) that primarily talk to the controller over the common VLAN, not so much (just OS upgrades?).
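On the illumos VMs, attaching an address on such a VLAN would look roughly like this (a sketch; the interface name and address are assumptions for a particular droplet):

#oi# ipadm create-if vioif1
#oi# ipadm create-addr -T static -a local=10.10.0.5/24 vioif1/vlan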

Another note regards pricing: resources that "exist" are billed whether they run or not (e.g. turned-off VMs still reserve CPU/RAM so they can run on demand; dormant storage for custom images is consumed even if they are not active filesystems; etc.). The hourly prices apply to resources spawned and destroyed within a month; once an item's accumulated hourly charges reach its monthly-rate total, the flat monthly price is applied instead (e.g. a droplet destroyed after ten hours costs ten times its hourly rate, while one that exists all month costs just the monthly price).

jimklimov commented 9 months ago

Spinning up the Debian-based Linux builder (with many containers for various Linux systems) with ZFS, to be consistent across the board, was an adventure.

jimklimov commented 9 months ago

One more potential caveat: while DigitalOcean provides VPC network segments for free intercommunication of a group of droplets, it assigns the IP addresses to those itself and does not let the guest use any others. This causes some hassle when importing a set of VMs which originally used different IP addresses on their interconnect VLAN.

jimklimov commented 8 months ago

Added replicas of more existing VMs: FreeBSD 12 (needed to use a seed image; the OI image did not cut it, since the ZFS options in its pool were too new, so the older build of the BSD loader was not too eager to find the pool) and OmniOS (relatively straightforward with the OI image). Also keep in mind that the (old version of the?) FreeBSD loader rejected a gzip-9 compressed zroot/ROOT location.
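So for the BSD case, a loader-friendly compression had to be used instead of the gzip-9 trick above, set before the data lands (a sketch; note that changing the property only affects newly written blocks):

#rescue# zfs set compression=lz4 zroot/ROOT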

jimklimov commented 8 months ago

Added a replica of the OpenBSD 6.5 VM, as an example of a relatively dated system in the CI. This went decently well, as a dd stream of the local VM's vHDD into a DO recovery console session:

### Start the listener on the target (DO recovery) side first:
tgt-recovery# mbuffer -4 -I 12340 > /dev/vda

### ...then stream the raw disk from the source VM:
src# dd if=/dev/rsd0c | time nc myHostingIP 12340

...followed by a reboot and subsequent adaptation of /etc/myname and /etc/hostname.vio* files.
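Roughly, for that adaptation (a sketch; the host name and addresses are placeholders for the DO-assigned values):

#newvm# echo ci-openbsd.example.org > /etc/myname
#newvm# echo "inet 203.0.113.10 255.255.255.0" > /etc/hostname.vio0
#newvm# sh /etc/netstart vio0    ### or just reboot once more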

I did not check whether the DO recovery OS can mount BSD UFS partitions; it sufficed to log into the pre-configured system.

One caveat was that the original system had been installed with X11, and the DO console did not pass through the mouse nor advanced keyboard shortcuts; so, rcctl disable xenodm (to reduce the attack surface and resource waste).

FWIW, an openbsd-7.3-2023-04-22.qcow2 "custom image" did not seem to boot: at least, there was no activity on the display, and the IP address did not come up.