scylladb / scylla-machine-image

Apache License 2.0

(ami,azure):Install latest LTS kernel during image build #443

Closed yaronkaikov closed 1 year ago

yaronkaikov commented 1 year ago

During image creation we run `apt-get full-upgrade`, which also updates the kernel (added as part of https://github.com/scylladb/scylla-machine-image/commit/90340275b80a3a54dcfc1e5ec660481ba167d1c3).

Since we want to use only the LTS kernel version, this PR adds removal of the existing kernel package and installation of the LTS kernel before we run scylla_install_image.

Currently only AWS and Azure have an LTS kernel for 22.04; once GCP has one as well, we should add it too.

Ref: https://github.com/scylladb/scylladb/issues/13560

mykaul commented 1 year ago

I'm approving this now, but I think we should replace the shell scripting techniques (awk, grep, head, ...) with Python code. I will send a patch for it later.

Or Ansible...

mykaul commented 1 year ago

Do you need a reboot for this to take place?

fruch commented 1 year ago

> Do you need a reboot for this to take place?

The next boot will be when someone uses this image :)

I don't think we need to restart as part of creating the image.

yaronkaikov commented 1 year ago

> Do you need a reboot for this to take place?

Yes, but this happens during image creation anyway, so once you use the image you will get the right kernel.

yaronkaikov commented 1 year ago

Verified with https://jenkins.scylladb.com/job/scylla-master/job/releng-testing/job/ami/52/

[yaronkaikov@london]~/git/scylla-pkg/ansible (debug-new-servers)$ ssh scyllaadm@54.242.43.210
The authenticity of host '54.242.43.210 (54.242.43.210)' can't be established.
ED25519 key fingerprint is SHA256:H10isIs0kxplEiTQmBsG2cY8I1WM5aGuEkqfb3klnCk.
This key is not known by any other names
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added '54.242.43.210' (ED25519) to the list of known hosts.
Welcome to Ubuntu 22.04.2 LTS (GNU/Linux 5.15.0-1034-aws x86_64)

yaronkaikov commented 1 year ago

@fruch @syuu1228 Verification passed. Any other comments?

mykaul commented 1 year ago

Q: what will happen in the next apt-get update? Will we get the 5.19 kernel? Don't we need to do something like 'sudo apt remove linux-generic-hwe*' or something? (probably wrong package here!)

yaronkaikov commented 1 year ago

> Q: what will happen in the next apt-get update? Will we get the 5.19 kernel? Don't we need to do something like 'sudo apt remove linux-generic-hwe*' or something? (probably wrong package here!)

@mykaul Looks like it's not needed:

scyllaadm@ip-10-99-17-60:~$ sudo apt-get full-upgrade
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Calculating upgrade... Done
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
scyllaadm@ip-10-99-17-60:~$ sudo apt-get update
Hit:1 http://us-east-1.ec2.archive.ubuntu.com/ubuntu jammy InRelease
Get:2 http://us-east-1.ec2.archive.ubuntu.com/ubuntu jammy-updates InRelease [119 kB]
Get:3 http://us-east-1.ec2.archive.ubuntu.com/ubuntu jammy-backports InRelease [108 kB]
Get:4 http://security.ubuntu.com/ubuntu jammy-security InRelease [110 kB]  
Hit:5 https://downloads.scylladb.com/unstable/scylla/master/deb/unified/2023-04-15T03:03:29Z/scylladb-master stable InRelease
Get:6 http://us-east-1.ec2.archive.ubuntu.com/ubuntu jammy-updates/main amd64 Packages [1030 kB]
Get:7 http://us-east-1.ec2.archive.ubuntu.com/ubuntu jammy-updates/restricted amd64 Packages [816 kB]
Get:8 http://us-east-1.ec2.archive.ubuntu.com/ubuntu jammy-updates/universe amd64 Packages [902 kB]
Get:9 http://us-east-1.ec2.archive.ubuntu.com/ubuntu jammy-updates/multiverse amd64 Packages [24.1 kB]
Fetched 3109 kB in 1s (4417 kB/s)                        
Reading package lists... Done

Running `apt-get upgrade` or `apt-get full-upgrade` will not update the kernel.
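
One way to check this kind of claim without touching the system is a simulated upgrade (`apt-get -s`), which prints an `Inst <package> ...` line for every package it would install. The following is a minimal sketch, not part of this PR; the function name `pending_kernel_images` is hypothetical.

```shell
# Print the kernel image packages that a simulated upgrade would install.
# Reads `apt-get -s full-upgrade` output on stdin; prints nothing when
# the upgrade would leave the kernel alone.
pending_kernel_images() {
    awk '$1 == "Inst" && $2 ~ /^linux-image-/ { print $2 }'
}

# Example:
#   apt-get -s full-upgrade | pending_kernel_images
```

Empty output means no new kernel image is pending; any printed package name shows which kernel the next upgrade would pull in.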

syuu1228 commented 1 year ago

@yaronkaikov I realized that when a newer aws-lts-22.04 kernel is released, apt-get update && apt-get upgrade does not update the saved entry. So the saved entry works as pinning to a specific kernel version, and users won't get the newer LTS kernel when the instance is rebooted. Is this what we want, or do we just want to keep using the latest LTS kernel (and let users pick up a newer LTS kernel when the instance is rebooted)?

If we want the latter, I think the solution is not kernel version pinning via grub; instead we need to drop the linux-aws, linux-headers-aws and linux-image-aws metapackages (and also drop the non-LTS kernel if it is already installed).

Something like this:

ubuntu@ip-10-0-1-199:~$ sudo apt purge linux-aws linux-headers-aws linux-image-aws
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following packages will be REMOVED:
  linux-aws* linux-headers-aws* linux-image-aws*
0 upgraded, 0 newly installed, 3 to remove and 0 not upgraded.
After this operation, 36.9 kB disk space will be freed.
Do you want to continue? [Y/n] y

...

ubuntu@ip-10-0-1-199:~$ sudo apt update

...

ubuntu@ip-10-0-1-199:~$ sudo apt install linux-aws-lts-22.04
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  linux-aws-headers-5.15.0-1034 linux-headers-5.15.0-1034-aws
  linux-headers-aws-lts-22.04 linux-image-5.15.0-1034-aws
  linux-image-aws-lts-22.04 linux-modules-5.15.0-1034-aws
Suggested packages:
  fdutils linux-aws-doc-5.15.0 | linux-aws-source-5.15.0 linux-aws-tools
The following NEW packages will be installed:
  linux-aws-headers-5.15.0-1034 linux-aws-lts-22.04
  linux-headers-5.15.0-1034-aws linux-headers-aws-lts-22.04
  linux-image-5.15.0-1034-aws linux-image-aws-lts-22.04
  linux-modules-5.15.0-1034-aws
0 upgraded, 7 newly installed, 0 to remove and 86 not upgraded.
Need to get 49.6 MB of archives.
After this operation, 239 MB of additional disk space will be used.
Do you want to continue? [Y/n] y

ubuntu@ip-10-0-1-199:~$ sudo apt upgrade
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Calculating upgrade... Done
The following packages will be upgraded:
  apparmor apport apt base-files binutils binutils-common
  binutils-x86-64-linux-gnu ca-certificates cloud-init curl distro-info-data
  fwupd-signed grub-common grub-efi-amd64-bin grub-efi-amd64-signed grub-pc
  grub-pc-bin grub2-common initramfs-tools initramfs-tools-bin
  initramfs-tools-core intel-microcode isc-dhcp-client isc-dhcp-common
  libapparmor1 libapt-pkg6.0 libbinutils libbpf0 libctf-nobfd0 libctf0
  libcurl4 libgnutls30 libgssapi-krb5-2 libk5crypto3 libkrb5-3 libkrb5support0
  libksba8 libldap-2.5-0 libldap-common libnetplan0 libnss-systemd libnss3
  libpam-modules libpam-modules-bin libpam-runtime libpam-systemd libpam0g
  libpython3.10-minimal libpython3.10-stdlib libsasl2-2 libsasl2-modules
  libsasl2-modules-db libssl3 libsystemd0 libudev1 motd-news-config netplan.io
  openssh-client openssh-server openssh-sftp-server openssl python-apt-common
  python3-apport python3-apt python3-distupgrade python3-pkg-resources
  python3-problem-report python3-setuptools python3-tz python3-update-manager
  python3.10 python3.10-minimal shim-signed snapd sudo systemd
  systemd-hwe-hwdb systemd-sysv tar tzdata ubuntu-advantage-tools
  ubuntu-release-upgrader-core udev update-manager-core update-notifier-common
  xxd
86 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
37 standard LTS security updates
Need to get 69.1 MB of archives.
After this operation, 13.0 MB of additional disk space will be used.
Do you want to continue? [Y/n]

After the metapackage changes, apt-get upgrade no longer offers a non-LTS kernel (5.19.x).
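
For an unattended image build, the interactive session above could be scripted roughly as follows. This is a minimal sketch rather than the code merged in this PR: the function names and the `DRY_RUN` variable are hypothetical, and it assumes that other cloud flavors follow the same naming pattern as the aws packages shown above (`linux-<flavor>`, `linux-<flavor>-lts-22.04`).

```shell
# Rolling (non-LTS) kernel metapackages for a cloud flavor,
# matching the three packages purged above for aws.
rolling_metapackages() {
    printf 'linux-%s linux-headers-%s linux-image-%s\n' "$1" "$1" "$1"
}

# LTS metapackage name; assumes every flavor follows the aws pattern.
lts_metapackage() {
    printf 'linux-%s-lts-22.04\n' "$1"
}

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Purge the rolling metapackages, then install the LTS one.
switch_to_lts_kernel() {
    flavor=$1
    # Word splitting of the package list is intentional here.
    # shellcheck disable=SC2046
    run apt-get purge -y $(rolling_metapackages "$flavor")
    run apt-get update
    run apt-get install -y "$(lts_metapackage "$flavor")"
}
```

Running `DRY_RUN=1 switch_to_lts_kernel aws` prints the three apt-get commands without executing them. The sketch does not remove a non-LTS kernel image that is already installed, which the comment above notes is also needed.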

mykaul commented 1 year ago

I think @syuu1228 has explained above exactly what I was asking previously - and the solution - to use the LTS packages.

yaronkaikov commented 1 year ago

Verification:

AMI: https://jenkins.scylladb.com/job/scylla-master/job/releng-testing/job/ami/65/

GCP: https://jenkins.scylladb.com/job/scylla-master/job/releng-testing/job/gce-image/25/

Azure: https://jenkins.scylladb.com/job/scylla-master/job/releng-testing/job/azure-image/131/

Next-machine-image: https://jenkins.scylladb.com/job/scylla-master/job/releng-testing/job/next-machine-image/88/

Azure failed on the artifact test, but that is not related to this PR; it has been failing for a while. I verified in the logs that the kernel is the LTS one.

mykaul commented 1 year ago

Is that OK?

2023-04-29T17:05:51+00:00 longevity-lwt-3h-2023-1-db-node-8f772d9e-3   !NOTICE | kernel: Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.15.0-1035-aws root=PARTUUID=fe426c9e-119b-4a8b-a44c-9da082a00899 ro console=tty1 console=ttyS0 nvme_core.io_timeout=4294967295 net.ifnames=0 clocksource=tsc tsc=reliable panic=-1
2023-04-29T17:05:51+00:00 longevity-lwt-3h-2023-1-db-node-8f772d9e-3   !NOTICE | kernel: Unknown kernel command line parameters "BOOT_IMAGE=/boot/vmlinuz-5.15.0-1035-aws", will be passed to user space.
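
The log above shows the node booted /boot/vmlinuz-5.15.0-1035-aws, the same 5.15 series as the LTS kernel installed earlier in this thread. A minimal sketch for checking this from a kernel command line follows; the function names are hypothetical, and on a live node `uname -r` is a simpler alternative.

```shell
# Extract the kernel release from the BOOT_IMAGE entry of a kernel
# command line read on stdin (for example from /proc/cmdline).
booted_kernel_release() {
    tr ' ' '\n' | sed -n 's|^BOOT_IMAGE=.*/vmlinuz-||p' | head -n 1
}

# Succeed only if the given release belongs to the 5.15 series.
is_5_15_kernel() {
    case "$1" in
        5.15.*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example:
#   is_5_15_kernel "$(booted_kernel_release < /proc/cmdline)" && echo "LTS kernel"
```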