oamg / leapp-repository

Leapp repositories containing actors for the Leapp framework (https://github.com/oamg/leapp). Currently provides leapp repositories for in-place upgrades of RHEL systems.
Apache License 2.0
48 stars 144 forks source link

boot: check first partition offset on GRUB devices #1195

Closed MichalHe closed 4 months ago

MichalHe commented 5 months ago

Check that the first partition starts at least at 1MiB (2048 cylinders), as too small first-partition offsets lead to failures when doing grub2-install. The limit (1MiB) has been chosen as it is a common value set by the disk formatting tools nowadays.

Tips to reproduce

  1. Use fdisk -c=dos -u=cylinders </dev/drive> to force fdisk to allow you to format the target disk in a way that the first partition starts at an offset <= 1MiB
  2. Install rhel on the disk (tested with rhel7.9)

jira: https://issues.redhat.com/browse/RHEL-3341

github-actions[bot] commented 5 months ago

Thank you for contributing to the Leapp project!

Please note that every PR needs to comply with the Leapp Guidelines and must pass all tests in order to be mergeable. If you want to request a review or rebuild a package in copr, you can use following commands as a comment:

Packit will automatically schedule regression tests for this PR's build and latest upstream leapp build. If you need a different version of leapp, e.g. from PR#42, use /packit test oamg/leapp#42 Note that first time contributors cannot run tests automatically - they will be started by a reviewer.

It is possible to schedule specific on-demand tests as well. Currently 2 test sets are supported, beaker-minimal and kernel-rt, both can be used to be run on all upgrade paths or just a couple of specific ones. To launch on-demand tests with packit:

See other labels for particular jobs defined in the .packit.yaml file.

Please open ticket in case you experience technical problem with the CI. (RH internal only)

Note: In case there are problems with tests not being triggered automatically on new PR/commit or pending for a long time, please contact leapp-infra.

pirat89 commented 5 months ago

Why parted instead of fdisk? Smaller rpm (but both seem to be preinstalled).

@MichalHe parted does not have to be installed on the system and can be uninstalled. however, fdisk is part of util-linux and util-linux package must be always installed, so fdisk is always present. From this point, fdisk is better option as we do not increase the surface of dependencies - and in this project, we want to limit introduction of new dependencies unless they are necessary. However, I am ok to be less strict in this case as parted is really by default present - at least on some systems.

# dnf remove parted
Updating Subscription Management repositories.
Dependencies resolved.
==============================================================================================================================================================================================================================================
 Package                                                 Architecture                                            Version                                                       Repository                                                Size
==============================================================================================================================================================================================================================================
Removing:
 parted                                                  x86_64                                                  3.2-39.el8                                                    @System                                                  2.0 M

Transaction Summary
==============================================================================================================================================================================================================================================
Remove  1 Package

Freed space: 2.0 M
Is this ok [y/N]: n
Operation aborted.

[root@localhost ~]# rpm -qf `which fdisk`
util-linux-2.32.1-44.el8_9.1.x86_64
[root@localhost ~]# dnf remove util-linux-2.32.1-44.el8_9.1.x86_64
Updating Subscription Management repositories.
Error: 
 Problem: The operation would result in removing the following protected packages: systemd
(try to add '--skip-broken' to skip uninstallable packages or '--nobest' to use not only best candidate packages)
MichalHe commented 5 months ago

@pirat89 I have rewritten the code to use fdisk instead. I am against introducing needless dependencies and taking shortcuts.

Beware that I have not tested the new solution yet.

MichalHe commented 4 months ago

@pirat89 I've manually tested the patch on RHEL7. On a VM with a disk formatted in a way so that the offset of the first partition is 63 sectors (the old way causing problems) the upgrade is inhibited due to the offset of the first partition being too small. With our vagrant box the upgrade was not inhibited (the offset of the first partition is 2048sectors (=1M)).

@oamg/developers This code is ready for a review, imho.

pirat89 commented 4 months ago

The solution works, tested on my upgraded 6 -> 7 system:

============================================================
                      REPORT OVERVIEW                       
============================================================

Upgrade has been inhibited due to the following problems:
    ...
    3. Found GRUB devices with too little space reserved before the first partition
    ...
...

### report:
Risk Factor: high (inhibitor)
Title: Found GRUB devices with too little space reserved before the first partition
Summary: GRUB devices with too little space reserved before the first partition were found. Such devices can present a problem when updating the boot loader. Problematic devices:
- /dev/vda
Remediation: [hint] Reformat the devices so that there is at least 1024 kiB space before the first partition.
Key: 98eb0d6d72263cb21ac172fc6a612e27e48a9e91
pirat89 commented 4 months ago

Seems good:

Risk Factor: high (inhibitor)
Title: Found GRUB devices with too little space reserved before the first partition
Summary: On the system booting by using BIOS, the in-place upgrade fails when upgrading the GRUB2 bootloader if the boot disk's embedding area does not contain enough space for the core image installation. This results in a broken system, and can occur when the disk has been partitioned manually, for example using the RHEL 6 fdisk utility.

The list of devices with small embedding area:
- /dev/vda.
Remediation: [hint] We recommend to perform a fresh installation of the RHEL 8 system instead of performing the in-place upgrade.
Another possibility is to reformat the devices so that there is at least 1024 kiB space before the first partition. Note that this operation is not supported and does not have to be always possible.
Key: 98eb0d6d72263cb21ac172fc6a612e27e48a9e91

I am going to execute this yet on IBM Z and efi (as I have them by hand now) and merge it if passing. And possibly after tests are finished. Thanks!!!

pirat89 commented 4 months ago

@MichalHe I have bad karma I guess. It seems that CI caught FP inhibitor on AWS. we should check that. Checking logs, it seems to me that everything should ok, but I see the report.

pirat89 commented 4 months ago

The last one? :) Actually, the GPT (and the prepboot in general) is really the trouble maker in the end.

2024-04-23 20:01:03.901 INFO     PID: 11755 leapp.workflow.Checks: Executing actor check_first_partition_offset 
Process Process-386:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python2.7/site-packages/leapp/repository/actor_definition.py", line 74, in _do_run
    actor_instance.run(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/leapp/actors/__init__.py", line 289, in run
    self.process(*args)
  File "/usr/share/leapp-repository/repositories/system_upgrade/el7toel8/actors/checkfirstpartitionoffset/actor.py", line 24, in process
    check_first_partition_offset.check_first_partition_offset()
  File "/usr/share/leapp-repository/repositories/system_upgrade/el7toel8/actors/checkfirstpartitionoffset/libraries/check_first_partition_offset.py", line 19, in check_first_partition_offset
    first_partition = min(grub_dev.partitions, key=lambda partition: partition.start_offset)
ValueError: min() arg is an empty sequence
2024-04-23 20:01:03.952 ERROR    PID: 11755 leapp.workflow.Checks: Actor check_first_partition_offset has crashed: Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/leapp/repository/actor_definition.py", line 74, in _do_run
    actor_instance.run(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/leapp/actors/__init__.py", line 289, in run
    self.process(*args)
  File "/usr/share/leapp-repository/repositories/system_upgrade/el7toel8/actors/checkfirstpartitionoffset/actor.py", line 24, in process
    check_first_partition_offset.check_first_partition_offset()
  File "/usr/share/leapp-repository/repositories/system_upgrade/el7toel8/actors/checkfirstpartitionoffset/libraries/check_first_partition_offset.py", line 19, in check_first_partition_offset
    first_partition = min(grub_dev.partitions, key=lambda partition: partition.start_offset)
ValueError: min() arg is an empty sequence

The msg has no partitions at all:

----------------------------------------------------------------------
Stamp: 2024-04-23T20:00:59.709122Z
Actor: scan_grub_device_partition_layout
Phase: FactsCollection
Type: GRUBDevicePartitionLayout
Message_data:
{
    "device": "/dev/xvda",
    "partitions": []
}
----------------------------------------------------------------------

In the actor we can see that it's caused by using GPT (and prep boot partition) instead of MBR for BIOS boot.

    2024-04-23 20:00:59.682 DEBUG    PID: 15396 leapp.workflow.FactsCollection.scan_grub_device_partition_layout: External command has started: ['fdisk', '-l', '-u=sectors', u'/dev/xvda']
    2024-04-23 20:00:59.689 DEBUG    PID: 15396 leapp.workflow.FactsCollection.scan_grub_device_partition_layout: WARNING: fdisk GPT support is currently new, and therefore in an experimental phase. Use at your own discretion.
    2024-04-23 20:00:59.691 DEBUG    PID: 15396 leapp.workflow.FactsCollection.scan_grub_device_partition_layout: 
    2024-04-23 20:00:59.692 DEBUG    PID: 15396 leapp.workflow.FactsCollection.scan_grub_device_partition_layout: Disk /dev/xvda: 268.4 GB, 268435456000 bytes, 524288000 sectors
    2024-04-23 20:00:59.693 DEBUG    PID: 15396 leapp.workflow.FactsCollection.scan_grub_device_partition_layout: Units = sectors of 1 * 512 = 512 bytes
    2024-04-23 20:00:59.695 DEBUG    PID: 15396 leapp.workflow.FactsCollection.scan_grub_device_partition_layout: Sector size (logical/physical): 512 bytes / 512 bytes
    2024-04-23 20:00:59.696 DEBUG    PID: 15396 leapp.workflow.FactsCollection.scan_grub_device_partition_layout: I/O size (minimum/optimal): 512 bytes / 512 bytes
    2024-04-23 20:00:59.697 DEBUG    PID: 15396 leapp.workflow.FactsCollection.scan_grub_device_partition_layout: Disk label type: gpt
    2024-04-23 20:00:59.699 DEBUG    PID: 15396 leapp.workflow.FactsCollection.scan_grub_device_partition_layout: Disk identifier: 2D3B54F1-2833-4F75-9BB7-C3A39B53EE7C
    2024-04-23 20:00:59.699 DEBUG    PID: 15396 leapp.workflow.FactsCollection.scan_grub_device_partition_layout: 
    2024-04-23 20:00:59.700 DEBUG    PID: 15396 leapp.workflow.FactsCollection.scan_grub_device_partition_layout: 
    2024-04-23 20:00:59.701 DEBUG    PID: 15396 leapp.workflow.FactsCollection.scan_grub_device_partition_layout: #         Start          End    Size  Type            Name
    2024-04-23 20:00:59.703 DEBUG    PID: 15396 leapp.workflow.FactsCollection.scan_grub_device_partition_layout:  1         2048         4095      1M  BIOS boot       
    2024-04-23 20:00:59.704 DEBUG    PID: 15396 leapp.workflow.FactsCollection.scan_grub_device_partition_layout:  2         4096    524287965    250G  Linux filesyste 
    2024-04-23 20:00:59.707 DEBUG    PID: 15396 leapp.workflow.FactsCollection.scan_grub_device_partition_layout: External command has finished: ['fdisk', '-l', '-u=sectors', u'/dev/xvda']

As it is affecting even the CI, let's fix it before the merge too.

pirat89 commented 4 months ago

Created separate commit to handle the previousl described issue. The solution has 2 parts:

As described here:

Btw, disks with GPT could be also reported in some cases as "dos" instead:

I have also exteded a little bit original test cases, but the current new solution is not completely covered by unit-tests. I am testing this manually on my setups right now + CI set. So I believe we could call it good enough if all tests will pass. The early deliver allow us to discover possible issues sooner, so I am in favor of merging this under the current circumstances when we do not expect so much to extend the solution in future unless we find a good reason for it.

pirat89 commented 4 months ago

My manual testing results after the last changes: