linux-system-roles / storage

Ansible role for linux storage management
https://linux-system-roles.github.io/storage/
MIT License

feat: Added support for creating shared LVM setups #388

Closed. japokorn closed this 8 months ago.

japokorn commented 11 months ago

Enhancement: Support for creating shared VGs

Reason: Requested by GFS2

Result:
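For illustration only (this is not the PR's test or documentation), requesting a shared pool through the role might look roughly like this; the shared flag is the new option, while the disk name and the surrounding cluster/lvmlockd setup are assumptions handled elsewhere, e.g. by the ha_cluster role:

```yaml
# Hypothetical usage sketch: create a shared VG via the storage role.
# Assumes dlm and lvmlockd are already running on all cluster nodes.
- name: Create a shared LVM pool
  hosts: all
  roles:
    - role: linux-system-roles.storage
      vars:
        storage_pools:
          - name: vg1
            disks:
              - sdb          # illustrative disk name
            type: lvm
            shared: true     # the new option added by this PR
```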

codecov[bot] commented 11 months ago

Codecov Report

Attention: 6 lines in your changes are missing coverage. Please review.

Comparison is base (c4147d2) 13.67% compared to head (7d8b953) 13.65%. Report is 15 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| library/blivet.py | 0.00% | 6 Missing :warning: |
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #388      +/-   ##
==========================================
- Coverage   13.67%   13.65%   -0.03%
==========================================
  Files           8        8
  Lines        1733     1736       +3
  Branches       79       79
==========================================
  Hits          237      237
- Misses       1496     1499       +3
```

| Flag | Coverage Δ |
|---|---|
| sanity | 16.54% <ø> (ø) |

Flags with carried forward coverage won't be shown. See https://docs.codecov.io/docs/carryforward-flags for more details.


richm commented 10 months ago

ping - any updates?

japokorn commented 10 months ago

A new blivet version has been released today. I have incorporated all of the suggestions into the code and uncommented the test.

japokorn commented 10 months ago

We just found out that, for some reason, the ha_cluster role with the configuration used in this test fails to set up lvmlockd properly when the test is run locally. It does work when another machine is used. For now I have added a warning message to the test file and will try to resolve this with the ha_cluster role maintainers (since our usage of the cluster role is unusual).

richm commented 10 months ago

If you want to test this locally with tox-lsr and qemu, see https://linux-system-roles.github.io/contribute.html, section "Running tests with tox-lsr and qemu". I just updated the config file for CentOS to enable the resilientstorage repo: https://raw.githubusercontent.com/linux-system-roles/linux-system-roles.github.io/master/download/linux-system-roles.json

However, this fails with a pcsd error:

TASK [fedora.linux_system_roles.ha_cluster : Create a corosync.conf file content using pcs-0.10] ***
task path: /home/rmeggins/linux-system-roles/storage/.tox/ansible_collections/fedora/linux_system_roles/roles/ha_cluster/tasks/shell_pcs/pcs-cluster-setup-pcs-0.10.yml:3
Wednesday 18 October 2023  11:37:35 -0600 (0:00:00.043)       0:01:06.287 ***** 
fatal: [/home/rmeggins/.cache/linux-system-roles/centos-9.qcow2]: FAILED! => {
    "changed": true,
    "cmd": [
        "pcs",
        "cluster",
        "setup",
        "--corosync_conf",
        "/tmp/ansible.4bfisig0_ha_cluster_corosync_conf",
        "--overwrite",
        "--no-cluster-uuid",
        "--",
        "rhel9-1node",
        "/home/rmeggins/.cache/linux-system-roles/centos-9.qcow2"
    ],
    "delta": "0:00:00.519634",
    "end": "2023-10-18 13:37:36.208938",
    "rc": 1,
    "start": "2023-10-18 13:37:35.689304"
}

STDERR:

Warning: Unable to read the known-hosts file: No such file or directory: '/var/lib/pcsd/known-hosts'
No addresses specified for host '/home/rmeggins/.cache/linux-system-roles/centos-9.qcow2', using '/home/rmeggins/.cache/linux-system-roles/centos-9.qcow2'
Error: Unable to resolve addresses: '/home/rmeggins/.cache/linux-system-roles/centos-9.qcow2', use --force to override
Error: Errors have occurred, therefore pcs is unable to continue

The local qemu tests use the qcow file name as the hostname - there is probably some way to provide the explicit address of the local VM

japokorn commented 9 months ago

The local qemu tests use the qcow file name as the hostname - there is probably some way to provide the explicit address of the local VM

It looks like the IP address is the only issue preventing a proper run on localhost. Running the playbook with a command like ansible-playbook --connection=local -i localhost, cluster_set.yml (and then trying to create a shared VG) reproduces the issue, but ansible-playbook --connection=local -i 127.0.0.1, cluster_set.yml works as intended.

So this issue is not caused by the storage role.

richm commented 9 months ago

With the suggested fixes, I can run the test up until here on centos-9:

TASK [linux-system-roles.storage : Manage the pools and volumes to match the specified state] ***
task path: /home/rmeggins/linux-system-roles/storage/tests/roles/linux-system-roles.storage/tasks/main-blivet.yml:73
Thursday 09 November 2023  12:58:52 -0700 (0:00:00.017)       0:03:16.483 ***** 
fatal: [/home/rmeggins/.cache/linux-system-roles/centos-9.qcow2]: FAILED! => {
    "actions": [],
    "changed": false,
    "crypts": [],
    "leaves": [],
    "mounts": [],
    "packages": [],
    "pools": [],
    "volumes": []
}
MSG:

failed to set up pool 'vg1': __init__() got an unexpected keyword argument 'shared'
    def _create(self):
        if not self._device:
            members = self._manage_encryption(self._create_members())
            try:
                pool_device = self._blivet.new_vg(name=self._pool['name'], parents=members, shared=self._pool['shared'])
            except Exception as e:
                raise BlivetAnsibleError("failed to set up pool '%s': %s" % (self._pool['name'], str(e)))

What version of blivet has support for shared? Is it in CentOS 9 yet?

japokorn commented 9 months ago

What version of blivet has support for shared? Is it in CentOS 9 yet?

I have added a switch that skips the test, if needed, based on the blivet version, as suggested by vtrefny: https://github.com/linux-system-roles/storage/pull/388#discussion_r1377452558
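Not the exact check from the PR, but a version-gated skip in an Ansible test could look roughly like this (the 3.9.0 threshold and the module-level __version__ lookup are illustrative assumptions):

```yaml
# Hypothetical sketch of a blivet-version-gated skip, not the PR's actual code.
- name: Get blivet version
  command: python3 -c "import blivet; print(blivet.__version__)"
  register: blivet_version
  changed_when: false

- name: Skip the test if blivet is too old to support shared VGs
  meta: end_host
  when: blivet_version.stdout is version("3.9.0", "<")
```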

richm commented 9 months ago

OK, but is there some platform that has the correct version of blivet? Alternatively, if you have some copr blivet build that you are using, can you attach the log output from running the test with the right version of blivet?

japokorn commented 9 months ago

OK, but is there some platform that has the correct version of blivet? Alternatively, if you have some copr blivet build that you are using, can you attach the log output from running the test with the right version of blivet?

I am running the test (not skipped) on Fedora 38 with the latest blivet package (python3-blivet-3.8.2-99.20231127115915812391.3.9.devel.64.gfc7f3fc5.fc38.noarch)

richm commented 9 months ago

[citest]

japokorn commented 9 months ago

[citest]

richm commented 8 months ago

Looks like Fedora 39 has the right version of blivet. When I try your latest version like this: tox -e qemu-ansible-core-2.15 -- --image-name fedora-39 --log-level debug -- tests/tests_lvm_pool_shared.yml, I get this error:

TASK [fedora.linux_system_roles.ha_cluster : Create a corosync.conf file content using pcs-0.10] ***
...
fatal: [/home/rmeggins/.cache/linux-system-roles/fedora-39.qcow2]: FAILED! => {
    "changed": true,
    "cmd": [
        "pcs",
        "cluster",
        "setup",
        "--corosync_conf",
        "/tmp/ansible.cjhl1_x4_ha_cluster_corosync_conf",
        "--overwrite",
        "--no-cluster-uuid",
        "--",
        "rhel9-1node",
        "/home/rmeggins/.cache/linux-system-roles/fedora-39.qcow2"
    ],
    "delta": "0:00:01.327931",
    "end": "2023-12-06 18:25:28.852939",
    "rc": 1,
    "start": "2023-12-06 18:25:27.525008"
}

STDERR:

Warning: Unable to read the known-hosts file: No such file or directory: '/var/lib/pcsd/known-hosts'
No addresses specified for host '/home/rmeggins/.cache/linux-system-roles/fedora-39.qcow2', using '/home/rmeggins/.cache/linux-system-roles/fedora-39.qcow2'
Error: Unable to resolve addresses: '/home/rmeggins/.cache/linux-system-roles/fedora-39.qcow2', use --force to override
Error: Errors have occurred, therefore pcs is unable to continue

The problem is that runqemu uses the file name of the qcow2 file as the hostname.

richm commented 8 months ago

If I add this to the test:

    - name: Set up test environment for the ha_cluster role
      include_role:
        name: fedora.linux_system_roles.ha_cluster
        tasks_from: test_setup.yml

    - name: Create cluster
...

Then I get much farther, until here:

    - name: >-
        Create a disk device; specify disks as non-list mounted on
        {{ mount_location }}

...

TASK [linux-system-roles.storage : Manage the pools and volumes to match the specified state] ***
...
fatal: [/home/rmeggins/.cache/linux-system-roles/fedora-39.qcow2]: FAILED! => {
...
MSG:

Failed to commit changes to disk: Process reported exit code 3:   Using a shared lock type requires lvmlockd (lvm.conf use_lvmlockd.)
  Run `vgcreate --help' for more information.

I guess somewhere in the blivet module or blivet library it manages lvm.conf?

I think we need to change https://github.com/linux-system-roles/ha_cluster/blob/main/tasks/test_setup.yml#L9 to make it more generally applicable.

- name: Set node name to 'localhost' for single-node clusters
  set_fact:
    inventory_hostname: localhost  # noqa: var-naming
  when: ansible_play_hosts_all | length == 1

@tomjelinek @spetrosi I think the intention of this code is: "If inventory_hostname is not resolvable (i.e. it is a qcow2 path as used by tox -e qemu, or some sort of host alias like sut as used by baseos ci), then use localhost, as it will always be resolvable". The problem is that the "is the hostname resolvable" test is not easy to do, and even with getent hosts $name you don't know whether the user provided $name as some sort of alias that actually resolved to a real, but incorrect, hostname. In Jan's case, he is using an external managed host (not a local qcow2 image file) which has a real, resolvable hostname and IP address that he wants to use. I think we need to introduce a flag like ha_cluster_test_use_given_hostname:

- name: Set node name to 'localhost' for single-node clusters
  set_fact:
    inventory_hostname: localhost  # noqa: var-naming
  when:
    - ansible_play_hosts_all | length == 1
    - not ha_cluster_test_use_given_hostname | d(false)

Then

tox -e qemu-ansible-core-2.15 -- --image-name fedora-39 --log-level debug -e ha_cluster_test_use_given_hostname=true -- tests/tests_lvm_pool_shared.yml

wdyt?

tomjelinek commented 8 months ago

@richm You got the intention absolutely right.

Adding the proposed flag works for me. It would be nice if it could be tested (@japokorn?) before merging it into the ha_cluster role. A comment explaining that the flag is meant for other roles, and thus must be kept in place even though it is not used anywhere in the ha_cluster role itself, would also be helpful. Feel free to open a PR after testing, or let me know and I will do it myself.
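For example, the guard from the proposal above could carry that explanation directly in the task file (a sketch; the comment wording is illustrative):

```yaml
# Hypothetical sketch of the guard with the requested comment.
# ha_cluster_test_use_given_hostname is consumed by tests of other roles
# (e.g. the storage role's shared-VG test), so keep it even though it is
# not referenced anywhere else in the ha_cluster role itself.
- name: Set node name to 'localhost' for single-node clusters
  set_fact:
    inventory_hostname: localhost  # noqa: var-naming
  when:
    - ansible_play_hosts_all | length == 1
    - not ha_cluster_test_use_given_hostname | d(false)
```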

richm commented 8 months ago

@tomjelinek there's also an issue with lvmlockd; from man lvmlockd:

USAGE
   Initial set up
       Setting up LVM to use lvmlockd and a shared VG for the first time includes some one time set up steps:

   1. choose a lock manager
       dlm
        If dlm (or corosync) are already being used by other cluster software, then select dlm. dlm uses corosync which requires additional
        configuration beyond the scope of this document. See corosync and dlm documentation for instructions on configuration, set up and usage.

How do we choose the lock manager? What additional configuration is required by corosync and dlm? It seems like this is something we need to add to the ha_cluster role.

   2. configure hosts to use lvmlockd
       On all hosts running lvmlockd, configure lvm.conf:
       use_lvmlockd = 1

@japokorn where/how is this done? seems like something the storage role/blivet should do?
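For illustration only, setting the flag from a playbook could be as simple as a lineinfile edit of lvm.conf; whether blivet, the storage role, or the ha_cluster role should own this step is exactly the open question here:

```yaml
# Hypothetical sketch: enable lvmlockd in lvm.conf from a playbook.
- name: Enable use_lvmlockd in lvm.conf
  lineinfile:
    path: /etc/lvm/lvm.conf
    regexp: '^\s*#?\s*use_lvmlockd\s*='
    line: "        use_lvmlockd = 1"
    insertafter: '^global \{'
```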

   3. start lvmlockd
       Start the lvmlockd daemon.
       Use systemctl, a cluster resource agent, or run directly, e.g.
       systemctl start lvmlockd

this seems like something the ha_cluster role should do after it installs lvm2-lockd and dlm.

   4. start lock manager
...
       dlm
       Start the dlm and corosync daemons.
       Use systemctl, a cluster resource agent, or run directly, e.g.
       systemctl start corosync dlm

This also seems like something the ha_cluster role should do.

   5. create VG on shared devices
       vgcreate --shared <vgname> <devices>

the storage role does this

   6. start VG on all hosts
       vgchange --lock-start

       Shared VGs must be started before they are used.  Starting the VG performs lock manager initialization that is necessary  to  begin
       using locks (i.e.  creating and joining a lockspace).  Starting the VG may take some time, and until the start completes the VG may
       not be modified or activated.

@japokorn this seems like something the storage role should do?

   7. create and activate LVs
       Standard lvcreate and lvchange commands are used to create and activate LVs in a shared VG.

This also seems like something the storage role should do

   Normal start up and shut down
       After initial set up, start up and shut down include the following steps.  They can be performed directly or may be automated using
       systemd or a cluster resource manager/agents.

       • start lvmlockd
       • start lock manager
       • vgchange --lock-start
       • activate LVs in shared VGs

@tomjelinek this says ". . . may be automated using systemd or a cluster resource manager/agents." - is this something that the ha_cluster role can configure the cluster resource manager/agents to do?

tomjelinek commented 8 months ago

how to choose the lock manager?

Well, the documentation says that dlm should be used if corosync is in use. HA cluster uses corosync.

What additional configuration is required by corosync and dlm? Seems like this is something we need to add to the ha_cluster role.

I'm not aware of any configuration options in corosync related to dlm. And I'm not aware of any required dlm configuration, just run with the defaults.

"... may be automated using systemd or a cluster resource manager/agents." - is this something that the ha_cluster role can configure the cluster resource manager/agents to do?

It means: create cluster resources. So you just need to instruct the ha_cluster role to create the appropriate resources, ocf:pacemaker:controld and ocf:heartbeat:lvmlockd.
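For reference, a hedged sketch of the kind of ha_cluster variables that would create those two resources as clones (the resource ids and the clone setup are assumptions; the PR's actual test may differ):

```yaml
# Hypothetical sketch of ha_cluster variables, not copied from the PR's test.
ha_cluster_resource_primitives:
  - id: dlm
    agent: ocf:pacemaker:controld
  - id: lvmlockd
    agent: ocf:heartbeat:lvmlockd
ha_cluster_resource_clones:
  - resource_id: dlm
  - resource_id: lvmlockd
```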

richm commented 8 months ago

@tomjelinek afaict the test is setting the appropriate parameters/resources - https://github.com/linux-system-roles/storage/pull/388/files#diff-2892843b9952fe8a2e8f5867b7f5092369acfd8ae20990b1689a366c01b1584cR68-R82

Then maybe the reason it works in Jan's testing is that he has a "real" hostname and a real IP address, whereas in the baseos ci and local qemu testing the inventory_hostname is fake?

tomjelinek commented 8 months ago

@richm Yes, the variables look good. I have verified that the cluster is able to start the dlm and lvmlockd resources with such settings without any issues, provided it uses a real node name. If the cluster is set up with the 'localhost' node, dlm times out on start. I'm not sure why that happens. I already tried debugging this back in October, but I was unable to get any useful info from the dlm debug logs.