Attention: 6 lines in your changes are missing coverage. Please review.
Comparison is base (c4147d2) 13.67% compared to head (7d8b953) 13.65%. Report is 15 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| library/blivet.py | 0.00% | 6 Missing :warning: |
ping - any updates?
New blivet version has been released today. I have incorporated all suggestions into the code and uncommented the test.
We just found out that, for some reason, the ha_cluster role with the configuration used in this test fails to set up lvmlockd properly when the test is run locally. If another machine is used, it works. For now I have added a warning message to the test file and will try to resolve this with the ha_cluster role people (since our usage of the cluster role is unusual).
if you want to test this locally with tox-lsr and qemu - https://linux-system-roles.github.io/contribute.html - "Running tests with tox-lsr and qemu" - I just updated the config file for centos to enable the resilientstorage repo - https://raw.githubusercontent.com/linux-system-roles/linux-system-roles.github.io/master/download/linux-system-roles.json
However, this fails with a pcsd error:
TASK [fedora.linux_system_roles.ha_cluster : Create a corosync.conf file content using pcs-0.10] ***
task path: /home/rmeggins/linux-system-roles/storage/.tox/ansible_collections/fedora/linux_system_roles/roles/ha_cluster/tasks/shell_pcs/pcs-cluster-setup-pcs-0.10.yml:3
Wednesday 18 October 2023 11:37:35 -0600 (0:00:00.043) 0:01:06.287 *****
fatal: [/home/rmeggins/.cache/linux-system-roles/centos-9.qcow2]: FAILED! => {
"changed": true,
"cmd": [
"pcs",
"cluster",
"setup",
"--corosync_conf",
"/tmp/ansible.4bfisig0_ha_cluster_corosync_conf",
"--overwrite",
"--no-cluster-uuid",
"--",
"rhel9-1node",
"/home/rmeggins/.cache/linux-system-roles/centos-9.qcow2"
],
"delta": "0:00:00.519634",
"end": "2023-10-18 13:37:36.208938",
"rc": 1,
"start": "2023-10-18 13:37:35.689304"
}
STDERR:
Warning: Unable to read the known-hosts file: No such file or directory: '/var/lib/pcsd/known-hosts'
No addresses specified for host '/home/rmeggins/.cache/linux-system-roles/centos-9.qcow2', using '/home/rmeggins/.cache/linux-system-roles/centos-9.qcow2'
Error: Unable to resolve addresses: '/home/rmeggins/.cache/linux-system-roles/centos-9.qcow2', use --force to override
Error: Errors have occurred, therefore pcs is unable to continue
The local qemu tests use the qcow file name as the hostname - there is probably some way to provide the explicit address of the local VM
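One possibility might be to give the node an explicit, resolvable address in the inventory via the ha_cluster role's per-node variables; an untested sketch (variable names as documented in the ha_cluster role README):

all:
  hosts:
    /home/rmeggins/.cache/linux-system-roles/centos-9.qcow2:
      ha_cluster:
        node_name: localhost
        corosync_addresses:
          - 127.0.0.1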
It looks like the IP address is the only issue preventing a proper run on localhost.
Running the playbook with a command like ansible-playbook --connection=local -i localhost, cluster_set.yml (and then trying to create a shared VG) will reproduce the issue, but ansible-playbook --connection=local -i 127.0.0.1, cluster_set.yml will work as intended.
This issue is not caused by the storage role then.
With the suggested fixes, I can run the test up until here on centos-9:
TASK [linux-system-roles.storage : Manage the pools and volumes to match the specified state] ***
task path: /home/rmeggins/linux-system-roles/storage/tests/roles/linux-system-roles.storage/tasks/main-blivet.yml:73
Thursday 09 November 2023 12:58:52 -0700 (0:00:00.017) 0:03:16.483 *****
fatal: [/home/rmeggins/.cache/linux-system-roles/centos-9.qcow2]: FAILED! => {
"actions": [],
"changed": false,
"crypts": [],
"leaves": [],
"mounts": [],
"packages": [],
"pools": [],
"volumes": []
}
MSG:
failed to set up pool 'vg1': __init__() got an unexpected keyword argument 'shared'
def _create(self):
    if not self._device:
        members = self._manage_encryption(self._create_members())
        try:
            # blivet's new_vg() accepts the 'shared' keyword only in newer
            # releases; older versions fail with
            # "TypeError: __init__() got an unexpected keyword argument 'shared'"
            pool_device = self._blivet.new_vg(name=self._pool['name'],
                                              parents=members,
                                              shared=self._pool['shared'])
        except Exception as e:
            raise BlivetAnsibleError("failed to set up pool '%s': %s"
                                     % (self._pool['name'], str(e)))
what version of blivet has the support for shared? Is it in centos9 yet?
I have added the switch that skips the test if needed based on blivet version as per vtrefny https://github.com/linux-system-roles/storage/pull/388#discussion_r1377452558
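For reference, a minimal sketch of such a guard; the package name and version cutoff here are assumptions, and the PR's actual check may differ:

- name: Gather package facts
  package_facts:

- name: Skip the test when blivet is too old for shared VGs
  meta: end_host
  # '3.8.2' is an assumed cutoff; use the first blivet release with 'shared' support
  when: ansible_facts.packages['python3-blivet'][0].version is version('3.8.2', '<')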
ok - but - is there some platform that has the correct version of blivet? Alternately - if you have some copr blivet build that you are using, can you attach the log output from running the test with the right version of blivet?
I am running the test (not skipped) on Fedora 38 with the latest blivet package (python3-blivet-3.8.2-99.20231127115915812391.3.9.devel.64.gfc7f3fc5.fc38.noarch).
[citest]
[citest]
Looks like fedora 39 has the right version of blivet. When I try your latest like this:
tox -e qemu-ansible-core-2.15 -- --image-name fedora-39 --log-level debug -- tests/tests_lvm_pool_shared.yml
I get this error:
TASK [fedora.linux_system_roles.ha_cluster : Create a corosync.conf file content using pcs-0.10] ***
...
fatal: [/home/rmeggins/.cache/linux-system-roles/fedora-39.qcow2]: FAILED! => {
"changed": true,
"cmd": [
"pcs",
"cluster",
"setup",
"--corosync_conf",
"/tmp/ansible.cjhl1_x4_ha_cluster_corosync_conf",
"--overwrite",
"--no-cluster-uuid",
"--",
"rhel9-1node",
"/home/rmeggins/.cache/linux-system-roles/fedora-39.qcow2"
],
"delta": "0:00:01.327931",
"end": "2023-12-06 18:25:28.852939",
"rc": 1,
"start": "2023-12-06 18:25:27.525008"
}
STDERR:
Warning: Unable to read the known-hosts file: No such file or directory: '/var/lib/pcsd/known-hosts'
No addresses specified for host '/home/rmeggins/.cache/linux-system-roles/fedora-39.qcow2', using '/home/rmeggins/.cache/linux-system-roles/fedora-39.qcow2'
Error: Unable to resolve addresses: '/home/rmeggins/.cache/linux-system-roles/fedora-39.qcow2', use --force to override
Error: Errors have occurred, therefore pcs is unable to continue
The problem is that runqemu uses the file name of the qcow2 file as the hostname.
If I add this to the test:
- name: Set up test environment for the ha_cluster role
  include_role:
    name: fedora.linux_system_roles.ha_cluster
    tasks_from: test_setup.yml

- name: Create cluster
  ...
Then I get much farther, until here:
- name: >-
    Create a disk device; specify disks as non-list mounted on
    {{ mount_location }}
  ...
TASK [linux-system-roles.storage : Manage the pools and volumes to match the specified state] ***
...
fatal: [/home/rmeggins/.cache/linux-system-roles/fedora-39.qcow2]: FAILED! => {
...
MSG:
Failed to commit changes to disk: Process reported exit code 3: Using a shared lock type requires lvmlockd (lvm.conf use_lvmlockd.)
Run `vgcreate --help' for more information.
I guess somewhere in the blivet module or blivet library it manages lvm.conf?
I think we need to change https://github.com/linux-system-roles/ha_cluster/blob/main/tasks/test_setup.yml#L9 to make it more generally applicable.
- name: Set node name to 'localhost' for single-node clusters
  set_fact:
    inventory_hostname: localhost  # noqa: var-naming
  when: ansible_play_hosts_all | length == 1
@tomjelinek @spetrosi I think the intention of this code is - "If inventory_hostname is not resolvable (i.e. is a qcow2 path as used by tox -e qemu, or is some sort of hostaliases like sut as used by baseos ci), then use localhost as it will always be resolvable". The problem is the test "is hostname resolvable" is not easy to do, and even with getent hosts $name, you don't know if the user provided $name as some sort of alias that actually resolved to a real hostname that is incorrect. In Jan's case, he is using an external managed host (not a local qcow2 image file) which has a real, resolvable hostname and IP address that he wants to use. I think we need to introduce a flag like ha_cluster_test_use_given_hostname:
- name: Set node name to 'localhost' for single-node clusters
  set_fact:
    inventory_hostname: localhost  # noqa: var-naming
  when:
    - ansible_play_hosts_all | length == 1
    - not ha_cluster_test_use_given_hostname | d(false)
Then tox -e qemu tests, baseos ci, and downstream automated tests will work as-is, and Jan can pass -e ha_cluster_test_use_given_hostname=true or otherwise provide this parameter in his inventory when running his tests, e.g. tox -e qemu-ansible-core-2.15 -- --image-name fedora-39 --log-level debug -e ha_cluster_test_use_given_hostname=true -- tests/tests_lvm_pool_shared.yml
wdyt?
@richm You got the intention absolutely right.
Adding the proposed flag works for me. It would be nice if it could be tested (@japokorn ?) before merging it in the ha_cluster role. And a comment explaining that the flag is meant for other roles, and thus must be kept in place even though it's not used anywhere in the ha_cluster role itself, would be helpful. Feel free to open a PR after testing or let me know to do it myself.
@tomjelinek there's also an issue with lvmlockd - man lvmlockd
USAGE
Initial set up
Setting up LVM to use lvmlockd and a shared VG for the first time includes some one time set up steps:
1. choose a lock manager
dlm
If dlm (or corosync) are already being used by other cluster software, then select dlm. dlm uses corosync which requires additional configuration beyond the scope of this document. See corosync and dlm documentation for instructions on configuration, set up and usage.
how to choose the lock manager? What additional configuration is required by corosync and dlm? Seems like this is something we need to add to the ha_cluster role.
2. configure hosts to use lvmlockd
On all hosts running lvmlockd, configure lvm.conf:
use_lvmlockd = 1
@japokorn where/how is this done? seems like something the storage role/blivet should do?
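If it had to be done at the role level rather than inside blivet, a minimal sketch could look like this (purely illustrative; blivet may well manage lvm.conf itself):

# illustrative task, not the storage role's actual mechanism
- name: Enable lvmlockd in lvm.conf
  lineinfile:
    path: /etc/lvm/lvm.conf
    regexp: '^\s*#?\s*use_lvmlockd\s*='
    line: "    use_lvmlockd = 1"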
3. start lvmlockd
Start the lvmlockd daemon.
Use systemctl, a cluster resource agent, or run directly, e.g.
systemctl start lvmlockd
this seems like something the ha_cluster role should do after it installs lvm2-lockd and dlm; see the sketch after step 4 below.
4. start lock manager
...
dlm
Start the dlm and corosync daemons.
Use systemctl, a cluster resource agent, or run directly, e.g.
systemctl start corosync dlm
This also seems like something the ha_cluster role should do.
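A minimal sketch of what the ha_cluster role could do for steps 3 and 4, assuming plain systemd units are acceptable (the resource-agent alternative is discussed further below):

- name: Install shared-storage packages
  package:
    name:
      - lvm2-lockd
      - dlm
    state: present

- name: Start and enable lvmlockd, corosync and dlm
  service:
    name: "{{ item }}"
    state: started
    enabled: true
  loop:
    - lvmlockd
    - corosync
    - dlm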
5. create VG on shared devices
vgcreate --shared <vgname> <devices>
the storage role does this
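For context, the pool option added by this PR is shared (see the blivet.py snippet above); invoking the storage role might look roughly like this (pool name and disk list are placeholders):

- name: Create a shared VG via the storage role
  include_role:
    name: linux-system-roles.storage
  vars:
    storage_pools:
      - name: vg1
        type: lvm
        disks: "{{ unused_disks }}"  # placeholder disk list
        shared: true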
6. start VG on all hosts
vgchange --lock-start
Shared VGs must be started before they are used. Starting the VG performs lock manager initialization that is necessary to begin
using locks (i.e. creating and joining a lockspace). Starting the VG may take some time, and until the start completes the VG may
not be modified or activated.
@japokorn this seems like something the storage role should do?
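Were this handled as a task, the equivalent of the command above would be roughly (VG name is a placeholder):

- name: Start the shared VG's lockspace on this host
  command: vgchange --lock-start vg1  # placeholder VG name
  changed_when: true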
7. create and activate LVs
Standard lvcreate and lvchange commands are used to create and activate LVs in a shared VG.
This also seems like something the storage role should do
Normal start up and shut down
After initial set up, start up and shut down include the following steps. They can be performed directly or may be automated using
systemd or a cluster resource manager/agents.
• start lvmlockd
• start lock manager
• vgchange --lock-start
• activate LVs in shared VGs
@tomjelinek this says ". . . may be automated using systemd or a cluster resource manager/agents." - is this something that the ha_cluster role can configure the cluster resource manager/agents to do?
how to choose the lock manager?
Well, the documentation says that dlm should be used if corosync is in use. HA cluster uses corosync.
What additional configuration is required by corosync and dlm? Seems like this is something we need to add to the ha_cluster role.
I'm not aware of any configuration options in corosync related to dlm. And I'm not aware of any required dlm configuration, just run with the defaults.
"... may be automated using systemd or a cluster resource manager/agents." - is this something that the ha_cluster role can configure the cluster resource manager/agents to do?
It means: create cluster resources. So you just need to instruct the ha_cluster role to create the appropriate resources, ocf:pacemaker:controld and ocf:heartbeat:lvmlockd.
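For illustration, declaring such resources via the role could look roughly like this (resource ids and the grouping are illustrative):

ha_cluster_resource_primitives:
  - id: dlm
    agent: ocf:pacemaker:controld
  - id: lvmlockd
    agent: ocf:heartbeat:lvmlockd
ha_cluster_resource_groups:
  - id: locking
    resource_ids:
      - dlm
      - lvmlockd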
@tomjelinek afaict the test is setting the appropriate parameters/resources - https://github.com/linux-system-roles/storage/pull/388/files#diff-2892843b9952fe8a2e8f5867b7f5092369acfd8ae20990b1689a366c01b1584cR68-R82
Then maybe the reason it is working in Jan's testing is because he has a "real" hostname and a real IP address, but in the baseos ci and local qemu testing, the inventory_hostname is fake?
@richm Yes, the variables look good. I have verified that the cluster is able to start dlm and lvmlockd resources with no issues with such settings, if it uses a real node name. If the cluster is set up with the 'localhost' node, dlm times out on start. I'm not sure why that happens. I already tried debugging this back in October but I was unable to get any useful info from dlm debug logs.
Enhancement: Support for creating shared VGs
Reason: Requested by GFS2
Result: