linux-system-roles / ha_cluster

Provide automation for Cluster - High Availability management
https://linux-system-roles.github.io/ha_cluster/
MIT License

SUSE Support and crmsh #175

Closed marcelmamula closed 8 months ago

marcelmamula commented 9 months ago

Hello Team,

I am working on sap-linuxlab (community.sap_install) and our plan is to make sure that role sap_ha_pacemaker_cluster, which consumes fedora.linux_system_roles, is correctly working on SUSE systems.

I noticed that the groundwork for adopting non-pcs steps was already done thanks to Sean in #122, so adoption would consist of:

There are a few things I wanted to ask for clarification on before proceeding with any changes in the fork:

  1. Are there any ongoing efforts for SUSE support, or is this free range to pick up?
  2. ha_cluster and idempotency?
richm commented 9 months ago

Are there any ongoing efforts for SUSE support, or is this free range to pick up?

Not that I know of - @sean-freeman or @berndfinger might know.

ha_cluster and idempotency?

The role is not currently idempotent for all operations; I'm not sure which are idempotent and which are not. I don't know if you can make the crmsh implementation idempotent without also making the pcs implementation idempotent.

sean-freeman commented 9 months ago

@richm I had started porting some code in line with the mentioned Ansible Role draft (keeping to the scope of ha_cluster, which is to accept config and execute it, i.e. not adding any of the commands for a specific platform). However, there were other priorities in the SAP LinuxLab Open-Source initiative that needed attention first, so I would probably classify this as free range.

I would probably suggest that ha_cluster is pseudo-idempotent in the sense that, if you execute with the same variables, you will arrive at the same outcome. However, if some variable is changed between runs, it will alter the config file, and cibadmin will (without mercy or checks) replace any existing config with that new definition.
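
For illustration, that replacement step boils down to something like the following (a sketch, not necessarily the role's exact invocation; the file path is made up):

    # build the desired CIB offline, then replace the live CIB wholesale
    cibadmin --replace --xml-file /tmp/desired-cib.xml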

marcelmamula commented 9 months ago

Thank you both for your comments. I will proceed with the work on my end to see how this can be implemented.

tomjelinek commented 9 months ago

if you execute with the same variables, you will arrive at the same outcome. However, if some variable is changed between runs, it will alter the config file, and cibadmin will (without mercy or checks) replace any existing config with that new definition.

That's correct. When you run the role repeatedly with the same variables, you'll get the same outcome, and the role should not even restart cluster daemons, so it should not disrupt cluster operations. However, the role removes any cluster configuration not specified in the role variables. This applies to the CIB, corosync, and other cluster components. Changing some components' (corosync, sbd) configuration results in restarting those daemons.

marcelmamula commented 8 months ago

@sean-freeman @tomjelinek @berndfinger A lot of the idempotency comes from the fact that the cluster config is rebuilt using all those CIB tasks and then remains untouched if the result is the same.

How would this work in combination with sap_install if you re-run them both, @sean-freeman? sap_ha_pacemaker_cluster is where all resources and Pacemaker config are created, and I don't see ha_cluster_resource_primitives or __sap_ha_pacemaker_cluster_resource_primitives defined anywhere in the examples, so I would assume it would result in the CIB being replaced regardless.

If this destructive behavior in combination with sap_ha_pacemaker_cluster is expected, then there is not much value for us in implementing CIB steps under crmsh.

sean-freeman commented 8 months ago

@tomjelinek Thinking about my comment on revisiting this GH issue: would it be a good idea to give the end-user a choice in how ha_cluster handles config changes?

The following would warrant a separate GH Issue to be raised.

Perhaps "Feature Request: ha_cluster_config_change_then_halt" would protect an end-user in case of config drift / the variables changed by accident (for whatever reason), and the end-user performs re-run but does not want to replace the working config. By default I imagine such a variable would be false, to preserve the "psuedo-idempotence" and not create a breaking change.

In case such a feature exists already and I cannot see it, please excuse me and please correct me.

As @marcelmamula perceives, and other end-users may too, the behaviour of ha_cluster could be considered "destructive", as there is no safety net in case any variable is changed.


@marcelmamula It is best to move any continued discussion specific to SAP into the sap_install GH Issues or Discussions. I will answer here for posterity.

On its own, the Ansible role sap_ha_pacemaker_cluster is relatively benign: it does the heavy lifting on the inputs/config that should be executed, and all the execution to configure Linux Pacemaker is done by the Ansible role ha_cluster (at the moment some of it is done via the pcs PCMK shell, and the rest via commands at the level of the CIB PCMK binaries).

The Ansible role sap_ha_pacemaker_cluster really does only two things:

  1. Prepare input variables for Ansible Role ha_cluster to be executed successfully
  2. Prepare OS dependencies (i.e. OS Packages for Fence/Resource Agents and CLIs) required for SAP and the chosen target Infrastructure Platform

The 'private' var (prefix __) referenced above, __sap_ha_pacemaker_cluster_resource_primitives, is generated on the fly based on the end-user's request to configure different SAP software scenarios (SAP HANA, SAP NetWeaver ABAP ASCS/ERS, etc.) and a given target infrastructure platform (e.g. AWS, GCP, MS Azure, IBM Cloud, IBM PowerVC, etc.).
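
For context, here is a minimal sketch of the shape such a variable takes when passed to ha_cluster (abbreviated and with made-up values; see the ha_cluster README for the full schema):

    ha_cluster_resource_primitives:
      - id: my-resource
        agent: ocf:pacemaker:Dummy
        instance_attrs:
          - attrs:
              - name: fake
                value: fake-value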

So the same applies... if an end-user:

marcelmamula commented 8 months ago

Thank you @sean-freeman for the lengthy explanation, even though I managed to reach the same conclusions in the meantime.

@tomjelinek @sean-freeman I was asking about destructiveness and idempotence mainly because the whole CIB creation sequence uses pcs working on a temporary configuration XML, leaving the live cluster intact. This is not possible with SUSE crm unless you start making some extensive Jinja2 changes without running crm configure.

But the end result is most likely the same, because even crm_diff at the end of the CIB build considers all changes as changes and pushes them as a patch.

sean-freeman commented 8 months ago

@marcelmamula Can you expand on your explanation? From what I investigated (a long time ago), every pcs command has an equivalent crmsh command?

Looking in the repository, there are a limited number of pcs shell commands in use. See the crmsh equivalents below:

# PCS Shell                   # CRMSH Shell

pcs cluster                   crm node
pcs constraint <type>         crm configure <type>
pcs property                  crm configure property
pcs qdevice                   crm cluster init qdevice
pcs quorum                    corosync-quorumtool / corosync-qnetd-tool
pcs resource                  crm ra
pcs status                    crm status
pcs stonith                   crm ra

Or do you mean it is not possible for crmsh to create/export each command to XML configuration and build the whole configuration to push to the CIB?

marcelmamula commented 8 months ago

Yes @sean-freeman, my comment was mainly about pcs -f, which makes changes in an XML file separate from the live cluster CIB.
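
For readers following along, the pcs scratch-file workflow being discussed looks roughly like this (a sketch; the resource name and file path are made up):

    pcs cluster cib /tmp/cib.xml                                # dump the live CIB to a file
    pcs -f /tmp/cib.xml resource create d1 ocf:pacemaker:Dummy  # modify the file, not the cluster
    pcs cluster cib-push /tmp/cib.xml                           # push the result in one step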

sean-freeman commented 8 months ago

@marcelmamula Isn't shadow CIB usage the equivalent?

marcelmamula commented 8 months ago

@sean-freeman You are correct, a shadow CIB is possible, but it is limited to the interactive crm configure mode, because you have to make all changes before committing. I tried using it on the command line as Ansible would, but the shadow was not kept between commands.

Update: I am testing with -c and it seems to be working; I will update if it is usable.

    crm configure cib new shd
    crm -c shd configure show
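
For reference, a full non-interactive shadow workflow along those lines would look something like this (a sketch; the resource and shadow names are made up):

    crm cib new shd                                        # create a shadow copy of the live CIB
    crm -c shd configure primitive d1 ocf:pacemaker:Dummy  # edit the shadow, not the live cluster
    crm -c shd configure show                              # inspect the pending configuration
    crm cib commit shd                                     # push the shadow to the live CIB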

marcelmamula commented 8 months ago

@tomjelinek I have started going through the contribution documentation and have hit a snag with tox availability on SUSE. Would it be fine to run the linting check locally and leave the rest of the tests for git actions when the PR is submitted?

Also, ha_cluster is not using FQCN module names, so all new tasks I added contain them to avoid linting errors. Are you planning to switch to FQCNs in the future?

tomjelinek commented 8 months ago

@marcelmamula As long as your code passes CI eventually, it doesn't really matter to me whether you test locally or use git actions in a PR.

I think that not using FQCN names in a role for its own modules is a system roles convention.

@richm Could you comment on these two questions?

marcelmamula commented 8 months ago

@tomjelinek Thank you. My FQCN question was mainly about the usual ansible.builtin modules that are triggering a warning.

tomjelinek commented 8 months ago

@sean-freeman

Perhaps "Feature Request: ha_cluster_config_change_then_halt" would protect an end-user in case of config drift / the variables changed by accident (for whatever reason), and the end-user performs re-run but does not want to replace the working config. By default I imagine such a variable would be false, to preserve the "psuedo-idempotence" and not create a breaking change.

In case such a feature exists already and I cannot see it, please excuse me and please correct me.

This looks like Ansible check mode to me. The ha_cluster role supports check mode, even though some tasks cannot be performed in check mode. You also need to know a bit about how the role works: it builds the configuration in temp files and then pushes it to the cluster / nodes.
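
For example, a dry run of a playbook using the role can be requested with Ansible's standard check mode flag (the playbook name here is made up):

    ansible-playbook --check ha_cluster_setup.yml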

tomjelinek commented 8 months ago

@marcelmamula Sorry for being late to the pcs -f / crm -c discussion. Does the crm -c approach work for you?

I don't think you need to strictly follow pcs at all costs. Some parts of the role are very pcs specific, e.g. pcsd configuration. There was no demand for crmsh support when the role got started, so it wasn't designed with crmsh in mind.

marcelmamula commented 8 months ago

@marcelmamula Sorry for being late to the pcs -f / crm -c discussion. Does the crm -c approach work for you?

I don't think you need to strictly follow pcs at all costs. Some parts of the role are very pcs specific, e.g. pcsd configuration. There was no demand for crmsh support when the role got started, so it wasn't designed with crmsh in mind.

@tomjelinek I have tested multiple approaches and settled on crm -c plus crm_diff comparing the CIB against the shadow. It seems to be working well, including the rc == 1 check.

Example for a CIB group:

ansible.builtin.command:
  cmd: |
    crm -c {{ __ha_cluster_crm_shadow }} configure group
    {{ resource_group.id | quote }} {% for resource in resource_group.resource_ids %}
    {{ resource | quote }} {% endfor %}
    {% if resource_group.meta_attrs[0].attrs | default(False) %}
      meta {% for attr in resource_group.meta_attrs[0].attrs -%}
        {{ attr.name | quote }}={{ attr.value | quote }} {% endfor %}
    {% endif %}
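
The rc == 1 check mentioned above refers to crm_diff's exit code, which is non-zero when the two CIBs differ. A sketch of such a comparison (the file paths are assumed):

    # exit code 0 = no difference; 1 = the shadow contains pending changes
    crm_diff --no-version -o /var/lib/pacemaker/cib/cib.xml -n /var/lib/pacemaker/cib/shadow.shd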

There are some pcs-specific features, like tickets, sets, and others, so I will leave them as empty placeholders for now or remove them completely wherever possible.

richm commented 8 months ago

@tomjelinek I have started going through the contribution documentation and have hit a snag with tox availability on SUSE.

You can't install Python pip, then do pip install --user tox, on SUSE?

Would it be fine to run the linting check locally and leave the rest of the tests for git actions when the PR is submitted?

Yes, I suppose so, but it will be painful to debug if you find some odd issue and have to iterate repeatedly by pushing new code to a PR because you cannot reproduce the error locally.

Also, ha_cluster is not using FQCN module names, so all new tasks I added contain them to avoid linting errors. Are you planning to switch to FQCNs in the future?

In general, you can use the short name for Ansible built-in modules/plugins and the FQCN for non-built-in modules/plugins. The Ansible team have told me that short names for built-ins will be supported for the foreseeable future. If/when Ansible changes to require the use of FQCNs for everything, we will have tooling to automatically convert everything, because we will have to do that for the other 29 system roles at the same time.

richm commented 8 months ago

@tomjelinek Thank you. My FQCN question was mainly about the usual ansible.builtin modules that are triggering a warning.

That should not be happening with ansible-lint, because we suppress that warning: https://github.com/linux-system-roles/ha_cluster/blob/main/.ansible-lint#L16. What is the message you are seeing? Is it from ansible-lint?
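
For reference, the suppression in question looks roughly like this in the role's .ansible-lint (paraphrased from the linked file):

    skip_list:
      - fqcn-builtins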

marcelmamula commented 8 months ago

@richm @tomjelinek FQCN: It seems my lint config was inherited from a top-level folder, which does not have fqcn-builtins in its skip list. Do you have any issue with me keeping all new files with FQCNs?

Tox/Testing: tox can be installed fine with pip3; the issue starts with "Use yum or dnf to install standard-test-roles-inventory-qemu", since that package is in the Fedora repo only. The package seems to be a requirement to actually test using tox.

richm commented 8 months ago

@richm @tomjelinek FQCN: It seems my lint config was inherited from a top-level folder,

Not sure what you mean by that.

which does not have fqcn-builtins in its skip list. Do you have any issue with me keeping all new files with FQCNs?

Yes, you can use ansible.builtin.xxx instead of xxx. That should not cause any issues.

Tox/Testing: tox can be installed fine with pip3; the issue starts with "Use yum or dnf to install standard-test-roles-inventory-qemu", since that package is in the Fedora repo only. The package seems to be a requirement to actually test using tox.

It is a requirement for using qemu. But it is possible to install the qemu, kvm, and seabios packages separately. So if you want to be able to run

tox -e qemu-ansible-core-2.16 -- --image-file /path/to/suse.qcow2 tests/tests_crmsh.yml

then you will need to install qemu, kvm, seabios, and whatever other packages you need to run qcow2-based VMs on your development platform.

You do not need standard-test-roles-inventory-qemu in order to run tox -e collection,ansible-lint-collection.

marcelmamula commented 8 months ago

Thank you @sean-freeman @richm @tomjelinek for https://github.com/linux-system-roles/ha_cluster/pull/186. I will close this issue; further changes will be tracked separately from this original issue.