openstack-k8s-operators / edpm-ansible

External Dataplane Management Ansible Playbooks
https://openstack-k8s-operators.github.io/edpm-ansible/
Apache License 2.0
9 stars 66 forks source link

Add edpm_reboot_strategy to reboot role #646

Closed ciecierski closed 5 months ago

ciecierski commented 6 months ago

Add edpm_reboot_strategy to reboot role that defines if reboot should be started or not. There are three modes: auto, never and force. The default is auto.

Add also more generic check for existence of openstack configs for already deployed nodes, both old tripleo and new edpm node.

jistr commented 5 months ago

/lgtm

jistr commented 5 months ago

/lgtm

jistr commented 5 months ago

/approve

jistr commented 5 months ago

What is Tide waiting for here in terms of approval? Does it need to be approved by @SeanMooney specifically as the comment above suggests? :)

jistr commented 5 months ago

Ah i guess Mikolaj or i would have to be on the approvers list.

SeanMooney commented 5 months ago

no it just requries approval form anyone in the owners file https://github.com/openstack-k8s-operators/edpm-ansible/blob/main/OWNERS or OWNERS_ALIASES https://github.com/openstack-k8s-operators/edpm-ansible/blob/main/OWNERS_ALIASES

fao89 commented 5 months ago

/approve

ciecierski commented 5 months ago

If there are concerns how this change works for adoption I link pr to dataplane-adoption repo that tests the change in pre-adoption phase. https://github.com/openstack-k8s-operators/data-plane-adoption/pull/459/files

SeanMooney commented 5 months ago

So that pr is not depending on this pr so it's not testing it and it's also failing ci which is not very encouraging

I'll remove the requested change to allow others to approve but I'm still unsure that how reboot are being done are correct

jistr commented 5 months ago

first is there a generic adoption task planned to clean up tripleo config? by default we likely should be avoiding remoood for edpm nodes with nova but I'm not sure its safe to force the reboot for fpis mode by default.

I think we're stopping and removing the services, but we're not presently cleaning up the config? Anyway i think we need some way to check for "is this a node that's already been used in the deployment to host some workload or networker etc." and then block the reboot. I don't have a better idea than what Mikolaj posted. For reference here is the generic cleanup PR:

https://github.com/openstack-k8s-operators/edpm-ansible/pull/649

so this will mean we will never reboot

Isn't that the goal though? I thought the os-reboot role is to apply node config (that needs reboot to apply itself) before the first use of that node. I don't think we wanted to use os-reboot role to reboot nodes that are already being utilized (hosting workloads / networking / storage)?

SeanMooney commented 5 months ago

the os-reboot will be used for kernel updates or other changes that require a host reboot as part of day two maintenance.

it feels wrong to require force to be used in that case auto should work asuimg a prior task has marked the node as needing reboot.

the intended workflow is that you would do a separate deployment including the os-reboot service with --limit to reboot subsets of host that you have drained of workloads in batches.

in that workflow we should not need to update the ansible_vars on the nodeset to force.

if we cant support that usecase we cant do minor updates properly.

openshift-ci[bot] commented 5 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ciecierski, fao89, jistr, SeanMooney

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/openstack-k8s-operators/edpm-ansible/blob/main/OWNERS)~~ [SeanMooney,fao89] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
SeanMooney commented 5 months ago

i will file a jira bug for the fips issue

ciecierski commented 5 months ago

So that pr is not depending on this pr so it's not testing it and it's also failing ci which is not very encouraging

Yes, the pr wasn't testing it. I made a patch, but there was missing depends-on. I fixed it and new results are https://github.com/openstack-k8s-operators/data-plane-adoption/pull/459#issuecomment-2133562264 A non-ceph job failed during adoption, it passed pre-adoption validation with reboot.

SeanMooney commented 5 months ago

filed OSPRH-7301