redhat-openstack / infrared

Plugin based framework that aims to provide an easy-to-use CLI for Ansible based projects
https://infrared.readthedocs.io/en/latest/index.html
Apache License 2.0
99 stars 96 forks source link

Suppress needless introspection re-attempts #413

Open juliakreger opened 2 years ago

juliakreger commented 2 years ago

Introspection has undergone a number of improvements since the early days, including better alignment with Ironic itself, and ultimately fixes in underlying virtual machine configuration for when this tooling is used with VMs to help prevent some issues which do crop up when using VMs to pretend to be baremetal, specifically the enforcement of Spanning Tree.

When this legacy retry code gets invoked, a legitimate error has occured and needs to be investigated. We don't need this code thrasing the environment attempting to force the node through the process, as that just complicates troubleshooting.

This change adds inline notes, and explicitly excludes the retry if the version of OSP is >=16. If nodes have failed, the playbook now also fails the playbook run to enable the environment to be investigated as it failed as opposed to an artificially changed state.

fhubik commented 2 years ago

Right, we are aware of problematic introspection playbook which is being brought up from historical OSP7 times, where it compensated for buggy ironic state at that time and without the retries one basically could not pass introspection stage.

Otoh as introspection stage is critical stage common to all OSP deployments, past or current, afaik we can not affort such change introducing a risk to all CI deployments simultaneously. We plan to wait with such changes (also this one specifically, meaning introspection playbook refactoring) until IR gets osp-branched (which is WIP currently). Then we can finally draw a line between "legacy IR" and current IR state and start fixing all the techdebts in the code (without putting historical parts of CI at risk).

juliakreger commented 2 years ago

IMHO, it is less of a critical stage component than most think. Kind of now only required in un-shipped OSP code for unrelated to deployment activities and for highly specific cases/data/tests (which have paths to navigate if data is not offered/discovered) in already shipped versions. Plus if we operate with the philosophy of fail fast, we will be able to better understand what is occurring leading to less wasted time when the jobs are reconfigured.

I think it is fine to wait, as long as we're aware and pay down this technical debt before we iterate again because it will save us time with each further iteration.