redhat-cop / infra.leapp

Collection of Ansible roles for automating RHEL in-place upgrades using Leapp.
MIT License

[RFE] Provide a method to list known findings #123

Closed myllynen closed 3 months ago

myllynen commented 11 months ago

Leapp and infra.leapp preupgrade checks provide crucial findings for systems that are to be upgraded in place. These findings must be investigated, and a plan is needed to address them in the most appropriate manner. However, once a finding has been assessed and verified and the needed pre- and/or post-upgrade automation has been created, it would be very helpful to be able to list the finding as "known/handled" so that it is not reported for systems checked later on. This would allow concentrating fully on previously unseen findings or findings that require attention.

Consider a case where an organization has standardized on centralized logging with Splunk or endpoint monitoring with Tanium. In such a case most, if not all, systems can be expected to have the splunkforwarder and TaniumClient RPMs installed. When checking the readiness of thousands of systems for in-place upgrades, receiving the same warning about these RPMs on every system provides little, if any, additional value. The same applies to the finding about incompatibilities with grep(1): since grep is installed everywhere, having infra.leapp report this finding for each and every system is not helpful.

The infra.leapp role collection should provide a method for a user to list known/handled findings, so that once the user has addressed a finding it can be listed as known/handled and will not be reported for other systems after that. This could provide notable time savings: as time goes by, fewer and fewer findings will be reported as the list of known and handled findings grows.

Here's an example of how this was done in a simple but very efficient playbook used for Leapp preupgrade checks:

https://github.com/myllynen/rhel-ipu-automation/blob/master/preupgrade.yml

The implementation in infra.leapp could of course be different, but hopefully the above illustrates what such functionality could look like (see especially lines 50-171). Thanks.
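To make the request concrete, here is a minimal sketch of the filtering idea in Python. It is not the linked playbook's logic and not anything infra.leapp provides. It assumes that report entries are dictionaries with a `title` field and that known/handled findings are matched by title; the function name, the sample titles, and the data are all hypothetical.

```python
def filter_known_findings(entries, known_titles):
    """Return only the report entries whose title is not listed as known/handled.

    Assumption: findings are matched by title, ignoring case and
    surrounding whitespace.
    """
    known = {title.strip().lower() for title in known_titles}
    return [e for e in entries if e.get("title", "").strip().lower() not in known]


# Hypothetical sample data; the titles are illustrative, not verbatim Leapp output.
report_entries = [
    {"title": "Third-party packages are installed", "severity": "high"},
    {"title": "Grep has incompatible changes", "severity": "low"},
    {"title": "Finding that nobody has reviewed yet", "severity": "high"},
]

# Findings the team has already assessed and automated around.
known_findings = [
    "Third-party packages are installed",
    "Grep has incompatible changes",
]

for entry in filter_known_findings(report_entries, known_findings):
    print(f"{entry['severity']}: {entry['title']}")
```

With a list like this maintained over time, only findings that have not yet been reviewed would remain in the output for each newly checked system.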

swapdisk commented 11 months ago

@myllynen, thank you for this suggestion.

It certainly makes sense to automate remediations for "known" Leapp findings and then to filter out those entries from the Leapp report. It sounds like what you are proposing is to implement such a filtering capability within the infra.leapp analysis role, but I don't see how we could parameterize it sufficiently to anticipate every different requirement enterprise users might have in their complex environments.

I would recommend a couple of different options for folks wanting to do this. The first is doing exactly what you have demonstrated in the playbook you cited, that is, implementing the required filtering as code in the playbook that includes the analysis role.

Another way would be to implement the filtering logic in a reporting dashboard. For example, if the Leapp report data is pushed to Splunk, a dashboard could be created to list the pre-upgrade status of hosts in the environment. This is a dashboard I mocked up for another project I'm working on:

[Image: dashboard mock-up]

Now imagine adding "known/handled" and "requires attention" columns to the table. The logic that decides which Leapp report findings are classified as known or as requiring attention could easily be coded into the Splunk query for the dashboard. Furthermore, external data not available locally on a host could be factored in. For example, a finding might be acceptable for an instance tagged in the CMDB as a Dev environment, but not allowed for a Prod instance.
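The Dev versus Prod example can be sketched as a small classification rule. This is written in Python for illustration only and is not a Splunk query. The finding keys, the environment names, and the `ACCEPTABLE_IN` mapping are hypothetical stand-ins for whatever the CMDB and the report actually provide.

```python
# Hypothetical policy: for each finding key, the environments in which the
# finding is considered acceptable (known/handled).
ACCEPTABLE_IN = {
    "third-party-packages": {"dev", "test"},
    "grep-changes": {"dev", "test", "prod"},
}


def classify(finding_key, environment, acceptable_in=ACCEPTABLE_IN):
    """Classify a finding for a host based on the host's environment tag."""
    allowed = acceptable_in.get(finding_key, set())
    if environment.lower() in allowed:
        return "known/handled"
    # Findings without a policy entry always require attention.
    return "requires attention"


print(classify("third-party-packages", "Dev"))
print(classify("third-party-packages", "Prod"))
```

Keeping the policy in a single mapping means the same finding can be treated differently per environment without changing the classification logic.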

Kindly let me know your thoughts. Happy to discuss further if you have more specific ideas.

myllynen commented 11 months ago

Thanks for looking into this.

I fully agree that in large organizations with a huge number of systems and many teams in play a reporting dashboard would sound like the most suitable solution. It would provide the flexibility organizations need and allow all stakeholders to have a holistic view of the project.

However, not every organization has such reporting tools in place, and setting one up when upgrading a few dozen or a few hundred servers might be more of an effort than the actual in-place upgrade itself, especially if the team lacks related experience. In those cases a simple method, such as providing "known/handled" and "requires attention" lists for this role, would be a basic but very helpful way to do filtering. It would not be as flexible as a full-blown reporting dashboard, but it would allow teams to deal with Leapp findings more efficiently without forcing them to build reporting dashboards.

Also, if this role accepted variables for "known/handled" cases, standard Ansible inventory/group variables could be used to categorize findings differently for different environments, without requiring any additional configuration or data on a host.

I see these two approaches as complementary, not competing, options. Filtering by this role could be used early on when doing initial discovery in a new environment. If the findings, the number of systems, or other factors warrant it, a proper reporting dashboard could be used (and possibly set up, if one is not in place already) when starting upgrades at scale. In some cases the two could also be used in combination: the most unhelpful findings would be filtered out by this role, and then more nuanced filtering and reporting would be done using a visually pleasing dashboard.

Thanks.

swapdisk commented 11 months ago

Right, I agree about having complementary options. Like with most things, there's more than one way to get them done and it's good to have different options to choose from.

So you would want to introduce optional filtering variables for the analysis role to provide lists of "known/handled" and "requires attention" pre-upgrade actors, right? Can you spec out design details of what those would look like and the expected actions to be executed by the role when the filtering vars are defined? For starters, maybe draft up how you would document this in the README.md for the analysis role.

Implementing something sufficiently generic to be useful and reusable is probably a non-trivial chunk of work. Are you interested in taking this work on to contribute to the project?

myllynen commented 6 months ago

Unfortunately I haven't been able to find time to try out implementing this and my calendar seems to stay fully booked constantly. I still think having this as an option would definitely make sense but at this stage I don't have free cycles to implement it myself. Thanks.

swapdisk commented 4 months ago

Hi @myllynen, just wanted to check in and see where you are with this. It's been a while since we had the discussion above and there have been a couple of changes introduced to the collection that might help.

There's a new check_leapp_analysis_results variable for the upgrade role. When set to false, it tells the role to skip checking the previous analysis for inhibitors. See PR #162. This allows the approach of running analysis and automation, then checking whether there are any inhibitors not already known and handled by remediation. If there are inhibitors that the remediation automation does not know about, those need to be addressed. If not, go straight to the upgrade workflow, which would include snapshot, remediation, upgrade, etc. This way there is no need to run the analysis again, and it's safe because the Leapp upgrade will stop anyway if any inhibitors remain.
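The decision described above can be sketched roughly as follows. This is an illustration in Python, not code from the collection. It assumes that an inhibitor entry carries the marker `inhibitor` in a `flags` or `groups` list and that known inhibitors are matched by title. The field names should be verified against an actual report, and the function names and sample data are hypothetical.

```python
def is_inhibitor(entry):
    """Assumption: inhibitors are marked with 'inhibitor' in 'flags' or 'groups'."""
    markers = set(entry.get("flags", [])) | set(entry.get("groups", []))
    return "inhibitor" in markers


def unknown_inhibitors(entries, known_titles):
    """Return inhibitors that the remediation automation does not already handle."""
    known = set(known_titles)
    return [e for e in entries if is_inhibitor(e) and e.get("title") not in known]


def next_step(entries, known_titles):
    """Choose between investigating further and starting the upgrade workflow."""
    if unknown_inhibitors(entries, known_titles):
        return "investigate"
    # All inhibitors are handled by remediation; the upgrade role could then be
    # run with check_leapp_analysis_results set to false.
    return "upgrade"


# Hypothetical sample data.
entries = [
    {"title": "Handled inhibitor", "flags": ["inhibitor"]},
    {"title": "Informational finding", "flags": []},
]
print(next_step(entries, known_titles=["Handled inhibitor"]))
```

Non-inhibitor findings do not block the decision in this sketch. Only inhibitors missing from the known list send the workflow back to investigation.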

We also recently introduced a new remediate role that supports a number of included remediation playbooks that address some specific inhibitors that may be found during the analysis. Be aware that the role only supports RHEL 8-9 upgrades at this time.

Kindly let me know if this helps or if you have more ideas.

myllynen commented 3 months ago

I've now created a low-impact PR to implement support for this. Nothing changes by default, but it allows users to optionally provide a list of known inhibitors. I think this complements the other alternatives and provides additional flexibility on top of them. Thanks.