redhat-performance / quads

:calendar: The infrastructure deployment time machine
https://quads.dev
GNU General Public License v3.0
88 stars 36 forks source link

[BUG] validation check failed is always sent prematurely #550

Closed sadsfae closed 2 weeks ago

sadsfae commented 3 weeks ago

This is a much older and minor issue but the validation check notification email is always sent.

Validation check failed for cloud61 / wfoster / 3885

A post allocation test has failed for:

   cloud: cloud61
   owner: wfoster
   ticket: 3885

The release of the environment is contingent on the success
of the validation scripts.  Before users are notified, make
sure to address the environment and ensure all validation steps
succeed to ensure a timely release to the end user.

DevOps Team

Results:
Unable to query Foreman for cloud: cloud61
Verify Foreman password is correct: ***@****

We should not be sending any notification emails from validate_environment phases while in the configurable grace window.

Fix

I think that we should run an pre-emptive quads --foreman-rbac here as in testing this makes sure the message is actually relevant (just tested this with another env)

kambiz-aghaiepour commented 2 weeks ago

Given we only run this out of cron on Monday through Saturday:

-=>>crontab -l | grep foreman
0 */3 * * 1-6 ( echo "=== Foreman RBAC == @ " $(date) ; flock -n /tmp/foremanrbac.lock -c "quads --foreman-rbac" ) 1>/dev/null 2>&1

and new schedules go out on Sundays, maybe we want to also run the cron on Sundays as well? Also, for the hour, */3 means it runs at 0, 3, 6, 9, 12, 15, 18, and 21. The schedules typically start at 22 hours, so the next check won't run until 2 hours after the schedules start, and our grace period for starting the check is 30 minutes after the schedule starts. So with the validation check running at:

*/20 * * * * ( echo "=== Validating == @ " $(date) ; flock -n /tmp/validateenv.lock -c "quads --validate-env" ) 1>>/var/log/validate-env.log 2>&1

it means the next validation run will start at 22:40 and be after the end of the grace period. This is still well before the next rbac run. Depending on the cost of running quads --foreman-rbac, I think we can fix this by setting the rbac cron to run more frequently with:

5 */2 * * * ( echo "=== Foreman RBAC == @ " $(date) ; flock -n /tmp/foremanrbac.lock -c "quads --foreman-rbac" ) 1>/dev/null 2>&1

which should have the rbac run to trigger at 22:05, well before the next validation run.

sadsfae commented 2 weeks ago

@kambiz-aghaiepour running more frequently wouldn't be a bad thing but I think we should also run a pre-emptive quads --foreman-rbac too somewhere that makes sense. Let's just go with the cron adjustments though for now.

I'd want to discuss with @grafuls on how to handle this, but likely we'd want an assignment table field in the database to ensure it's run only once and then just run it out of move_and_rebuild. Like the notification labels this can also be reset when the assignment is reclaimed.

sadsfae commented 2 weeks ago

Currently this message gets generated here in the notify_failure method: https://github.com/redhat-performance/quads/blob/latest/src/quads/tools/validate_env.py#L43 and it will always false-positive unless we're exactly spot on with cron runs (which seems like a bandaid).

If we had an one-time-per-release --foreman-rbac call in M&R before VE gets processed it will avoid getting the failure message entirely rather than increase the frequency from cron.

At that point we may be able to just completely remove the cron.

sadsfae commented 2 weeks ago

Lastly, we want a check that we're not in grace period before notify_failure gets processed anyway: https://github.com/redhat-performance/quads/blob/latest/src/quads/tools/validate_env.py#L43

e.g.

if time_delta.total_seconds() // 60 > Config["validation_grace_period"]:

https://github.com/redhat-performance/quads/blob/latest/conf/quads.yml#L154

sadsfae commented 2 weeks ago

pending patchset: https://review.gerrithub.io/c/redhat-performance/quads/+/1203774