Don't fail region workflows if heavy-io fails

DomAyre commented 1 month ago

The Heavy IO workload exposes a fairly consistent, if niche, issue in Confidential ACI. We therefore want to keep running this workload for visibility, but we don't want it to turn the whole dashboard red.

To solve this, we want to run the workload but not fail the calling 'region' workflow when it fails. This lets us keep collecting data without turning the dashboard red. We also need to ensure that this doesn't make the workflow look entirely green, so the heavy-io failure stays visible.

DomAyre commented 1 month ago

Here's a manual run of the region workflow to check whether failures in heavy-io cause the whole workflow to fail: https://github.com/microsoft/confidential-aci-dashboard/actions/runs/10846966157

DomAyre commented 1 month ago

Here's a manual run of the individual workflow: https://github.com/microsoft/confidential-aci-dashboard/actions/runs/10847111652

DomAyre commented 1 month ago

This solution works because historical results are tracked on a per-job basis rather than per workflow, so the job still shows as failed even if the workflow reports success.

I have manually tested this by running ./scripts/get_workload_region_results.sh against the individual workflow run mentioned above (I modified the script to also show results that are not on main):

@DomAyre ➜ /workspaces/confidential-aci-dashboard (unblock-region-workloads) $ ./scripts/get_workload_region_results.sh heavy-io westeurope 2024-09-13T10:01:00Z
Getting results for:
  Workload: heavy-io
  Region: westeurope
  Since: 2024-09-13T10:01:00Z
Conclusion,URL,Date,Failing Step
✗ Failure,https://github.com/microsoft/confidential-aci-dashboard/actions/runs/10847111652/job/30101497003,2024-09-13T10:05:44Z,None
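
To illustrate why per-job tracking keeps the failure visible, here is a minimal Python sketch. It is not the actual script: it assumes the data comes as a dictionary shaped like the response of GitHub's "list jobs for a workflow run" REST endpoint (fields `name`, `conclusion`, `html_url`, `completed_at`, `steps`), and it assumes job names contain the workload and region. The sample payload, the job naming, and the helper names `failing_step` and `job_rows` are all hypothetical.

```python
from typing import Optional

# Invented sample data: a run that reports success overall,
# containing one job whose own conclusion is still "failure".
SAMPLE_RUN = {"conclusion": "success"}
SAMPLE_JOBS = {
    "jobs": [
        {
            "name": "heavy-io (westeurope)",  # hypothetical job naming
            "conclusion": "failure",
            "html_url": "https://example.invalid/runs/1/job/2",
            "completed_at": "2024-09-13T10:05:44Z",
            "steps": [],  # no step-level failure recorded
        },
        {
            "name": "other-workload (westeurope)",
            "conclusion": "success",
            "html_url": "https://example.invalid/runs/1/job/3",
            "completed_at": "2024-09-13T10:04:00Z",
            "steps": [{"name": "Run", "conclusion": "success"}],
        },
    ]
}


def failing_step(job: dict) -> Optional[str]:
    """Return the name of the first failed step, or None if no step failed."""
    for step in job.get("steps", []):
        if step.get("conclusion") == "failure":
            return step["name"]
    return None


def job_rows(jobs_response: dict, workload: str, region: str) -> list:
    """Collect (conclusion, url, date, failing step) for matching jobs."""
    rows = []
    for job in jobs_response["jobs"]:
        # Assumption: workload and region both appear in the job name.
        if workload in job["name"] and region in job["name"]:
            rows.append(
                (
                    job["conclusion"],
                    job["html_url"],
                    job["completed_at"],
                    failing_step(job),
                )
            )
    return rows


if __name__ == "__main__":
    print("Workflow conclusion:", SAMPLE_RUN["conclusion"])
    for row in job_rows(SAMPLE_JOBS, "heavy-io", "westeurope"):
        print(row)
```

Reading results from the job list rather than from the run's overall conclusion is what lets the dashboard stay green at the workflow level while the heavy-io history still records the failure.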

Don't fail region workflows if heavy-io fails #64