kata-containers / kata-containers

Kata Containers is an open source project and community working to build a standard implementation of lightweight Virtual Machines (VMs) that feel and perform like containers, but provide the workload isolation and security advantages of VMs. https://katacontainers.io/
Apache License 2.0

"Tailorbird" initiative: making CI your friend #9506

Open wainersm opened 2 months ago

wainersm commented 2 months ago

Context

At the Virtual Kata Containers PTG of April 2024 there was a discussion session, led by @jodh-intel, on the problems that Kata developers currently face with CI. Please see the topics and notes of that session at https://etherpad.opendev.org/p/kata-ptg-planning-april-2024#L160 . We ended the session with a list of volunteers (myself, @ldoktor , @stevenhorsman , @gkurz , @littlejawa) to build a "task force" aiming to improve the CI situation as much as possible.

We want CI to be your friend!

Work items

Find them on the dashboard: https://github.com/orgs/kata-containers/projects/46/views/1

Old table:

| Item | Owner | Issues | Status |
|---|---|---|---|
| "Know your enemy" - gather data out of CI to identify unstable jobs | @wainersm | | |
| Establish and document policies (e.g. when to promote/demote jobs to "required") | | | |
| Optimize execution of jobs (e.g. triggers by files touched) | @ldoktor | | |
| Foster the fix of current broken jobs / make nightly CI green | | | |

Done criteria

When will we be done with this initiative?

Syncing up

TBD - Every X days meeting? or Slack?

Volunteers

We need help, and everybody is welcome! Please add your name:

@ldoktor , @stevenhorsman , @gkurz , @littlejawa, @sprt

Additional information

Common Tailorbird is a mostly green bird which has a stable population

sprt commented 2 months ago

Added myself to the list of volunteers!

ldoktor commented 2 months ago

I already started working on "Optimize execution of jobs (e.g. triggers by files touched)", feel free to put me there as the owner.

wainersm commented 1 month ago

I made a script to gather data, basic stuff out of the nightly jobs actually...

import requests
import sys
import os
import re

workflow_id='ci-nightly.yaml'
list_workflow_runs_url='https://api.github.com/repos/kata-containers/kata-containers/actions/workflows/' + workflow_id + '/runs'
headers = {"Accept": "application/vnd.github+json" ,"X-GitHub-Api-Version": "2022-11-28"}

token = os.getenv("GITHUB_TOKEN")
if token is not None:
    headers['Authorization'] = "Bearer " + token

# Get the 10 most recent workflow runs.
# TODO: parameterize it!
#
r = requests.get("%s?per_page=10" %(list_workflow_runs_url), headers=headers)
r.raise_for_status()

page_size=100
runs_map=[]

for run in r.json()['workflow_runs']:
    entry = {'id': run['id'],
             'created_at': run['created_at'],
             'conclusion': None,
             'jobs': []}

    jobs_map={}
    if run['status'] == "in_progress":
        runs_map.append(entry)
        continue
    else:
        entry['conclusion'] = run['conclusion']

    # Let's paginate as jobs can span in several pages.
    total_count = -1
    page=1
    while True:
        jobs_request = requests.get("%s?per_page=%s&page=%s" % (run['jobs_url'], page_size,page), headers=headers)
        jobs_request.raise_for_status()

        for job in jobs_request.json()['jobs']:
            entry['jobs'].append({'name': job['name'], 'run_id': job['run_id'],
                                    'conclusion': job['conclusion']})

        total_count = max(total_count, jobs_request.json()['total_count'])
        if len(entry['jobs']) >= total_count:
            break
        page += 1

    runs_map.append(entry)

def collect_jobs_stats(workflows_runs):
    '''
    Return a map of {'runs': NUMBER, 'fails': NUMBER} indexed by job name
    '''
    stats = {}
    for run in workflows_runs:
        for job in run['jobs']:
            job_stat = stats.get(job['name'], {'runs': 0, 'fails': 0})
            job_stat['runs']+=1
            if job['conclusion'] != 'success':
                job_stat['fails']+=1
                stats[job['name']] = job_stat
    return stats

jobs_stats = collect_jobs_stats(runs_map)

regex = re.compile('kata-containers-ci-on-push / run-.*-tests.*')
for name, stat in jobs_stats.items():
    if regex.match(name):
        print('%s: (%s) fail=%s' % (name, stat['runs'], stat['fails']))

@ldoktor @beraldoleal ^^^^ in case you have free cycles to help with bugs and improving it.

I just ran it; see the results below. Notice that sometimes a parent job fails but the failure is not charged to its child jobs. This is something that could be improved in the script, although the data is not hard to interpret by eye.

kata-containers-ci-on-push / run-cri-containerd-tests-ppc64le: (1) fail=1
kata-containers-ci-on-push / run-k8s-tests-on-ppc64le: (1) fail=1
kata-containers-ci-on-push / run-k8s-tests-on-zvsi / run-k8s-tests (qemu, nydus, k3s): (9) fail=9
kata-containers-ci-on-push / run-basic-amd64-tests / run-tracing (clh): (9) fail=7
kata-containers-ci-on-push / run-basic-amd64-tests / run-tracing (qemu): (9) fail=8
kata-containers-ci-on-push / run-basic-amd64-tests / run-vfio (clh): (9) fail=5
kata-containers-ci-on-push / run-basic-amd64-tests / run-nerdctl-tests (cloud-hypervisor): (9) fail=2
kata-containers-ci-on-push / run-kata-coco-tests / run-k8s-tests-coco-nontee (qemu-coco-dev, nydus, guest-pull): (9) fail=9
kata-containers-ci-on-push / run-kata-coco-tests / run-k8s-tests-on-sev (qemu-sev, nydus, guest-pull): (9) fail=9
kata-containers-ci-on-push / run-k8s-tests-on-aks / run-k8s-tests (ubuntu, dragonball, small): (9) fail=1
kata-containers-ci-on-push / run-kata-coco-tests / run-k8s-tests-on-tdx (qemu-tdx, nydus, guest-pull): (9) fail=9
kata-containers-ci-on-push / run-kata-coco-tests / run-k8s-tests-sev-snp (qemu-snp, nydus, guest-pull): (9) fail=9
kata-containers-ci-on-push / run-k8s-tests-on-aks / run-k8s-tests (cbl-mariner, clh, small, oci-distribution): (9) fail=1
kata-containers-ci-on-push / run-kata-deploy-tests-on-garm / run-kata-deploy-tests (clh, rke2): (9) fail=3
kata-containers-ci-on-push / run-kata-deploy-tests-on-garm / run-kata-deploy-tests (qemu, k0s): (9) fail=3
kata-containers-ci-on-push / run-kata-deploy-tests-on-garm / run-kata-deploy-tests (qemu, k3s): (9) fail=1
kata-containers-ci-on-push / run-metrics-tests / run-metrics (qemu): (9) fail=1
kata-containers-ci-on-push / run-basic-amd64-tests / run-nerdctl-tests (clh): (8) fail=2
kata-containers-ci-on-push / run-kata-monitor-tests / run-monitor (qemu, containerd): (8) fail=7
kata-containers-ci-on-push / run-kata-deploy-tests-on-garm / run-kata-deploy-tests (clh, k0s): (8) fail=2
kata-containers-ci-on-push / run-kata-deploy-tests-on-garm / run-kata-deploy-tests (clh, k3s): (8) fail=2
kata-containers-ci-on-push / run-k8s-tests-on-aks / run-k8s-tests (ubuntu, cloud-hypervisor, normal): (8) fail=1
kata-containers-ci-on-push / run-cri-containerd-tests-s390x: (1) fail=1
kata-containers-ci-on-push / run-k8s-tests-on-zvsi: (1) fail=1
kata-containers-ci-on-push / run-k8s-tests-on-aks / run-k8s-tests (ubuntu, qemu, normal): (7) fail=1
kata-containers-ci-on-push / run-k8s-tests-on-ppc64le / run-k8s-tests (qemu, kubeadm): (7) fail=5
kata-containers-ci-on-push / run-kata-monitor-tests / run-monitor (qemu, crio): (6) fail=1
kata-containers-ci-on-push / run-basic-amd64-tests / run-vfio (qemu): (4) fail=2
kata-containers-ci-on-push / run-kata-monitor-tests: (1) fail=1
kata-containers-ci-on-push / run-metrics-tests: (1) fail=1
kata-containers-ci-on-push / run-basic-amd64-tests: (1) fail=1
kata-containers-ci-on-push / run-kata-deploy-tests-on-garm: (1) fail=1
kata-containers-ci-on-push / run-k8s-tests-on-garm: (1) fail=1
kata-containers-ci-on-push / run-k8s-tests-on-aks: (1) fail=1
kata-containers-ci-on-push / run-k8s-tests-with-crio-on-garm: (1) fail=1
kata-containers-ci-on-push / run-kata-coco-tests: (1) fail=1
kata-containers-ci-on-push / run-kata-deploy-tests-on-aks: (1) fail=1
kata-containers-ci-on-push / run-basic-amd64-tests / run-docker-tests (clh): (2) fail=1
kata-containers-ci-on-push / run-basic-amd64-tests / run-runk: (1) fail=1
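One way to make the parent/child mismatch easier to spot is to group the statistics by parent job. The following is only a sketch, not part of the script above: it assumes that job names follow the `workflow / parent / child` pattern seen in the report, and the helper names `parent_of`, `group_by_parent` and `parents_failing_alone` are made up for illustration.

```python
SEP = " / "

def parent_of(job_name):
    """Return the 'workflow / parent' prefix of a job name."""
    return SEP.join(job_name.split(SEP)[:2])

def group_by_parent(stats):
    """Group a {job name: {'runs': N, 'fails': N}} map by parent job."""
    grouped = {}
    for name, stat in stats.items():
        parent = parent_of(name)
        group = grouped.setdefault(parent, {"parent": None, "children": {}})
        if name == parent:
            group["parent"] = stat
        else:
            group["children"][name] = stat
    return grouped

def parents_failing_alone(grouped):
    """List parents that recorded failures while none of their children did."""
    result = []
    for parent, group in grouped.items():
        own = group["parent"]
        children_failed = any(c["fails"] > 0 for c in group["children"].values())
        if own is not None and own["fails"] > 0 and not children_failed:
            result.append(parent)
    return sorted(result)

# Two parents taken from the report above: one has a failing child listed,
# the other has no child entries at all.
sample = {
    "kata-containers-ci-on-push / run-metrics-tests": {"runs": 1, "fails": 1},
    "kata-containers-ci-on-push / run-metrics-tests / run-metrics (qemu)": {"runs": 9, "fails": 1},
    "kata-containers-ci-on-push / run-k8s-tests-on-garm": {"runs": 1, "fails": 1},
}
for parent in parents_failing_alone(group_by_parent(sample)):
    print("parent failed without a failing child:", parent)
```

Parents flagged this way are the ones worth opening in the Actions UI, since the per-job numbers alone do not explain their failure.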

stevenhorsman commented 4 weeks ago

@wainersm - optional suggestion: I wonder whether the gh CLI might be worth considering, to avoid things like the pagination? For example:

gh -R kata-containers/kata-containers run list --workflow ci-nightly.yaml -s completed -L 10 --json attempt,conclusion,createdAt,databaseId,name,number

to get the specified fields for the last 10 completed runs, and something like:

gh -R kata-containers/kata-containers run view <workflow id> --json databaseId,name,status,jobs

to get the list of jobs and the status for that workflow id?
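If the gh CLI route is taken, the Python side could shrink to something like the sketch below. It assumes `gh` is installed and authenticated; the function names are hypothetical, and `--repo` is placed after the subcommand here. The command building and the JSON parsing are kept separate so the parsing can be exercised without network access.

```python
import json
import subprocess

REPO = "kata-containers/kata-containers"
FIELDS = "attempt,conclusion,createdAt,databaseId,name,number"

def build_run_list_cmd(workflow="ci-nightly.yaml", limit=10):
    """Build the `gh run list` command line for the last completed runs."""
    return ["gh", "run", "list", "--repo", REPO, "--workflow", workflow,
            "-s", "completed", "-L", str(limit), "--json", FIELDS]

def parse_runs(raw_json):
    """Map each run's databaseId to its conclusion."""
    return {run["databaseId"]: run["conclusion"] for run in json.loads(raw_json)}

def list_completed_runs(workflow="ci-nightly.yaml", limit=10):
    """Call gh and return {databaseId: conclusion}; requires gh and network access."""
    result = subprocess.run(build_run_list_cmd(workflow, limit),
                            check=True, capture_output=True, text=True)
    return parse_runs(result.stdout)
```

Calling `list_completed_runs()` would then replace the first `requests.get` in the script; the per-run job list could be fetched the same way with `gh run view`.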

beraldoleal commented 3 weeks ago

There is a gh plugin that does that for us, iirc by default goes over the last 100 jobs:

$ gh workflow-stats -o kata-containers -r kata-containers -f ci-on-push.yaml jobs

I pasted the output in our slack channel, but pasting here too for visibility:

🏃 Total runs: 115
  ✔ Success: 0 (0.0%)
  ✖ Failure: 28 (24.3%)
  🤔 Others: 87 (75.7%)

📈 Top 3 jobs with the highest failure counts (failure jobs / total runs)
  kata-containers-ci-on-push / run-basic-amd64-tests / run-tracing (qemu): 33/43
    └──Run tracing tests: 33/43

  kata-containers-ci-on-push / run-kata-monitor-tests / run-monitor (qemu, containerd): 30/40
    └──Run kata-monitor tests: 30/40

  kata-containers-ci-on-push / run-basic-amd64-tests / run-tracing (clh): 22/43
    └──Run tracing tests: 22/43

📊 Top 3 jobs with the longest execution average duration
  kata-containers-ci-on-push / run-k8s-tests-on-aks / run-k8s-tests (ubuntu, clh, small): 1554.00s
  kata-containers-ci-on-push / build-kata-static-tarball-s390x / build-asset (rootfs-image-confidential): 1324.60s
  kata-containers-ci-on-push / build-kata-static-tarball-s390x / build-asset (rootfs-initrd-confidential): 1227.36s

stevenhorsman commented 3 weeks ago

@beraldoleal - that's a cool plugin. I played a bit with the options and came up with:

gh -r kata-containers -o kata-containers workflow-stats -f ci-nightly.yaml jobs -n 25 -c ">$(date -d "30 days ago" +%Y-%m-%d)"

to show the 25 highest failure counts in the last 30 days, which I guess is getting closer to Wainer's aims. We could use this in combination with some JSON processing and the gh run list command to keep only the jobs with a failure percentage above 50% and get specifically the last 10 runs, but I think a time-based approach is equally valid.
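The "failure percentage above 50%" filter can be sketched independently of where the numbers come from. The example below does not use the plugin's JSON output (its schema is not shown in this thread); instead it parses report lines in the `name: (runs) fail=N` format printed by Wainer's script. The names `parse_report` and `failing_above` are illustrative only.

```python
import re

# Matches "name: (runs) fail=N" with an optional trailing " skips=N".
LINE_RE = re.compile(
    r"^(?P<name>.+): \((?P<runs>\d+)\) fail=(?P<fails>\d+)(?: skips=(?P<skips>\d+))?$")

def parse_report(text):
    """Turn report lines into {job name: {'runs', 'fails', 'skips'}}."""
    stats = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # ignore anything that is not a report line
        stats[match["name"]] = {
            "runs": int(match["runs"]),
            "fails": int(match["fails"]),
            "skips": int(match["skips"] or 0),
        }
    return stats

def failing_above(stats, threshold=0.5):
    """Return (name, failure rate) pairs above the threshold, worst first."""
    rates = []
    for name, stat in stats.items():
        executed = stat["runs"] - stat["skips"]  # skipped runs did not execute
        if executed <= 0:
            continue
        rate = stat["fails"] / executed
        if rate > threshold:
            rates.append((name, rate))
    return sorted(rates, key=lambda item: item[1], reverse=True)

SAMPLE = """\
kata-containers-ci-on-push / run-basic-amd64-tests / run-tracing (clh): (9) fail=7
kata-containers-ci-on-push / run-basic-amd64-tests / run-vfio (clh): (9) fail=5
kata-containers-ci-on-push / run-basic-amd64-tests / run-nerdctl-tests (cloud-hypervisor): (9) fail=2
"""

for name, rate in failing_above(parse_report(SAMPLE)):
    print("%s: %.0f%% failed" % (name, rate * 100))
```

Excluding skipped runs from the denominator is a design choice made for this sketch; dividing by the total run count instead would be equally easy.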

ldoktor commented 3 weeks ago

Very nice, @beraldoleal, with the --json option it should be quite useful (either directly with jq or in Python). It even includes the individual steps...

wainersm commented 3 weeks ago

Hey @stevenhorsman @beraldoleal @ldoktor, thanks for the feedback on the script. The workflow-stats plug-in is indeed very cool! I also played with it a little bit yesterday and it simplifies the Python script a lot (I should send a v2 soon).

One thing that intrigued me, though: I asked the tool to generate statistics for the last 10 days, but the "run count" of most jobs was "16" when I was expecting "~10" (more or less 10 because someone might have triggered the workflows manually).

wainersm commented 2 weeks ago

Hi folks,

Generated the report today again, considering the last 10 executions:

kata-containers-ci-on-push / run-cri-containerd-tests-ppc64le / run-cri-containerd (active, qemu): (10) fail=4 skips=0
kata-containers-ci-on-push / run-metrics-tests / Kata Setup: (10) fail=0 skips=0
kata-containers-ci-on-push / run-basic-amd64-tests / run-cri-containerd (lts, clh): (10) fail=0 skips=0
kata-containers-ci-on-push / run-basic-amd64-tests / run-cri-containerd (lts, dragonball): (10) fail=0 skips=0
kata-containers-ci-on-push / run-basic-amd64-tests / run-containerd-stability (lts, clh): (10) fail=0 skips=0
kata-containers-ci-on-push / run-basic-amd64-tests / run-cri-containerd (lts, qemu): (10) fail=1 skips=0
kata-containers-ci-on-push / run-basic-amd64-tests / run-containerd-stability (lts, cloud-hypervisor): (10) fail=0 skips=0
kata-containers-ci-on-push / run-basic-amd64-tests / run-cri-containerd (lts, stratovirt): (10) fail=0 skips=0
kata-containers-ci-on-push / run-basic-amd64-tests / run-cri-containerd (lts, cloud-hypervisor): (10) fail=0 skips=0
kata-containers-ci-on-push / run-basic-amd64-tests / run-containerd-stability (lts, dragonball): (10) fail=0 skips=0
kata-containers-ci-on-push / run-basic-amd64-tests / run-cri-containerd (lts, qemu-runtime-rs): (10) fail=0 skips=0
kata-containers-ci-on-push / run-basic-amd64-tests / run-runk: (10) fail=0 skips=0
kata-containers-ci-on-push / run-basic-amd64-tests / run-containerd-stability (lts, qemu): (10) fail=0 skips=0
kata-containers-ci-on-push / run-basic-amd64-tests / run-nydus (lts, clh): (10) fail=0 skips=0
kata-containers-ci-on-push / run-basic-amd64-tests / run-cri-containerd (active, clh): (10) fail=0 skips=0
kata-containers-ci-on-push / run-basic-amd64-tests / run-containerd-stability (lts, stratovirt): (10) fail=0 skips=0
kata-containers-ci-on-push / run-basic-amd64-tests / run-cri-containerd (active, dragonball): (10) fail=0 skips=0
kata-containers-ci-on-push / run-basic-amd64-tests / run-nydus (lts, qemu): (10) fail=0 skips=0
kata-containers-ci-on-push / run-basic-amd64-tests / run-containerd-stability (active, clh): (10) fail=0 skips=0
kata-containers-ci-on-push / run-basic-amd64-tests / run-cri-containerd (active, qemu): (10) fail=2 skips=0
kata-containers-ci-on-push / run-basic-amd64-tests / run-nydus (lts, dragonball): (10) fail=0 skips=0
kata-containers-ci-on-push / run-basic-amd64-tests / run-containerd-stability (active, cloud-hypervisor): (10) fail=0 skips=0
kata-containers-ci-on-push / run-basic-amd64-tests / run-nydus (lts, stratovirt): (10) fail=0 skips=0
kata-containers-ci-on-push / run-basic-amd64-tests / run-containerd-stability (active, dragonball): (10) fail=0 skips=0
kata-containers-ci-on-push / run-basic-amd64-tests / run-cri-containerd (active, stratovirt): (10) fail=0 skips=0
kata-containers-ci-on-push / run-basic-amd64-tests / run-nydus (active, clh): (10) fail=0 skips=0
kata-containers-ci-on-push / run-basic-amd64-tests / run-containerd-stability (active, qemu): (10) fail=0 skips=0
kata-containers-ci-on-push / run-basic-amd64-tests / run-cri-containerd (active, cloud-hypervisor): (10) fail=0 skips=0
kata-containers-ci-on-push / run-basic-amd64-tests / run-nydus (active, qemu): (10) fail=0 skips=0
kata-containers-ci-on-push / run-basic-amd64-tests / run-containerd-stability (active, stratovirt): (10) fail=0 skips=0
kata-containers-ci-on-push / run-basic-amd64-tests / run-cri-containerd (active, qemu-runtime-rs): (10) fail=0 skips=0
kata-containers-ci-on-push / run-basic-amd64-tests / run-nydus (active, dragonball): (10) fail=0 skips=0
kata-containers-ci-on-push / run-basic-amd64-tests / run-tracing: (10) fail=0 skips=10
kata-containers-ci-on-push / run-basic-amd64-tests / run-nydus (active, stratovirt): (10) fail=0 skips=0
kata-containers-ci-on-push / run-basic-amd64-tests / run-vfio (qemu): (10) fail=7 skips=0
kata-containers-ci-on-push / run-basic-amd64-tests / run-docker-tests (clh): (10) fail=1 skips=0
kata-containers-ci-on-push / run-basic-amd64-tests / run-docker-tests (qemu): (10) fail=0 skips=0
kata-containers-ci-on-push / run-kata-monitor-tests / run-monitor (qemu, crio): (10) fail=0 skips=0
kata-containers-ci-on-push / run-basic-amd64-tests / run-nerdctl-tests (clh): (10) fail=6 skips=0
kata-containers-ci-on-push / run-kata-monitor-tests / run-monitor (containerd, lts): (10) fail=10 skips=0
kata-containers-ci-on-push / run-basic-amd64-tests / run-nerdctl-tests (dragonball): (10) fail=0 skips=0
kata-containers-ci-on-push / run-basic-amd64-tests / run-nerdctl-tests (qemu): (10) fail=0 skips=0
kata-containers-ci-on-push / run-basic-amd64-tests / run-nerdctl-tests (cloud-hypervisor): (10) fail=4 skips=0
kata-containers-ci-on-push / run-cri-containerd-tests-s390x / run-cri-containerd (active, qemu): (8) fail=0 skips=0
kata-containers-ci-on-push / run-metrics-tests / run-metrics (clh): (10) fail=0 skips=0
kata-containers-ci-on-push / run-cri-containerd-tests-s390x / run-cri-containerd (active, qemu-runtime-rs): (8) fail=0 skips=0
kata-containers-ci-on-push / run-metrics-tests / run-metrics (qemu): (10) fail=0 skips=0
kata-containers-ci-on-push / run-metrics-tests / run-metrics (stratovirt): (10) fail=0 skips=0
kata-containers-ci-on-push / run-k8s-tests-on-ppc64le / run-k8s-tests (qemu, kubeadm): (10) fail=9 skips=0
kata-containers-ci-on-push / run-kata-deploy-tests-on-aks / run-kata-deploy-tests (ubuntu, clh): (10) fail=0 skips=0
kata-containers-ci-on-push / run-kata-deploy-tests-on-garm / run-kata-deploy-tests (clh, k0s): (10) fail=5 skips=0
kata-containers-ci-on-push / run-kata-deploy-tests-on-aks / run-kata-deploy-tests (ubuntu, dragonball): (10) fail=2 skips=0
kata-containers-ci-on-push / run-kata-deploy-tests-on-garm / run-kata-deploy-tests (clh, k3s): (10) fail=1 skips=0
kata-containers-ci-on-push / run-kata-deploy-tests-on-aks / run-kata-deploy-tests (ubuntu, qemu): (10) fail=0 skips=0
kata-containers-ci-on-push / run-kata-deploy-tests-on-garm / run-kata-deploy-tests (clh, rke2): (10) fail=6 skips=0
kata-containers-ci-on-push / run-kata-deploy-tests-on-aks / run-kata-deploy-tests (ubuntu, qemu-runtime-rs): (3) fail=3 skips=0
kata-containers-ci-on-push / run-kata-deploy-tests-on-garm / run-kata-deploy-tests (qemu, k0s): (10) fail=5 skips=0
kata-containers-ci-on-push / run-kata-deploy-tests-on-aks / run-kata-deploy-tests (cbl-mariner, clh): (10) fail=1 skips=0
kata-containers-ci-on-push / run-kata-deploy-tests-on-garm / run-kata-deploy-tests (qemu, k3s): (10) fail=1 skips=0
kata-containers-ci-on-push / run-k8s-tests-on-aks / run-k8s-tests (ubuntu, clh, small): (10) fail=0 skips=0
kata-containers-ci-on-push / run-kata-deploy-tests-on-garm / run-kata-deploy-tests (qemu, rke2): (10) fail=2 skips=0
kata-containers-ci-on-push / run-k8s-tests-on-aks / run-k8s-tests (ubuntu, clh, normal): (10) fail=0 skips=0
kata-containers-ci-on-push / run-k8s-tests-on-aks / run-k8s-tests (ubuntu, dragonball, small): (10) fail=0 skips=0
kata-containers-ci-on-push / run-k8s-tests-on-aks / run-k8s-tests (ubuntu, dragonball, normal): (10) fail=0 skips=0
kata-containers-ci-on-push / run-k8s-tests-on-aks / run-k8s-tests (ubuntu, qemu, small): (10) fail=0 skips=0
kata-containers-ci-on-push / run-k8s-tests-on-aks / run-k8s-tests (ubuntu, qemu, normal): (10) fail=0 skips=0
kata-containers-ci-on-push / run-k8s-tests-on-aks / run-k8s-tests (ubuntu, stratovirt, small): (10) fail=0 skips=0
kata-containers-ci-on-push / run-k8s-tests-on-aks / run-k8s-tests (ubuntu, stratovirt, normal): (10) fail=0 skips=0
kata-containers-ci-on-push / run-k8s-tests-on-aks / run-k8s-tests (ubuntu, cloud-hypervisor, small): (10) fail=0 skips=0
kata-containers-ci-on-push / run-k8s-tests-on-garm / run-k8s-tests (clh, devmapper, k3s, garm-ubuntu-2004): (10) fail=0 skips=0
kata-containers-ci-on-push / run-k8s-tests-with-crio-on-garm / run-k8s-tests (qemu, k0s, garm-ubuntu-2204): (10) fail=0 skips=0
kata-containers-ci-on-push / run-k8s-tests-on-aks / run-k8s-tests (ubuntu, cloud-hypervisor, normal): (10) fail=0 skips=0
kata-containers-ci-on-push / run-k8s-tests-on-garm / run-k8s-tests (clh, devmapper, k3s, garm-ubuntu-2004-smaller): (10) fail=0 skips=0
kata-containers-ci-on-push / run-k8s-tests-with-crio-on-garm / run-k8s-tests (qemu, k0s, garm-ubuntu-2204-smaller): (10) fail=0 skips=0
kata-containers-ci-on-push / run-k8s-tests-on-aks / run-k8s-tests (cbl-mariner, clh, small, oci-distribution): (10) fail=7 skips=0
kata-containers-ci-on-push / run-k8s-tests-on-garm / run-k8s-tests (dragonball, devmapper, k3s, garm-ubuntu-2004): (10) fail=0 skips=0
kata-containers-ci-on-push / run-k8s-tests-on-aks / run-k8s-tests (cbl-mariner, clh, small, containerd): (10) fail=7 skips=0
kata-containers-ci-on-push / run-k8s-tests-on-garm / run-k8s-tests (dragonball, devmapper, k3s, garm-ubuntu-2004-smaller): (10) fail=0 skips=0
kata-containers-ci-on-push / run-k8s-tests-on-aks / run-k8s-tests (cbl-mariner, clh, normal): (10) fail=0 skips=0
kata-containers-ci-on-push / run-kata-coco-tests / run-k8s-tests-on-sev (qemu-sev, nydus, guest-pull): (10) fail=1 skips=0
kata-containers-ci-on-push / run-k8s-tests-on-garm / run-k8s-tests (fc, devmapper, k3s, garm-ubuntu-2004): (10) fail=0 skips=0
kata-containers-ci-on-push / run-kata-coco-tests / run-k8s-tests-on-tdx (qemu-tdx, nydus, guest-pull): (10) fail=6 skips=0
kata-containers-ci-on-push / run-k8s-tests-on-garm / run-k8s-tests (fc, devmapper, k3s, garm-ubuntu-2004-smaller): (10) fail=1 skips=0
kata-containers-ci-on-push / run-k8s-tests-on-garm / run-k8s-tests (qemu, devmapper, k3s, garm-ubuntu-2004): (10) fail=1 skips=0
kata-containers-ci-on-push / run-k8s-tests-on-garm / run-k8s-tests (qemu, devmapper, k3s, garm-ubuntu-2004-smaller): (10) fail=0 skips=0
kata-containers-ci-on-push / run-kata-coco-tests / run-k8s-tests-coco-nontee (qemu-coco-dev, nydus, guest-pull): (10) fail=0 skips=0
kata-containers-ci-on-push / run-kata-coco-tests / run-k8s-tests-sev-snp (qemu-snp, nydus, guest-pull): (10) fail=0 skips=0
kata-containers-ci-on-push / run-k8s-tests-on-garm / run-k8s-tests (cloud-hypervisor, devmapper, k3s, garm-ubuntu-2004): (10) fail=0 skips=0
kata-containers-ci-on-push / run-k8s-tests-on-garm / run-k8s-tests (cloud-hypervisor, devmapper, k3s, garm-ubuntu-2004-smaller): (10) fail=0 skips=0
kata-containers-ci-on-push / run-k8s-tests-on-zvsi / run-k8s-tests (devmapper, k3s): (8) fail=0 skips=0
kata-containers-ci-on-push / run-k8s-tests-on-zvsi / run-k8s-tests (nydus, k3s): (8) fail=3 skips=0
kata-containers-ci-on-push / run-cri-containerd-tests-s390x: (2) fail=0 skips=2
kata-containers-ci-on-push / run-k8s-tests-on-zvsi: (2) fail=0 skips=2

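The above report is more accurate because I fixed two problems in my script: skipped jobs were being counted as failures, and statistics were only being recorded for jobs that had failed.

Note that it is counting 'cancelled' as 'failed'. I might change that in the future.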

The new version:

import requests
import sys
import os
import re

workflow_id='ci-nightly.yaml'
list_workflow_runs_url='https://api.github.com/repos/kata-containers/kata-containers/actions/workflows/' + workflow_id + '/runs'
headers = {"Accept": "application/vnd.github+json" ,"X-GitHub-Api-Version": "2022-11-28"}

token = os.getenv("GITHUB_TOKEN")
if token is not None:
    headers['Authorization'] = "Bearer " + token

# Get the 10 most recent workflow runs.
# TODO: parameterize it!
#
r = requests.get("%s?per_page=10" %(list_workflow_runs_url), headers=headers)
r.raise_for_status()

page_size=100
runs_map=[]

for run in r.json()['workflow_runs']:
    entry = {'id': run['id'],
             'created_at': run['created_at'],
             'conclusion': None,
             'jobs': []}

    jobs_map={}
    if run['status'] == "in_progress":
        runs_map.append(entry)
        continue
    else:
        entry['conclusion'] = run['conclusion']

    # Let's paginate as jobs can span in several pages.
    total_count = -1
    page=1
    while True:
        jobs_request = requests.get("%s?per_page=%s&page=%s" % (run['jobs_url'], page_size,page), headers=headers)
        jobs_request.raise_for_status()

        for job in jobs_request.json()['jobs']:
            entry['jobs'].append({'name': job['name'], 'run_id': job['run_id'],
                                    'conclusion': job['conclusion']})

        total_count = max(total_count, jobs_request.json()['total_count'])
        if len(entry['jobs']) >= total_count:
            break
        page += 1

    runs_map.append(entry)

def collect_jobs_stats(workflows_runs):
    '''
    Return a map of {'runs': NUMBER, 'fails': NUMBER, 'skips': NUMBER} indexed by job name
    '''
    stats = {}
    for run in workflows_runs:
        for job in run['jobs']:
            job_stat = stats.get(job['name'], {'runs': 0, 'fails': 0, 'skips': 0})
            job_stat['runs']+=1
            if job['conclusion'] != 'success':
                if job['conclusion'] == 'skipped':
                    job_stat['skips']+=1
                else:  # failed and cancelled
                    job_stat['fails']+=1
            stats[job['name']] = job_stat
    return stats

jobs_stats = collect_jobs_stats(runs_map)

regex = re.compile('kata-containers-ci-on-push / run-.*-tests.*')
for name, stat in jobs_stats.items():
    if regex.match(name):
        print('%s: (%s) fail=%s skips=%s' % (name, stat['runs'], stat['fails'], stat['skips']))
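
Regarding the note about cancelled jobs being counted as failed: a possible follow-up, shown here only as a sketch, is to track them in their own counter. The function name and the `cancels` key are assumptions made for this example, not part of the script above; it accepts the same `runs_map` structure.

```python
def collect_jobs_stats_with_cancels(workflows_runs):
    '''
    Return a map of {'runs', 'fails', 'skips', 'cancels'} counters indexed by job name
    '''
    stats = {}
    for run in workflows_runs:
        for job in run['jobs']:
            job_stat = stats.setdefault(
                job['name'], {'runs': 0, 'fails': 0, 'skips': 0, 'cancels': 0})
            job_stat['runs'] += 1
            conclusion = job['conclusion']
            if conclusion == 'skipped':
                job_stat['skips'] += 1
            elif conclusion == 'cancelled':
                job_stat['cancels'] += 1
            elif conclusion != 'success':
                # Anything else that is not a success still counts as a failure.
                job_stat['fails'] += 1
    return stats

# Tiny hand-made input in the same shape as runs_map.
example_runs = [
    {'jobs': [{'name': 'job-a', 'conclusion': 'success'},
              {'name': 'job-b', 'conclusion': 'cancelled'}]},
    {'jobs': [{'name': 'job-a', 'conclusion': 'failure'},
              {'name': 'job-b', 'conclusion': 'skipped'}]},
]
for name, stat in collect_jobs_stats_with_cancels(example_runs).items():
    print('%s: (%s) fail=%s skips=%s cancels=%s' % (
        name, stat['runs'], stat['fails'], stat['skips'], stat['cancels']))
```

With a separate counter, a job that is cancelled whenever its parent workflow is aborted no longer looks as unstable as one that genuinely fails.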