lowRISC / opentitan

OpenTitan: Open source silicon root of trust
https://www.opentitan.org
Apache License 2.0

[test-triage] Bazel builds terminating without reported error in 25 nightly regressions #15474

Closed drewmacrae closed 2 years ago

drewmacrae commented 2 years ago

Hierarchy of regression failure

Chip Level

Failure Description

Failure Buckets

In each case it appears that we are running a lot of steps in parallel, and they're running orders of magnitude longer than we expect. I think we were already running a lot in parallel before, but something got slower and is now causing the builds to time out.

Steps to Reproduce

Tests with similar or related failures

See the failure buckets. The failures don't appear to cluster on specific tests, so I think this has more to do with the environment and a resource issue.

drewmacrae commented 2 years ago

I'm going to try to make sure Bazel knows how many cores are on the nightly regression machines. I don't know whether these jobs are slow in wall-clock time because of something internal or not. One indicator here is that we see more than one kind of build step that's been slowed down: they all use Python, some to write vmem files and some to generate header files (which are used by C and Rust).
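As a rough illustration of the "how many cores does the machine really have" question, here is a minimal Python sketch that reports the CPUs a process is allowed to run on, which can be fewer than the host advertises. The helper name `usable_cpu_count` is hypothetical and not part of the OpenTitan tooling. It does not account for cgroup CPU quotas, which would need a separate check.

```python
import os


def usable_cpu_count() -> int:
    """Return how many CPUs this process may be scheduled on.

    Hypothetical helper for illustration only.
    """
    try:
        # Linux only: honours CPU affinity masks (taskset, cpusets).
        # It does NOT reflect cgroup CPU quotas such as CFS limits.
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # Platforms without sched_getaffinity: fall back to the
        # number of logical CPUs, or 1 if it cannot be determined.
        return os.cpu_count() or 1


if __name__ == "__main__":
    print(f"logical CPUs reported by the host: {os.cpu_count()}")
    print(f"CPUs usable by this process:       {usable_cpu_count()}")
```

If the two numbers differ on a regression machine, that would be one hint that a build tool sizing itself from the host view is oversubscribing the VM.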

drewmacrae commented 2 years ago

There's a quirk with Bazel that we've encountered before that may be exacerbating a resource constraint here. Jobs are dispatched in a manner appropriate for the host, which may not fit well with a VM or other requirements on cloud machines. We'll try limiting our dispatched build jobs to something more appropriate for the VMs we build on and see if that helps.
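A sketch of what limiting the dispatched jobs could look like, assuming the core count of the VM is known: the snippet below builds a Bazel command line with an explicit `--jobs` value instead of relying on Bazel's host-based default. The function name `bazel_command`, the `reserve` parameter, and the target label are hypothetical; only the `--jobs` flag itself is a Bazel option.

```python
from typing import List, Sequence


def bazel_command(targets: Sequence[str], cores: int, reserve: int = 1) -> List[str]:
    """Build a `bazel test` argv with a capped number of parallel jobs.

    `cores` is the number of CPUs the VM actually provides and
    `reserve` is how many to leave free for other work. Both are
    illustrative parameters, not values taken from the nightly setup.
    """
    # Never go below one job, even on a single-core machine.
    jobs = max(1, cores - reserve)
    return ["bazel", "test", f"--jobs={jobs}", *targets]


if __name__ == "__main__":
    # Hypothetical target pattern, shown only to make the output concrete.
    print(" ".join(bazel_command(["//sw/..."], cores=8)))
```

The same cap could instead be set once in a `.bazelrc` used by the regression machines, which avoids having to thread the value through every invocation.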

engdoreis commented 2 years ago

Adding the list of tests affected for traceability

drewmacrae commented 2 years ago

I bet the bazel builds in the public VCS runners are affected by the same issue and could be improved by telling bazel to try to use fewer cores.

drewmacrae commented 2 years ago

We don't know how to reproduce this failure, but it may have been resolved. We've requested resources more in line with what's used by the runners. We haven't limited the number of tasks simultaneously issued by Bazel yet. We should reopen if we see this again.