Closed by drewmacrae 2 years ago
I'm going to try to make sure Bazel knows how many cores are on the nightly regression's machines. I don't know whether these jobs are slow in wall-clock time because of something internal or not. One indicator is that more than one kind of build step has slowed down: they all use Python, some to write vmem files and some to generate header files (which are used by C and Rust).
There's a quirk with Bazel that we've encountered before that may be exacerbating a resource constraint here: Bazel sizes its parallelism for the host it detects, which may not match the resources actually available to a VM or other cloud machine. We'll try limiting our dispatched build jobs to something more appropriate for the VMs we build on and see if that helps; a sketch of the relevant settings is below.
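A minimal sketch of what that could look like in a `.bazelrc`, assuming the CI VMs have 8 cores and 16 GB of RAM (illustrative placeholder values, not the actual machine shape):

```
# Hypothetical .bazelrc fragment: tell Bazel how much of the machine it may assume.
# The numbers below are placeholders; the real values depend on the VM shape.
build --local_cpu_resources=8        # cores Bazel should schedule actions against
build --local_ram_resources=16384    # RAM available to Bazel, in MB
```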
Adding the list of tests affected for traceability
I bet the Bazel builds in the public VCS runners are affected by the same issue and could be improved by telling Bazel to use fewer cores.
We don't know how to reproduce this failure, but it may have been resolved. We've requested resources more in line with what the runners use. We haven't yet limited the number of tasks Bazel issues simultaneously. We should reopen if we see this again.
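If we do end up capping the number of simultaneously issued tasks later, the relevant knobs would look something like the following (values are illustrative, not tuned):

```
# Hypothetical follow-up: cap how many actions/tests Bazel dispatches at once.
build --jobs=8             # upper bound on concurrently running build actions
test  --local_test_jobs=4  # upper bound on concurrently running local tests
```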
Hierarchy of regression failure
Chip Level
Failure Description
Failure Buckets
Some pass patterns missing: ['^TEST PASSED (UVM_)?CHECKS$'] has 27 failures:
Test chip_sw_example_flash has 1 failure.
0.chip_sw_example_flash.723784426  Log /container/opentitan-public/scratch/os_regression/chip_earlgrey_asic-sim-vcs/0.chip_sw_example_flash/latest/run.log
Test chip_sw_example_concurrency has 1 failure.
0.chip_sw_example_concurrency.3476378567  Log /container/opentitan-public/scratch/os_regression/chip_earlgrey_asic-sim-vcs/0.chip_sw_example_concurrency/latest/run.log
Test chip_sw_sleep_pin_mio_dio_val has 1 failure.
0.chip_sw_sleep_pin_mio_dio_val.3452962466  Log /container/opentitan-public/scratch/os_regression/chip_earlgrey_asic-sim-vcs/0.chip_sw_sleep_pin_mio_dio_val/latest/run.log
Test chip_sw_exit_test_unlocked_bootstrap has 1 failure.
0.chip_sw_exit_test_unlocked_bootstrap.1327559239  Log /container/opentitan-public/scratch/os_regression/chip_earlgrey_asic-sim-vcs/0.chip_sw_exit_test_unlocked_bootstrap/latest/run.log
Test chip_sw_spi_host_tx_rx has 1 failure.
0.chip_sw_spi_host_tx_rx.4006519917  Log /container/opentitan-public/scratch/os_regression/chip_earlgrey_asic-sim-vcs/0.chip_sw_spi_host_tx_rx/latest/run.log
... and 20 more tests.
In each case it appears that we are running a lot of steps in parallel, and those steps are taking orders of magnitude longer than we expect. I think we were already running this many steps in parallel before; something got slower, and that is what is causing the timeouts.
Steps to Reproduce
Tests with similar or related failures
See the failure buckets. The failures don't appear to cluster on specific tests, so I think this is more to do with the environment and a resource constraint than with any individual test.