Closed by drewmacrae 2 years ago
I'm going to try to make sure Bazel knows how many cores are on the nightly regression's machines. I don't know whether these jobs are slow in wall-clock time because of something internal or not. One indicator is that more than one kind of build step has slowed down: they all use Python, some to write vmem files and some to generate header files (which are used by C and Rust).
There's a quirk with Bazel that we've encountered before that may be exacerbating a resource constraint here: Bazel sizes its parallelism for the host it detects, which may not match the resources actually available to a VM or other cloud machine. We'll try limiting our dispatched build jobs to something more appropriate for the VMs we build on and see if that helps; a sketch of the relevant settings is below.
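A minimal sketch of what that could look like in a `.bazelrc`, assuming the CI VMs have 8 cores and 16 GB of RAM (illustrative placeholder values, not the actual machine shape):

```
# Hypothetical .bazelrc fragment: tell Bazel how much of the machine it may assume.
# The numbers below are placeholders; the real values depend on the VM shape.
build --local_cpu_resources=8        # cores Bazel should schedule actions against
build --local_ram_resources=16384    # RAM available to Bazel, in MB
```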
Adding the list of tests affected for traceability
I bet the Bazel builds in the public VCS runners are affected by the same issue and could be improved by telling Bazel to use fewer cores.
We don't know how to reproduce this failure, but it may have been resolved. We've requested resources more in line with what the runners use. We haven't yet limited the number of tasks Bazel issues simultaneously. We should reopen if we see this again.
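If we do end up capping the number of simultaneously issued tasks later, the relevant knobs would look something like the following (values are illustrative, not tuned):

```
# Hypothetical follow-up: cap how many actions/tests Bazel dispatches at once.
build --jobs=8             # upper bound on concurrently running build actions
test  --local_test_jobs=4  # upper bound on concurrently running local tests
```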
Hierarchy of regression failure
Chip Level
Failure Description
Failure Buckets
Some pass patterns missing: ['^TEST PASSED (UVM_)?CHECKS$'] has 27 failures:
Test chip_sw_example_flash has 1 failure.
0.chip_sw_example_flash.723784426  Log /container/opentitan-public/scratch/os_regression/chip_earlgrey_asic-sim-vcs/0.chip_sw_example_flash/latest/run.log
Test chip_sw_example_concurrency has 1 failure.
0.chip_sw_example_concurrency.3476378567  Log /container/opentitan-public/scratch/os_regression/chip_earlgrey_asic-sim-vcs/0.chip_sw_example_concurrency/latest/run.log
Test chip_sw_sleep_pin_mio_dio_val has 1 failure.
0.chip_sw_sleep_pin_mio_dio_val.3452962466  Log /container/opentitan-public/scratch/os_regression/chip_earlgrey_asic-sim-vcs/0.chip_sw_sleep_pin_mio_dio_val/latest/run.log
Test chip_sw_exit_test_unlocked_bootstrap has 1 failure.
0.chip_sw_exit_test_unlocked_bootstrap.1327559239  Log /container/opentitan-public/scratch/os_regression/chip_earlgrey_asic-sim-vcs/0.chip_sw_exit_test_unlocked_bootstrap/latest/run.log
Test chip_sw_spi_host_tx_rx has 1 failure.
0.chip_sw_spi_host_tx_rx.4006519917  Log /container/opentitan-public/scratch/os_regression/chip_earlgrey_asic-sim-vcs/0.chip_sw_spi_host_tx_rx/latest/run.log
... and 20 more tests.
In each case it appears that we are running a lot of steps in parallel, and those steps are taking orders of magnitude longer than we expect. I think we were already running this many steps in parallel before; something got slower, and that is what is causing the timeouts.
Steps to Reproduce
Tests with similar or related failures
See the failure buckets. The failures don't appear to cluster on specific tests, so I think this is more to do with the environment and a resource constraint than with any individual test.