
Trilinos_PR_cuda-11.4.2-uvm-off PR build not running/submitted to CDash starting 2024-01-24 #12696

Open bartlettroscoe opened 9 months ago

bartlettroscoe commented 9 months ago

CC: @trilinos/framework, @sebrowne, @achauphan

Description

As shown in this query, the Trilinos PR build Trilinos_PR_cuda-11.4.2-uvm-off has not posted full results to CDash since early yesterday (2024-01-24):

[screenshot: CDash query results for the Trilinos_PR_cuda-11.4.2-uvm-off PR build]

Yet many PR iterations have run and posted to CDash in that time, as shown, for example, in this query for the Trilinos_PR_clang-11.0.1 PR build:

[screenshot: CDash query results for the Trilinos_PR_clang-11.0.1 PR build]

That is a bunch of PRs that are not passing their PR test iterations and will not be getting merged. (This explains why it took so long for the autotester to run on my new PR #12695.)

Looks like this has so far impacted the PRs:

bartlettroscoe commented 9 months ago

@ccober6 and @sebrowne, this would appear to be a catastrophic failure of the PR testing system.

achauphan commented 9 months ago

It appears that all PR GPU machines are down. @sebrowne has a ticket in.

bartlettroscoe commented 9 months ago

> It appears that all PR GPU machines are down. @sebrowne has a ticket in.

@sebrowne and @achauphan, given the importance of these PR build machines, is it possible to set up some type of monitoring system for them so that if they go down, someone is notified ASAP? There must be monitoring tools that can do this type of thing. (Or Jenkins should be able to do this, since it is the one trying to run these jobs; perhaps with the right Jenkins plugin, as described here?) I know all of this autotester and Jenkins infrastructure is going to be thrown away once Trilinos moves to GHA, but the same issues can occur with that process as well.

The problem right now is that when something goes wrong with the Trilinos infrastructure, it is the Trilinos developers who have to detect and report the problem. Problems with the infrastructure will occur from time to time (that is to be expected), but when they do, it would be good if the people maintaining the infrastructure could be directly notified and not have to rely on the Trilinos developers to detect and report problems like this.
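
For what it's worth, here is a minimal sketch of the kind of periodic check I have in mind, polling the Jenkins REST endpoint /computer/api/json for offline nodes. The Jenkins URL and the notification step are placeholders, not the actual PR infrastructure configuration:

```python
#!/usr/bin/env python3
# Hypothetical sketch: poll the Jenkins "computer" REST API and report any
# build nodes that Jenkins considers offline.  Intended to run from cron or
# a similar scheduler; the URL below is a placeholder.
import json
import urllib.request

JENKINS_URL = "https://jenkins.example.gov"  # placeholder, not the real instance
API = f"{JENKINS_URL}/computer/api/json?tree=computer[displayName,offline,offlineCauseReason]"

def find_offline_nodes():
    """Return (name, reason) pairs for every node Jenkins reports as offline."""
    with urllib.request.urlopen(API) as resp:
        data = json.load(resp)
    return [
        (node["displayName"], node.get("offlineCauseReason", ""))
        for node in data.get("computer", [])
        if node.get("offline")
    ]

if __name__ == "__main__":
    for name, reason in find_offline_nodes():
        # Replace this print with an email/Slack/pager notification as appropriate.
        print(f"ALERT: Jenkins node '{name}' is offline: {reason or 'no reason given'}")
```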

achauphan commented 9 months ago

> [...] it would be good if the people maintaining the infrastructure could be directly notified and not have to rely on the Trilinos developers to detect and report problems like this.

Agreed, I will bring this up at our retro next week to see if there is a reasonable solution we can set up in the interim before AT2. Currently, Jenkins does send an email when a node goes offline, which I had missed.

achauphan commented 9 months ago

As a status update, all GPU nodes were brought back online this morning. One node has since been manually taken back offline because it was showing very odd, poor performance and had picked up the first few jobs this morning.

bartlettroscoe commented 9 months ago

FYI: It is not just the CUDA build that has failed to produce PR testing results on CDash. The Trilinos_PR_gcc-8.3.0-debug build rhel7_sems-gnu-8.3.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables has as well. See https://github.com/trilinos/Trilinos/pull/12695#issuecomment-1912764585. However, one of these builds just started 37 minutes ago, so it is not yet clear how serious the problem is.

NOTE: It would also be great to set up monitoring of CDash that looks for missing PR build results. That is similar to looking for randomly failing tests (see TriBITSPub/TriBITS#600), but it does not require knowing the repo versions. It is complicated, however, by the challenge of grouping builds on CDash that belong to the same PR testing iteration (all you have to go on is the Build Start Time, which differs for each build but is typically within a few minutes across one iteration). I suggested this in this internal comment.
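
As a rough illustration of a simpler per-day check (ignoring the grouping-by-iteration problem), here is a sketch that queries the CDash api/v1/index.php endpoint and flags expected PR build-name prefixes that posted nothing on a given day. The CDash URL and the prefix list are placeholders, not a definitive configuration:

```python
#!/usr/bin/env python3
# Hypothetical sketch: query the CDash index API for one day and warn about
# expected PR builds that posted no results.  The site URL and the list of
# expected build-name substrings are illustrative placeholders.
import json
import urllib.request

CDASH_URL = "https://cdash.example.gov/cdash"  # placeholder
PROJECT = "Trilinos"
EXPECTED_BUILDS = [
    "Trilinos_PR_cuda-11.4.2-uvm-off",   # illustrative only
    "Trilinos_PR_clang-11.0.1",
]

def build_names_for_date(date_str):
    """Return the set of build names CDash reports for the project on the given date."""
    url = f"{CDASH_URL}/api/v1/index.php?project={PROJECT}&date={date_str}"
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    names = set()
    for group in data.get("buildgroups", []):
        for build in group.get("builds", []):
            names.add(build["buildname"])
    return names

if __name__ == "__main__":
    posted = build_names_for_date("2024-01-25")
    for expected in EXPECTED_BUILDS:
        if not any(expected in name for name in posted):
            print(f"WARNING: no CDash results posted containing '{expected}'")
```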

ndellingwood commented 9 months ago

PR #12707 has been hit with a couple of issues in the gcc build mentioned above, as well as this build:

https://github.com/trilinos/Trilinos/pull/12707#issuecomment-1919946907

That was on ascic166; the node ran out of memory.