trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

Framework: random failures blocking merge of PR #11361

Closed: jhux2 closed this 10 months ago

jhux2 commented 1 year ago

Bug Report

@trilinos/framework

Description

There are a variety of random failures blocking the merge of PR #11341. None of these are due to the PR itself, since it only touches a Python utility that isn't exercised by the AT.

Reported in https://sems-atlassian-son.sandia.gov/jira/servicedesk/customer/portal/7/TRILINOSHD-235.

GrahamBenHarper commented 1 year ago

I believe other PRs have recently run into the problem of running out of memory, but I didn't mark which ones since they were fixed with a retest. If I find more instances, I'll make sure to link them here as related.

GrahamBenHarper commented 1 year ago

https://github.com/trilinos/Trilinos/pull/11330#issuecomment-1338009742 is another example of a job killed by (presumably) running out of memory. I found it by skimming a handful of PR builds which have exactly two errors as shown in this cdash query.

sebrowne commented 1 year ago

The Intel build running out of memory is addressed (not truly fixed: we turned the Intel build back off, so a re-run will bypass the failure until we figure out a level of build parallelism that won't overload the machine).
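For reference, one generic way to cap build concurrency independently of the `-j` value is CMake's Ninja job-pool support. This is only a sketch of that mechanism, not necessarily what Framework will adopt, and the pool sizes below are made-up placeholders:

```cmake
# Sketch: bound concurrent compile/link jobs with Ninja job pools so a
# memory-hungry build cannot oversubscribe the machine.
# Pool sizes are illustrative placeholders, not tuned values.
set(CMAKE_JOB_POOLS "compile=16;link=4")  # at most 16 compiles and 4 links at once
set(CMAKE_JOB_POOL_COMPILE compile)       # route all compile jobs through the pool
set(CMAKE_JOB_POOL_LINK link)             # link steps are usually the memory hogs
```

With the Ninja generator, these pools limit concurrency regardless of how large a `-j` value the build is invoked with.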

The MiniTensor test from earlier in that PR's testing history is fixed now.

The Domi test timed out and was on a CUDA build. That failure is caused by GPU oversubscription, which is another thing we have immediate work planned for (hopefully landing by the end of the week).
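For context, CTest's resource-allocation feature (CMake 3.16+) is one standard way to keep tests from oversubscribing a GPU. A hedged sketch follows; the test name and slot counts are hypothetical, and this is not necessarily the mechanism Framework is implementing:

```cmake
# Sketch: declare that each CUDA test consumes one GPU slot, so CTest will
# never schedule more concurrent tests onto a GPU than it has slots.
# The test name below is a placeholder, not a real Trilinos test.
set_property(TEST Domi_ExampleTest PROPERTY RESOURCE_GROUPS "gpus:1")

# A resource spec file, passed via `ctest --resource-spec-file gpus.json`,
# then describes the hardware, e.g. one GPU exposing 2 slots:
#   { "version": { "major": 1, "minor": 0 },
#     "local": [ { "gpus": [ { "id": "0", "slots": 2 } ] } ] }
```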

Framework empathizes with the difficulty these nondeterministic results are causing, and we're working toward resolution on each of the issues as fast as we can. I will update this issue once we get the CUDA changes in and the Intel build turned back on, if that's satisfactory.

github-actions[bot] commented 10 months ago

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity. If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label. If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE. If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.

jhux2 commented 10 months ago

No longer an issue.