@trilinos/framework,
My PR iteration https://github.com/trilinos/Trilinos/pull/4872#issuecomment-482391276 just failed due to the error:
Caused: java.util.MissingResourceException: Can't find bundle for base name javax.servlet.LocalStrings, locale en_US
at java.util.ResourceBundle.throwMissingResourceException(ResourceBundle.java:1564)
at java.util.ResourceBundle.getBundleImpl(ResourceBundle.java:1387)
at java.util.ResourceBundle.getBundle(ResourceBundle.java:773)
at javax.servlet.GenericServlet.(GenericServlet.java:95)
Are the Jenkins slaves on these nodes broken or something?
I will try AT: RETEST.
We are finding more and more ways the PR testing can crash :-) It would be great to assemble the entire list of ways it can crash and then write a paper on it.
@trilinos/framework,
Same two builds crashed in the same way in the next PR iteration https://github.com/trilinos/Trilinos/pull/4872#issuecomment-482397103. So this is not random. Wonder how many PRs are getting held up because of this?
@trilinos/framework,
Yup, it is killing other PRs too, see https://github.com/trilinos/Trilinos/pull/4866#issuecomment-482395983.
Get ready for a PR log jam ...
@trilinos/framework,
And killing another PR https://github.com/trilinos/Trilinos/pull/4864#issuecomment-482369089.
See, people are putting on AT: RETEST not realizing that it is futile: #4874.
See the note I sent a bit ago – Jenkins did not restart properly after its maintenance yesterday.
@prwolfe said:
See the note I sent a bit ago – Jenkins did not restart properly after its maintenance yesterday.
Looks like it is fixed now, allowing merges; see https://github.com/trilinos/Trilinos/pull/4875#issuecomment-482594136
@trilinos/framework,
Can the Framework team please set up a notification system by which the Framework staff can be alerted to these types of problems right when they occur? Otherwise, developers are forced to triage and report infrastructure problems.
I entered the building this morning about 15 minutes before I went to the CEE team room. Not sure what would have made that faster.
I mean, when did the first PR build fail due to this Jenkins issue? Not when did I notice this and report it. What is the earliest time this could have been caught and addressed? Surely that was before my PR build failed and I first noticed this.
3:42pm yesterday – I went home at 3:00pm.
Same problem here: https://github.com/trilinos/Trilinos/pull/4863.
@trilinos/framework,
Looks like more problems with the Trilinos PR tester. The PR #4892 had the PR tester waiting 12 hours before it even started running the PR testing builds, and then 6 hours later it timed out with the message:
NOTICE: The AutoTester has encountered an internal error (usually a Communications Timeout), testing will be restarted, previous tests may still be running but will be ignored by the AutoTester...
What is wrong with the Trilinos autotester at this point?
NOTE: In order to not have problems with the Trilinos autotester holding up progress fixing things for ATDM, I have been having to merge selected branches directly to the 'atdm-nightly' branch. This means going forward that ATDM customers should be pulling from the 'atdm-nightly' branch, not the 'develop' or 'master' branches.
FYI: Looks like there may be a defect in the PR tester related to approvals, as evidenced by #4916. I rebased and pushed the branch and then re-approved the PR (which @nmhamster had created, so my approval should be enough to pass the approval check). But the autotester said it was not approved even though GitHub showed the PR had an active approval.
@trilinos/framework,
Looks like another infrastructure failure mode has hit the PR tester in https://github.com/trilinos/Trilinos/pull/5004#issuecomment-486418165. There, it says that the Trilinos_pullrequest_intel_17.0.1 build (Build Num: 3183) failed, yet it is clearly shown 100% passing on CDash as shown here.
The full Jenkins job output is shown at:
and shows:
...
12:04:12 Starting configure step.
12:04:12 Each . represents 1024 bytes of output
12:04:12 .................................................. Size: 50K
12:04:13 .................................................. Size: 100K
12:15:19 .................................................. Size: 150K
12:15:54 .......................... Size of output: 175K
12:17:14 configure submit error = 0
12:17:14 Configure suceeded.
12:17:14 Starting build step.
12:17:14 Each symbol represents 1024 bytes of output.
12:17:14 .................................................. Size: 49K
...
14:29:58 .................................................. Size: 15449K
14:30:22 ................................. Size of output: 15483K
14:30:32 Build succeeded.
14:30:33 build submit error = 0
14:30:33 Starting testing step.
14:41:55 Tests succeeded.
14:41:57 test submit error = 0
14:41:57 File upload submit error = 0
14:41:57 Single configure/build/test failed. The error code was: 255
14:41:57 Build step 'Execute shell' marked build as failure
14:41:57 Archiving artifacts
14:41:58 Finished: FAILURE
So the configure, build, and tests all "succeeded" and yet the PR tester script code reported:
Single configure/build/test failed. The error code was: 255
How is that possible?
Is the autotester trusting the return code from ctest -S <script>.cmake? We know that is not reliable as described in:
You need to use a different method to determine pass/fail of a ctest -S script other than the return code.
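For reference, here is a minimal sketch of what I mean (this is not the actual simple_testing.cmake code; the paths, site, and build names are assumptions for illustration). The idea is to capture each step's result explicitly and communicate pass/fail through a sentinel file rather than through the ctest -S exit code:

# Hypothetical ctest -S driver sketch: record pass/fail in a file the
# calling shell script can check, instead of trusting the exit code.
set(CTEST_SOURCE_DIRECTORY "$ENV{WORKSPACE}/Trilinos")            # assumed layout
set(CTEST_BINARY_DIRECTORY "$ENV{WORKSPACE}/pull_request_test")   # assumed layout
set(CTEST_CMAKE_GENERATOR "Unix Makefiles")
set(CTEST_SITE "example-site")
set(CTEST_BUILD_NAME "example-pr-build")

ctest_start("Experimental")

ctest_configure(RETURN_VALUE config_rv)
ctest_submit(PARTS Configure)

ctest_build(NUMBER_ERRORS build_errors)
ctest_submit(PARTS Build)

ctest_test(RETURN_VALUE test_rv)
ctest_submit(PARTS Test)

# Write the outcome explicitly so the outer driver does not need to
# interpret the ctest -S return code at all.
if(config_rv EQUAL 0 AND build_errors EQUAL 0 AND test_rv EQUAL 0)
  file(WRITE "${CTEST_BINARY_DIRECTORY}/PR_RESULT.txt" "PASSED\n")
else()
  file(WRITE "${CTEST_BINARY_DIRECTORY}/PR_RESULT.txt"
    "FAILED: configure=${config_rv} build_errors=${build_errors} test=${test_rv}\n")
endif()

The outer bash driver would then grep PR_RESULT.txt for PASSED instead of looking at the exit code of the ctest -S invocation.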
This appears to be an error early in the script: "ctest_empty_binary_directory problem removing the binary directory", which we have discussed before. I have yet to find a way to reset this error, but I have not had much time to spend on it either.
@prwolfe said:
This appears to be an error early in the script
Okay, I see it now:
14:04:12 CMake Error at /scratch/trilinos/workspace/trilinos-folder/Trilinos_pullrequest_intel_17.0.1/TFW_testing_single_configure_prototype/simple_testing.cmake:118 (ctest_empty_binary_directory):
14:04:12 ctest_empty_binary_directory problem removing the binary directory:
14:04:12 /scratch/trilinos/workspace/trilinos-folder/Trilinos_pullrequest_intel_17.0.1/pull_request_test
If you want to make this more robust, you are going to need to manually remove the build directory. But the fact that this is occurring suggests that the CMakeCache.txt file is not getting created on a previous Jenkins run. Is that possible with the current setup?
Yes - it is possible. And changing that would be difficult. I take it there is not a way to reset the error from your past experience?
@prwolfe asked:
I take it there is not a way to reset the error from your past experience?
No. Actually, error handling in cmake/ctest is not very well defined in some respects. See:
My advice is that if you want to build from scratch robustly, just manually delete the build directory yourself either in the cmake code or in the bash code.
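For example, a hedged sketch of the cmake-code option (the path below is an assumption, and this is not the actual PR driver code): replace the ctest_empty_binary_directory() call with an unconditional delete, which works even when a previous run died before writing CMakeCache.txt:

# Hypothetical replacement for ctest_empty_binary_directory(): wipe the
# build tree regardless of whether it looks like a valid build tree.
set(CTEST_BINARY_DIRECTORY "$ENV{WORKSPACE}/pull_request_test")   # assumed path

if(EXISTS "${CTEST_BINARY_DIRECTORY}")
  # Unlike ctest_empty_binary_directory(), file(REMOVE_RECURSE) does not
  # require a CMakeCache.txt to be present before it will delete anything.
  file(REMOVE_RECURSE "${CTEST_BINARY_DIRECTORY}")
endif()
file(MAKE_DIRECTORY "${CTEST_BINARY_DIRECTORY}")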
FYI: Took 19 hours from the time that I created the PR #5040 till the PR tester started testing the build as shown here. Is the PR testing system that backed up? What could cause a 19 hour delay before it even starts testing a PR?
FYI: I manually merged that branch into the 'atdm-nightly' branch last night, so the installs on 'waterman' are going; therefore, a massive delay in the PR tester like this is not a show stopper in this case.
CC: @trilinos/framework
FYI: More random failures seen in #5040, including some new ones. The testing iteration last night shown in https://github.com/trilinos/Trilinos/pull/5040#issuecomment-488118973 showed the builds Trilinos_pullrequest_intel_17.0.1 (Build Num: 3228) and Trilinos_pullrequest_gcc_7.2.0 (Build Num: 1408) failing with build and test failures. Yet without changing anything on that branch, the next testing iteration shown in https://github.com/trilinos/Trilinos/pull/5040#issuecomment-488206488 passed.
The Trilinos_pullrequest_intel_17.0.1 build shown here failed with errors like:
/projects/sems/install/rhel6-x86_64/sems/compiler/intel/17.0.1/mpich/3.2/bin/mpicxx: line 282: 33698 Segmentation fault (core dumped) $Show $CXX ${final_cppflags} $PROFILE_INCPATHS ${final_cxxflags} "${allargs[@]}" -I$includedir
What is up with that?
And the Trilinos_pullrequest_gcc_7.2.0 build shown here had hundreds of test failures with errors like:
-------------------------------------------------------------------------
Open MPI was unable to obtain the username in order to create a path
for its required temporary directories. This type of error is usually
caused by a transient failure of network-based authentication services
(e.g., LDAP or NIS failure due to network congestion), but can also be
an indication of system misconfiguration.
Please consult your system administrator about these issues and try
again.
--------------------------------------------------------------------------
[ascic158:80974] 5 more processes have sent help message help-orte-runtime.txt / orte:session:dir:nopwname
[ascic158:80974] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
But all of this passed the next testing iteration. Why is that?
Are these bad nodes? Seems like some analysis would be good to try to avoid this on future PR jobs.
@trilinos/framework
Looks like PR builds are crashing due to new "clean_workspace" module load problems. See https://github.com/trilinos/Trilinos/pull/5068#issuecomment-488587688.
For that PR iteration, all of the builds are crashing except for the Intel 17.0.1 build showing:
Cleaning directory pull_request_test due to command line option
Traceback (most recent call last):
File "Trilinos/commonTools/framework/clean_workspace/clean_workspace", line 78, in
clean.run()
File "Trilinos/commonTools/framework/clean_workspace/clean_workspace", line 41, in run
self.force_clean_space()
File "Trilinos/commonTools/framework/clean_workspace/clean_workspace", line 65, in force_clean_space
module('load', 'sems-env')
File "/scratch/trilinos/workspace/trilinos-folder/Trilinos_pullrequest_gcc_4.8.4/Trilinos/commonTools/framework/clean_workspace/Modules.py", line 181, in module
return Module().module(command, *arguments)
File "/scratch/trilinos/workspace/trilinos-folder/Trilinos_pullrequest_gcc_4.8.4/Trilinos/commonTools/framework/clean_workspace/Modules.py", line 151, in module
raise RuntimeError(stderr)
RuntimeError: ModuleCmd_Load.c(208):ERROR:105: Unable to locate a modulefile for 'sems-env'
Build step 'Execute shell' marked build as failure
Archiving artifacts
Finished: FAILURE
It looks like many/all of the recent PR builds are crashing due to this (see #5071, #5063, #5066, #5064, #5062, #5054, ...). There does not seem to be any pattern to which builds are showing this problem.
FYI: The Trilinos PR testing system seems to be a bit overloaded currently. For my recent PR #5215, it took 7 hours for the PR tester to even start testing the branch. It seems the rebuilds have not been implemented yet since it took the PR tester 5 hours to run the builds and tests. So the total time from when the PR was created to when it was pushed was 12 hours.
FYI: I have been running an experiment with the post-push CI build that I run where I rebuild every iteration; it has been going since 3/7/2019 (see c923294) and rebuilt successfully for 230 consecutive CI builds as shown here. This just failed today due to the merging of some Panzer changes in commit 6cede2e from PR #5228. @rppawlo confirmed that you need to wipe the Panzer build dir to fix this.
I manually wiped out the Panzer build directory with:
$ rm -r BUILD/packages/panzer/
so next time the post-push CI build runs, it will rebuild everything in Panzer (but it will be very fast because all of the other object files are still there).
Note that as shown in this query, the median rebuild time is about 22 minutes compared to 4 hours for a complete from-scratch build.
That shows how successful rebuilds would be for the Trilinos PR system. It would be huge.
Ross,
We turned on incremental rebuilds two days ago, but Roger's merge meant we had to do full rebuilds. Ride has been running longer and does seem to be very useful.
From: Bartlett, Roscoe A Sent: Monday, August 05, 2019 12:26 PM To: trilinos-framework@software.sandia.gov Subject: FYI: Randomly failing test MueLu_Maxwell3D-Tpetra-Stratimikos_MPI_4
Hello Trilinos Framework team,
FYI: Looks like the test MueLu_Maxwell3D-Tpetra-Stratimikos_MPI_4 is randomly failing in the GCC 4.8.4 build (which should match the GCC 4.8.4 PR build). As shown at:
https://testing-dev.sandia.gov/cdash/index.php?project=Trilinos&filtercount=3&showfilters=1&filtercombine=and&field1=buildname&compare1=61&value1=Linux-GCC-4.8.4-MPI_RELEASE_DEBUG_SHARED_PT_OPENMP_CI&field2=groupname&compare2=61&value2=Continuous&field3=buildstarttime&compare3=83&value3=16%20weeks%20ago
it failed twice since 7/25/2019.
This will be bringing down PR builds randomly.
Can someone create a GitHub issue to address this?
-Ross
-----Original Message----- From: CDash admin@cdash.org Sent: Monday, August 05, 2019 11:18 AM To: Bartlett, Roscoe A rabartl@sandia.gov Subject: [EXTERNAL] FAILED (t=1): Trilinos/MueLu - Linux-GCC-4.8.4-MPI_RELEASE_DEBUG_SHARED_PT_OPENMP_CI - Continuous
A submission to CDash for the project Trilinos has failing tests. You have been identified as one of the authors who have checked in changes that are part of this submission or you are listed in the default contact list.
Details on the submission can be found at https://testing.sandia.gov/cdash/buildSummary.php?buildid=5418940
Project: Trilinos
SubProject: MueLu
Site: ceerws1113
Build Name: Linux-GCC-4.8.4-MPI_RELEASE_DEBUG_SHARED_PT_OPENMP_CI
Build Time: 2019-08-05T14:46:26 UTC
Type: Continuous
Tests not passing: 1
Tests failing: MueLu_Maxwell3D-Tpetra-Stratimikos_MPI_4 | Completed (Failed) | (https://testing.sandia.gov/cdash/testDetails.php?test=85726525&build=5418940)
-CDash on testing.sandia.gov
@trilinos/framework
Looks like we have a randomly failing test, SEACASIoss_pamgen_exodus_io_info, in the CUDA and GCC PR builds, as already reported in #5794 (in the context of ATDM). It just took out my PR build #5916 shown here.
Looks like this test SEACASIoss_pamgen_exodus_io_info has taken out several PR build iterations recently, as shown here.
You might want to disable this test in PR testing going forward until it gets fixed ...
When is the Trilinos PR tester going to be updated to only redo builds that failed and do rebuilds, instead of builds from scratch?
I have a fix for this, will add a PR later today hopefully.
@trilinos/framework
Looks like there is a problem with the PR testing with MPI startup. My PR #6034 showed two failing tests in different PR builds:
both showing output like:
[tr-test-7.novalocal:06582] opal_os_dirpath_create: Error: Unable to create the sub-directory (/tmp/openmpi-sessions-trilinos@tr-test-7_0/5848) of (/tmp/openmpi-sessions-trilinos@tr-test-7_0/5848/0/0), mkdir failed [1]
[tr-test-7.novalocal:06582] [[5848,0],0] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 107
[tr-test-7.novalocal:06582] [[5848,0],0] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 402
[tr-test-7.novalocal:06582] [[5848,0],0] ORTE_ERROR_LOG: Error in file ess_hnp_module.c at line 638
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_session_dir failed
--> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
I don't think it is possible that my code changes could have triggered this MPI startup failure.
Any idea what this is about?
These failures occurred on 2 very different systems (one physical and one VM) and both appear to have sufficient space and permissions for the file to be created. Without knowing more about the routine in MPI I would simply be guessing here.
FYI: As shown in this query, it looks like only my PR is impacted so far. We will see what happens on the next PR iteration.
FYI: The last PR iteration for #6034 passed all the tests. However, someone should keep an eye on the PR testing to see if the strange "Unable to create the sub-directory" error mentioned above takes out any other PR iterations. (That error took out the first 2 PR testing iterations for #6034.)
When is the PR tester going to be extended to only rerun testing for builds that failed, leaving the builds that passed in place? This is critical to improve the robustness of the PR tester and allow for more PR builds to be added without seriously damaging PR testing stability.
Also, when are these PR builds going to utilize rebuilds? The post-push CI testing that has been running for months, shown here, has a median rebuild time of about 20 minutes (whereas the from-scratch build takes about 3.5 hours) and has been very robust for several months now, rebuilding over and over again. That would massively speed up PR testing.
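For reference, a minimal sketch of what an incremental-rebuild PR iteration could look like (assumed paths and names, not the real TFW_testing_single_configure_prototype code): the binary directory is simply reused rather than emptied, so only targets affected by the new commits get rebuilt.

# Hypothetical incremental-rebuild ctest -S iteration: there is no
# ctest_empty_binary_directory() / file(REMOVE_RECURSE) step, so existing
# object files are reused (Jenkins is assumed to have updated the source).
set(CTEST_SOURCE_DIRECTORY "$ENV{WORKSPACE}/Trilinos")
set(CTEST_BINARY_DIRECTORY "$ENV{WORKSPACE}/pull_request_test")   # kept between iterations
set(CTEST_CMAKE_GENERATOR "Unix Makefiles")
set(CTEST_SITE "example-site")
set(CTEST_BUILD_NAME "example-pr-rebuild")

ctest_start("Experimental")
ctest_configure()   # reuses the existing CMakeCache.txt
ctest_build()       # rebuilds only out-of-date targets
ctest_test()
ctest_submit()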
FYI: Another randomly failing test killing a PR iteration https://github.com/trilinos/Trilinos/pull/6045#issuecomment-538617025 and requiring all of the PR builds to be run from scratch! This is a huge waste of resources. If the PR tester would rebuild and just rerun the build that failed, it would have taken just a couple of minutes and used almost no computing resources.
FYI: We should have this randomly failing set of tests fixed in PR #6050
@trilinos/framework
The last PR iteration for #6050 shown here was very fast. Was that using a rebuild? Can the PR tester do a rebuild in some cases? If so, why not in every case?
starting last night
starting what?
Broken PR testing
@prwolfe See, for example, #6109.
Hmm, this looks like the entire network went down for a bit last night. I see failures submitting to CDash, Java errors in Jenkins, NFS errors across the board, and a few clone errors due to timeouts. I will ask a few questions.
Any update on this? The links to cdash PR results are dead, but maybe this is an unrelated issue?
@jhux2 - I just came back from talking with the CEE team. They have had network issues all week and had to reset the Jenkins instance yesterday from a snapshot. That snapshot has the PR testing turned off. They are seeing fewer issues today but have no reason to think that corporate has everything fixed for good.
I cannot try turning that back on if CDash is down as it will fail for that reason (it's all connected). I just sent an email to Ross and others about the CDash instance and will try everything when that comes back.
@prwolfe Thank you for the update!
@trilinos/framework,
Given that the most recent merge was:
96f63b1 "Merge Pull Request #6105 from bartlettroscoe/Trilinos/6017-atdm-allow-enable-percept"
Author: trilinos-autotester <trilinos-autotester@trilinos.org>
Date: Wed Oct 16 11:00:20 2019 -0600 (3 days ago)
it seems that PR testing is not back online yet.
Might I suggest that you switch the PR testing from testing.sandia.gov/cdash/ to testing-dev.sandia.gov/cdash/ for now? That will get PR testing back online. Also, please consider doing dual submits for the PR builds to both sites. It is nice having a backup site.
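A rough sketch of the dual-submit idea, assuming the PR driver is a ctest -S script and CMake 3.14 or newer (which added the SUBMIT_URL option to ctest_submit); the URLs below are only guesses at the drop locations, not the exact ones the PR system uses:

# Hypothetical excerpt from the submit step of the PR driver: push the same
# results to both CDash instances, treating the -dev site as primary.
ctest_submit(SUBMIT_URL "https://testing-dev.sandia.gov/cdash/submit.php?project=Trilinos"
  RETURN_VALUE submit_dev_rv)
ctest_submit(SUBMIT_URL "https://testing.sandia.gov/cdash/submit.php?project=Trilinos"
  RETURN_VALUE submit_backup_rv)
if(NOT submit_dev_rv EQUAL 0)
  message(WARNING "Primary CDash submit failed (rv=${submit_dev_rv})")
endif()
if(NOT submit_backup_rv EQUAL 0)
  message(WARNING "Backup CDash submit failed (rv=${submit_backup_rv}); continuing anyway")
endif()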
@bartlettroscoe - I changed to testing-dev this morning and am seeing the same results. Up to and including the testing-dev site giving me a "500 error". I am also seeing more general network errors this morning, so I will follow up on that.
???
Strange, because I just now went to:
and it came up fine. But I am not seeing any "Pull Request" results there yet. Is a PR build running that should be submitting results?
Anyone else not able to get to https://testing-dev.sandia.gov/cdash/index.php?project=Trilinos ?
works for me
FYI: I am starting to see PR results showing up at:
So it looks like the PR tester might be back online?
No, just instance 0 – and only 3 of the 7 are reporting. I am still seeing stuff like that “500 error” in the log and it looks like the proxy setting for the cloud nodes might be different for testing-dev for whatever reason.
This kind of stuff should already be dealt with…
In my three waiting PRs, I do not see evidence that the PR tester is back online.
I’m not seeing anything either for my PR submitted last week. Also, ROL’s nightly tests aren’t posting to testing. I get an http error when looking for the test results. Is this related?
Also, ROL’s nightly tests aren’t posting to testing. I get an http error when looking for the test results. Is this related?
@dridzal, replace 'testing.sandia.gov/cdash/' with 'testing-dev.sandia.gov/cdash/' and your results should be there.
@trilinos/framework
Description
Over the last few weeks and months, the Trilinos auto PR tester has seen several cases where one or more PR builds for a given PR testing iteration failed to produce results on CDash or showed build or test failures that were not related to the changes on that particular PR.
This Story is to log these failures and keep track of them in order to provide some statistics that can inform how to address them. This should replace making comments in individual PRs that exhibit these types of problems, like #3260 and #3213.
PR Builds Showing Random Failures
Below are a few examples of the stability problems (but not all of them).