@trilinos/framework,
My PR iteration https://github.com/trilinos/Trilinos/pull/4872#issuecomment-482391276 just failed due to the error:
Caused: java.util.MissingResourceException: Can't find bundle for base name javax.servlet.LocalStrings, locale en_US
at java.util.ResourceBundle.throwMissingResourceException(ResourceBundle.java:1564)
at java.util.ResourceBundle.getBundleImpl(ResourceBundle.java:1387)
at java.util.ResourceBundle.getBundle(ResourceBundle.java:773)
at javax.servlet.GenericServlet.(GenericServlet.java:95)
Are the Jenkins slaves on these nodes broken or something?
I will try AT: RETEST.
We are finding more and more ways the PR testing can crash :-) It would be great to assemble the entire list of ways it can crash and then write a paper on it.
@trilinos/framework,
Same two builds crashed in the same way in the next PR iteration https://github.com/trilinos/Trilinos/pull/4872#issuecomment-482397103. So this is not random. Wonder how many PRs are getting held up because of this?
@trilinos/framework,
Yup, it is killing other PRs too, see https://github.com/trilinos/Trilinos/pull/4866#issuecomment-482395983.
Get ready for a PR log jam ...
@trilinos/framework,
And killing another PR https://github.com/trilinos/Trilinos/pull/4864#issuecomment-482369089.
See, people are putting on AT: RETEST not realizing that it is futile: #4874.
See the note I sent a bit ago – Jenkins did not restart properly after its maintenance yesterday.
@prwolfe said:
See the note I sent a bit ago – Jenkins did not restart properly after its maintenance yesterday.
Looks like it is fixed now, allowing merges; see https://github.com/trilinos/Trilinos/pull/4875#issuecomment-482594136
@trilinos/framework,
Can the Framework team please set up a notification system by which the Framework staff can be alerted to these types of problems right when they occur? Otherwise, developers are forced to triage and report infrastructure problems.
I entered the building this morning about 15 minutes before I went to the CEE team room. Not sure what would have made that faster.
I mean, when did the first PR build fail due to this Jenkins issue? Not when did I notice this and report it. What is the earliest time this could have been caught and addressed? Surely that was before my PR build failed and I first noticed this.
3:42pm yesterday – I went home at 3:00pm.
Same problem here: https://github.com/trilinos/Trilinos/pull/4863.
@trilinos/framework,
Looks like more problems with the Trilinos PR tester. The PR #4892 had the PR tester waiting 12 hours before it even started running the PR testing builds, and then 6 hours later it timed out with the message:
NOTICE: The AutoTester has encountered an internal error (usually a Communications Timeout), testing will be restarted, previous tests may still be running but will be ignored by the AutoTester...
What is wrong with the Trilinos autotester at this point?
NOTE: In order to not have problems with the Trilinos autotester holding up progress fixing things for ATDM, I have been having to merge selected branches directly to the 'atdm-nightly' branch. This means going forward that ATDM customers should be pulling from the 'atdm-nightly' branch, not the 'develop' or 'master' branches.
FYI: Looks like there may be a defect in the PR tester related to approvals, as evidenced by #4916. I rebased and pushed the branch and then re-approved the PR (which @nmhamster had created, so my approval should be enough to pass the approval check). But the autotester said it was not approved even though GitHub showed the PR had an active approval.
@trilinos/framework,
Looks like another infrastructure failure mode has hit the PR tester in https://github.com/trilinos/Trilinos/pull/5004#issuecomment-486418165. There, it says that the Trilinos_pullrequest_intel_17.0.1 build (Build Num: 3183) failed, yet it is clearly shown 100% passing on CDash as shown here.
The full Jenkins job output is shown at:
and shows:
...
12:04:12 Starting configure step.
12:04:12 Each . represents 1024 bytes of output
12:04:12 .................................................. Size: 50K
12:04:13 .................................................. Size: 100K
12:15:19 .................................................. Size: 150K
12:15:54 .......................... Size of output: 175K
12:17:14 configure submit error = 0
12:17:14 Configure suceeded.
12:17:14 Starting build step.
12:17:14 Each symbol represents 1024 bytes of output.
12:17:14 .................................................. Size: 49K
...
14:29:58 .................................................. Size: 15449K
14:30:22 ................................. Size of output: 15483K
14:30:32 Build succeeded.
14:30:33 build submit error = 0
14:30:33 Starting testing step.
14:41:55 Tests succeeded.
14:41:57 test submit error = 0
14:41:57 File upload submit error = 0
14:41:57 Single configure/build/test failed. The error code was: 255
14:41:57 Build step 'Execute shell' marked build as failure
14:41:57 Archiving artifacts
14:41:58 Finished: FAILURE
So the configure, build, and tests all "succeeded" and yet the PR tester script code reported:
Single configure/build/test failed. The error code was: 255
How is that possible?
Is the autotester trusting the return code from ctest -S <script>.cmake? We know that is not reliable as described in:
You need to use a different method to determine pass/fail of a ctest -S script other than the return code.
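For reference, here is a minimal sketch of what I mean (this is not the actual simple_testing.cmake code; the paths, site, and build names are assumptions for illustration). The idea is to capture each step's result explicitly and communicate pass/fail through a sentinel file rather than through the ctest -S exit code:

# Hypothetical ctest -S driver sketch: record pass/fail in a file the
# calling shell script can check, instead of trusting the exit code.
set(CTEST_SOURCE_DIRECTORY "$ENV{WORKSPACE}/Trilinos")            # assumed layout
set(CTEST_BINARY_DIRECTORY "$ENV{WORKSPACE}/pull_request_test")   # assumed layout
set(CTEST_CMAKE_GENERATOR "Unix Makefiles")
set(CTEST_SITE "example-site")
set(CTEST_BUILD_NAME "example-pr-build")

ctest_start("Experimental")

ctest_configure(RETURN_VALUE config_rv)
ctest_submit(PARTS Configure)

ctest_build(NUMBER_ERRORS build_errors)
ctest_submit(PARTS Build)

ctest_test(RETURN_VALUE test_rv)
ctest_submit(PARTS Test)

# Write the outcome explicitly so the outer driver does not need to
# interpret the ctest -S return code at all.
if(config_rv EQUAL 0 AND build_errors EQUAL 0 AND test_rv EQUAL 0)
  file(WRITE "${CTEST_BINARY_DIRECTORY}/PR_RESULT.txt" "PASSED\n")
else()
  file(WRITE "${CTEST_BINARY_DIRECTORY}/PR_RESULT.txt"
    "FAILED: configure=${config_rv} build_errors=${build_errors} test=${test_rv}\n")
endif()

The outer bash driver would then grep PR_RESULT.txt for PASSED instead of looking at the exit code of the ctest -S invocation.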
This appears to be an error early in the script: "ctest_empty_binary_directory problem removing the binary directory", which we have discussed before. I have yet to find a way to reset this error, but I have not had much time to spend on it either.
@prwolfe said:
This appears to be an error early in the script
Okay, I see it now:
14:04:12 CMake Error at /scratch/trilinos/workspace/trilinos-folder/Trilinos_pullrequest_intel_17.0.1/TFW_testing_single_configure_prototype/simple_testing.cmake:118 (ctest_empty_binary_directory):
14:04:12 ctest_empty_binary_directory problem removing the binary directory:
14:04:12 /scratch/trilinos/workspace/trilinos-folder/Trilinos_pullrequest_intel_17.0.1/pull_request_test
If you want to make this more robust, you are going to need to manually remove the build directory. But the fact that this is occurring suggests that the CMakeCache.txt file is not getting created on a previous Jenkins run. Is that possible with the current setup?
Yes - it is possible. And changing that would be difficult. I take it there is not a way to reset the error from your past experience?
@prwolfe asked:
I take it there is not a way to reset the error from your past experience?
No. Actually, error handling in cmake/ctest is not very well defined in some respects. See:
My advice is that if you want to build from scratch robustly, just manually delete the build directory yourself either in the cmake code or in the bash code.
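For example, a hedged sketch of the cmake-code option (the path below is an assumption, and this is not the actual PR driver code): replace the ctest_empty_binary_directory() call with an unconditional delete, which works even when a previous run died before writing CMakeCache.txt:

# Hypothetical replacement for ctest_empty_binary_directory(): wipe the
# build tree regardless of whether it looks like a valid build tree.
set(CTEST_BINARY_DIRECTORY "$ENV{WORKSPACE}/pull_request_test")   # assumed path

if(EXISTS "${CTEST_BINARY_DIRECTORY}")
  # Unlike ctest_empty_binary_directory(), file(REMOVE_RECURSE) does not
  # require a CMakeCache.txt to be present before it will delete anything.
  file(REMOVE_RECURSE "${CTEST_BINARY_DIRECTORY}")
endif()
file(MAKE_DIRECTORY "${CTEST_BINARY_DIRECTORY}")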
FYI: Took 19 hours from the time that I created the PR #5040 till the PR tester started testing the build as shown here. Is the PR testing system that backed up? What could cause a 19 hour delay before it even starts testing a PR?
FYI: I manually merged that branch into the 'atdm-nightly' branch last night, so the installs on 'waterman' are going; therefore, a massive delay in the PR tester like this is not a show stopper in this case.
CC: @trilinos/framework
FYI: More random failures seen in #5040, including some new ones. The testing iteration last night shown in https://github.com/trilinos/Trilinos/pull/5040#issuecomment-488118973 showed the builds Trilinos_pullrequest_intel_17.0.1 (Build Num: 3228) and Trilinos_pullrequest_gcc_7.2.0 (Build Num: 1408) failing with build and test failures. Yet without changing anything on that branch, the next testing iteration shown in https://github.com/trilinos/Trilinos/pull/5040#issuecomment-488206488 passed.
The Trilinos_pullrequest_intel_17.0.1 build shown here failed with errors like:
/projects/sems/install/rhel6-x86_64/sems/compiler/intel/17.0.1/mpich/3.2/bin/mpicxx: line 282: 33698 Segmentation fault (core dumped) $Show $CXX ${final_cppflags} $PROFILE_INCPATHS ${final_cxxflags} "${allargs[@]}" -I$includedir
What is up with that?
And the Trilinos_pullrequest_gcc_7.2.0 build shown here had hundreds of test failures with errors like:
-------------------------------------------------------------------------
Open MPI was unable to obtain the username in order to create a path
for its required temporary directories. This type of error is usually
caused by a transient failure of network-based authentication services
(e.g., LDAP or NIS failure due to network congestion), but can also be
an indication of system misconfiguration.
Please consult your system administrator about these issues and try
again.
--------------------------------------------------------------------------
[ascic158:80974] 5 more processes have sent help message help-orte-runtime.txt / orte:session:dir:nopwname
[ascic158:80974] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
But all of this passed the next testing iteration. Why is that?
Are these bad nodes? Seems like some analysis would be good to try to avoid this on future PR jobs.
@trilinos/framework
Looks like PR builds are crashing due to new "clean_workspace" module load problems. See https://github.com/trilinos/Trilinos/pull/5068#issuecomment-488587688.
For that PR iteration, all of the builds are crashing except for the Intel 17.0.1 build showing:
Cleaning directory pull_request_test due to command line option
Traceback (most recent call last):
File "Trilinos/commonTools/framework/clean_workspace/clean_workspace", line 78, in
clean.run()
File "Trilinos/commonTools/framework/clean_workspace/clean_workspace", line 41, in run
self.force_clean_space()
File "Trilinos/commonTools/framework/clean_workspace/clean_workspace", line 65, in force_clean_space
module('load', 'sems-env')
File "/scratch/trilinos/workspace/trilinos-folder/Trilinos_pullrequest_gcc_4.8.4/Trilinos/commonTools/framework/clean_workspace/Modules.py", line 181, in module
return Module().module(command, *arguments)
File "/scratch/trilinos/workspace/trilinos-folder/Trilinos_pullrequest_gcc_4.8.4/Trilinos/commonTools/framework/clean_workspace/Modules.py", line 151, in module
raise RuntimeError(stderr)
RuntimeError: ModuleCmd_Load.c(208):ERROR:105: Unable to locate a modulefile for 'sems-env'
Build step 'Execute shell' marked build as failure
Archiving artifacts
Finished: FAILURE
It looks like many/all of the recent PR builds are crashing due to this (see #5071, #5063, #5066, #5064, #5062, #5054, ...). There does not seem to be any pattern to which builds are showing this problem.
FYI: The Trilinos PR testing system seems to be a bit overloaded currently. For my recent PR #5215, it took 7 hours for the PR tester to even start testing the branch. It seems the rebuilds have not been implemented yet since it took the PR tester 5 hours to run the builds and tests. So the total time from when the PR was created to when it was pushed was 12 hours.
FYI: I have been running an experiment with the post-push CI build that I run where I rebuild every iteration; it has been going since 3/7/2019 (see c923294) and rebuilt successfully for 230 consecutive CI builds as shown here. This just failed today due to the merging of some Panzer changes in commit 6cede2e from PR #5228. @rppawlo confirmed that you need to wipe the Panzer build dir to fix this.
I manually wiped out the Panzer build directory with:
$ rm -r BUILD/packages/panzer/
so next time the post-push CI build runs, it will rebuild everything in Panzer (but it will be very fast because all of the other object files are still there).
Note that as shown in this query, the median rebuild time is about 22 minutes compared to 4 hours for a complete from-scratch build.
That shows how successful rebuilds would be for the Trilinos PR system. It would be huge.
Ross,
We turned on incremental rebuilds two days ago, but Roger's merge meant we had to do full rebuilds. Ride has been running longer and does seem to be very useful.
From: Bartlett, Roscoe A Sent: Monday, August 05, 2019 12:26 PM To: trilinos-framework@software.sandia.gov Subject: FYI: Randomly failing test MueLu_Maxwell3D-Tpetra-Stratimikos_MPI_4
Hello Trilinos Framework team,
FYI: Looks like the test MueLu_Maxwell3D-Tpetra-Stratimikos_MPI_4 is randomly failing in the GCC 4.8.4 build (which should match the GCC 4.8.4 PR build). As shown at:
https://testing-dev.sandia.gov/cdash/index.php?project=Trilinos&filtercount=3&showfilters=1&filtercombine=and&field1=buildname&compare1=61&value1=Linux-GCC-4.8.4-MPI_RELEASE_DEBUG_SHARED_PT_OPENMP_CI&field2=groupname&compare2=61&value2=Continuous&field3=buildstarttime&compare3=83&value3=16%20weeks%20ago
it failed twice since 7/25/2019.
This will be bringing down PR builds randomly.
Can someone create a GitHub issue to address this?
-Ross
-----Original Message----- From: CDash admin@cdash.org Sent: Monday, August 05, 2019 11:18 AM To: Bartlett, Roscoe A rabartl@sandia.gov Subject: [EXTERNAL] FAILED (t=1): Trilinos/MueLu - Linux-GCC-4.8.4-MPI_RELEASE_DEBUG_SHARED_PT_OPENMP_CI - Continuous
A submission to CDash for the project Trilinos has failing tests. You have been identified as one of the authors who have checked in changes that are part of this submission or you are listed in the default contact list.
Details on the submission can be found at https://testing.sandia.gov/cdash/buildSummary.php?buildid=5418940
Project: Trilinos
SubProject: MueLu
Site: ceerws1113
Build Name: Linux-GCC-4.8.4-MPI_RELEASE_DEBUG_SHARED_PT_OPENMP_CI
Build Time: 2019-08-05T14:46:26 UTC
Type: Continuous
Tests not passing: 1
Tests failing: MueLu_Maxwell3D-Tpetra-Stratimikos_MPI_4 | Completed (Failed) | (https://testing.sandia.gov/cdash/testDetails.php?test=85726525&build=5418940)
-CDash on testing.sandia.gov
@trilinos/framework
Looks like we have a randomly failing test, SEACASIoss_pamgen_exodus_io_info, in the CUDA and GCC PR builds, as already reported in #5794 (in the context of ATDM). It just took out my PR build #5916 shown here.
Looks like this test SEACASIoss_pamgen_exodus_io_info has taken out several PR build iterations recently, as shown here.
You might want to disable this test in PR testing going forward until it gets fixed ...
When is the Trilinos PR tester going to be updated to only redo builds that failed and do rebuilds, instead of builds from scratch?
I have a fix for this, will add a PR later today hopefully.
@trilinos/framework
Looks like there is a problem with the PR testing with MPI startup. My PR #6034 showed two failing tests in different PR builds:
both showing output like:
[tr-test-7.novalocal:06582] opal_os_dirpath_create: Error: Unable to create the sub-directory (/tmp/openmpi-sessions-trilinos@tr-test-7_0/5848) of (/tmp/openmpi-sessions-trilinos@tr-test-7_0/5848/0/0), mkdir failed [1]
[tr-test-7.novalocal:06582] [[5848,0],0] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 107
[tr-test-7.novalocal:06582] [[5848,0],0] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 402
[tr-test-7.novalocal:06582] [[5848,0],0] ORTE_ERROR_LOG: Error in file ess_hnp_module.c at line 638
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_session_dir failed
--> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
I don't think it is possible that my code changes could have triggered this MPI startup failure.
Any idea what this is about?
These failures occurred on 2 very different systems (one physical and one VM) and both appear to have sufficient space and permissions for the file to be created. Without knowing more about the routine in MPI I would simply be guessing here.
FYI: As shown in this query, it looks like only my PR is impacted so far. We will see what happens on the next PR iteration.
FYI: The last PR iteration for #6034 passed all the tests. However, someone should keep an eye on the PR testing to see if the strange "Unable to create the sub-directory" error mentioned above takes out any other PR iterations. (That error took out the first 2 PR testing iterations for #6034.)
When is the PR tester going to be extended to only rerun testing for builds that failed, leaving the builds that passed in place? This is critical to improve the robustness of the PR tester and allow for more PR builds to be added without seriously damaging PR testing stability.
Also, when are these PR builds going to utilize rebuilds? The post-push CI testing that has been running for months, shown here, has a median rebuild time of about 20 minutes (whereas the from-scratch build takes about 3.5 hours) and has been very robust for several months now, rebuilding over and over again. That would massively speed up PR testing.
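For reference, a minimal sketch of what an incremental-rebuild PR iteration could look like (assumed paths and names, not the real TFW_testing_single_configure_prototype code): the binary directory is simply reused rather than emptied, so only targets affected by the new commits get rebuilt.

# Hypothetical incremental-rebuild ctest -S iteration: there is no
# ctest_empty_binary_directory() / file(REMOVE_RECURSE) step, so existing
# object files are reused (Jenkins is assumed to have updated the source).
set(CTEST_SOURCE_DIRECTORY "$ENV{WORKSPACE}/Trilinos")
set(CTEST_BINARY_DIRECTORY "$ENV{WORKSPACE}/pull_request_test")   # kept between iterations
set(CTEST_CMAKE_GENERATOR "Unix Makefiles")
set(CTEST_SITE "example-site")
set(CTEST_BUILD_NAME "example-pr-rebuild")

ctest_start("Experimental")
ctest_configure()   # reuses the existing CMakeCache.txt
ctest_build()       # rebuilds only out-of-date targets
ctest_test()
ctest_submit()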
FYI: Another randomly failing test killing a PR iteration https://github.com/trilinos/Trilinos/pull/6045#issuecomment-538617025 and requiring all of the PR builds to be run from scratch! This is a huge waste of resources. If the PR tester would rebuild and just rerun the build that failed, it would have taken just a couple of minutes and used almost no computing resources.
FYI: We should have this randomly failing set of tests fixed in PR #6050
@trilinos/framework
The last PR iteration for #6050 shown here was very fast. Was that using a rebuild? Can the PR tester do a rebuild in some cases? If so, why not in every case?
starting last night
starting what?
Broken PR testing
@prwolfe See, for example, #6109.
Hmm, this looks like the entire network went down for a bit last night. I see failures submitting to CDash, Java errors in Jenkins, NFS errors across the board, and a few clone errors due to timeouts. I will ask a few questions.
Any update on this? The links to cdash PR results are dead, but maybe this is an unrelated issue?
@jhux2 - I just came back from talking with the CEE team. They have had network issues all week and had to reset the Jenkins instance yesterday from a snapshot. That snapshot has the PR testing turned off. They are seeing fewer issues today but have no reason to think that corporate has everything fixed for good.
I cannot try turning that back on if CDash is down as it will fail for that reason (it's all connected). I just sent an email to Ross and others about the CDash instance and will try everything when that comes back.
@prwolfe Thank you for the update!
@trilinos/framework,
Given that the most recent merge was:
96f63b1 "Merge Pull Request #6105 from bartlettroscoe/Trilinos/6017-atdm-allow-enable-percept"
Author: trilinos-autotester <trilinos-autotester@trilinos.org>
Date: Wed Oct 16 11:00:20 2019 -0600 (3 days ago)
it seems that PR testing is not back online yet.
Might I suggest that you switch the PR testing from testing.sandia.gov/cdash/ to testing-dev.sandia.gov/cdash/ for now? That will get PR testing back online. Also, please consider doing dual submits for the PR builds to both sites. It is nice having a backup site.
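A rough sketch of the dual-submit idea, assuming the PR driver is a ctest -S script and CMake 3.14 or newer (which added the SUBMIT_URL option to ctest_submit); the URLs below are only guesses at the drop locations, not the exact ones the PR system uses:

# Hypothetical excerpt from the submit step of the PR driver: push the same
# results to both CDash instances, treating the -dev site as primary.
ctest_submit(SUBMIT_URL "https://testing-dev.sandia.gov/cdash/submit.php?project=Trilinos"
  RETURN_VALUE submit_dev_rv)
ctest_submit(SUBMIT_URL "https://testing.sandia.gov/cdash/submit.php?project=Trilinos"
  RETURN_VALUE submit_backup_rv)
if(NOT submit_dev_rv EQUAL 0)
  message(WARNING "Primary CDash submit failed (rv=${submit_dev_rv})")
endif()
if(NOT submit_backup_rv EQUAL 0)
  message(WARNING "Backup CDash submit failed (rv=${submit_backup_rv}); continuing anyway")
endif()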
@bartlettroscoe - I changed to testing-dev this morning and am seeing the same results. Up to and including the testing-dev site giving me a "500 error". I am also seeing more general network errors this morning, so I will follow up on that.
???
Strange, because I just now went to:
and it came up fine. But I am not seeing any "Pull Request" results there yet. Is a PR build running that should be submitting results?
Anyone else not able to get to https://testing-dev.sandia.gov/cdash/index.php?project=Trilinos ?
works for me
FYI: I am starting to see PR results showing up at:
So it looks like the PR tester might be back online?
No, just instance 0 – and only 3 of the 7 are reporting. I am still seeing stuff like that “500 error” in the log and it looks like the proxy setting for the cloud nodes might be different for testing-dev for whatever reason.
This kind of stuff should already be dealt with…
In my three waiting PRs, I do not see evidence that the PR tester is back online.
I’m not seeing anything either for my PR submitted last week. Also, ROL’s nightly tests aren’t posting to testing. I get an http error when looking for the test results. Is this related?
Also, ROL’s nightly tests aren’t posting to testing. I get an http error when looking for the test results. Is this related?
@dridzal, replace 'testing.sandia.gov/cdash/' with 'testing-dev.sandia.gov/cdash/' and your results should be there.
@trilinos/framework
Description
Over the last few weeks and months, the Trilinos auto PR tester has seen several cases where one or more PR builds for a given PR testing iteration failed to produce results on CDash or showed build or test failures that were not related to the changes on that particular PR.
This Story is to log these failures and keep track of them in order to provide some statistics that can inform how to address them. This should replace making comments in individual PRs that exhibit these types of problems, like #3260 and #3213.
PR Builds Showing Random Failures
Below are a few examples of the stability problems (but not all of them).