trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

Trilinos auto PR tester stability issues #3276

Closed bartlettroscoe closed 1 year ago

bartlettroscoe commented 6 years ago

@trilinos/framework

Description

Over the last few weeks and months, the Trilinos auto PR tester has seen several cases where one or more PR builds for a given PR testing iteration failed to produce results on CDash or showed build or test failures that were not related to the changes on that particular PR.

This Story is to log these failures and keep track of them, in order to provide some statistics that can inform how to address them. This should replace making comments in individual PRs that exhibit these types of problems, like #3260 and #3213.

PR Builds Showing Random Failures

Below are a few examples of the stability problems (this is not an exhaustive list).

| PR ID | Num PR Builds to reach passing | First test trigger | Start first test | Passing test | Merge PR |
|---|---|---|---|---|---|
| #3258 | 2 | 8/8/2018 2:35 PM ET | 8/8/2018 2:44 PM | 8/8/2018 9:15 PM ET | Not merged |
| #3260 | 4 | 8/8/2018 5:22 PM ET | 8/8/2018 6:31 PM ET | 8/10/2018 4:13 AM ET | 8/10/2018 8:25 AM |
| #3213 | 3 | 7/31/2018 4:30 PM ET | 7/31/2018 4:57 PM ET | 8/1/2018 9:48 AM ET | 8/1/2018 9:53 AM ET |
| #3098 | 4 | 7/12/2018 12:52 PM ET | 7/12/2018 1:07 PM ET | 7/13/2018 11:12 PM ET | 7/14/2018 10:59 PM ET |
| #3369 | 6 | 8/29/2018 9:08 AM ET | 8/29/2018 9:16 AM ET | 8/31/2018 6:09 AM ET | 8/31/2018 8:33 AM ET |

bartlettroscoe commented 6 years ago

Over the past few months, many of the ATDM Trilinos build-script update PRs have experienced failed PR builds that had nothing to do with the changes on the PR branch. Below I list the number of PR testing iterations it took before all of the PR builds for a given iteration passed, allowing the merge. I also list the time when the last commit was pushed or the PR was created (which should trigger a PR build), the time when the first test started, the time of the final passing PR build, and the time of the merge:

| PR ID | Num PR Builds to reach Passing | Num (False) failed PR Builds | First test trigger | Start first test | Passing test | Merge PR |
|---|---|---|---|---|---|---|
| #3260 | 4 | 3 | 8/8/2018 5:22 PM ET | 8/8/2018 6:31 PM ET | 8/10/2018 4:13 AM ET | 8/10/2018 8:25 AM |
| #3213 | 3 | 2 | 7/31/2018 4:30 PM ET | 7/31/2018 4:57 PM ET | 8/1/2018 9:48 AM ET | 8/1/2018 9:53 AM ET |
| #3098 | 4 | 3 | 7/12/2018 12:52 PM ET | 7/12/2018 1:07 PM ET | 7/13/2018 11:12 PM ET | 7/14/2018 10:59 PM ET |

bartlettroscoe commented 6 years ago

I think the ATDM Trilinos PRs tend to see more of these types of failures because the current PR testing system triggers the build and testing of every Trilinos package for any change to any file under the cmake/std/atdm/ directory. That will be resolved once #3133 is resolved (and PR #3258 is merged). But other PRs that trigger the enable of a lot of Trilinos packages may still run into this issue.

bartlettroscoe commented 6 years ago

We have been seeing similar problems over the last few days with PR #3258, with the PR builds shown here. The trend is that the Intel 17.0.1 PR builds all pass just fine; the problem comes with the GCC 4.8.4 and GCC 4.9.3 builds, and all of the failures for these builds occur only on the Jenkins node 'ascic142'. The GCC 4.8.4 and 4.9.3 PR builds that run on the other nodes 'ascic143', 'ascic157', and 'ascic158' all pass. That suggests there is something different about the node 'ascic142' that is causing these builds to fail in a way that is not occurring on the other nodes.

Something similar occurred with PR #3260 with PR build results shown here. In that case, 3 of the 4 failing PR builds were on 'ascic142', and that included build failures with empty build error messages. The other failing PR build was on 'ascic158' and that was two timing-out tests.

All of this suggests:

  1. There may be something wrong with 'ascic142' or at least something different from the other build nodes that may be causing more failures.
  2. The machines may be getting loaded too much and that is causing builds to crash and tests to timeout.

@trilinos/framework, can someone look into these issues? The impact of this problem on the ATDM work will mostly go away once PR #3258 is merged, but getting that merged requires the PR builds to pass, which they are having trouble doing.

Below is the data for the first round of failures being seen in PR #3258.

| PR ID | Num PR Builds to reach Passing | Num (False) failed PR Builds | First test trigger | Start first test | Passing test | Merge PR |
|---|---|---|---|---|---|---|
| #3258 | 2 | 1 | 8/8/2018 2:35 PM ET | 8/8/2018 2:44 PM | 8/8/2018 9:15 PM ET | Not merged |

And after a push of commits last night, the first PR testing iteration failed for PR #3258 as well so a new cycle has started.

bartlettroscoe commented 6 years ago

And node 'ascic142' strikes again, this time killing the PR test iteration for the new PR #3278 shown here. The results for the failing build Trilinos_pullrequest_gcc_4.9.3 are not showing up on CDash here. The Jenkins output for this failing build, shown here, indicates that it ran on 'ascic142' and produced the output:

[Trilinos_pullrequest_gcc_4.9.3] $ /usr/bin/env bash /tmp/jenkins4565491669098842090.sh
trilinos
/usr/bin/env
/bin/env
/bin/env
/usr/bin/env
changed-files.txt
gitchanges.txt
packageEnables.cmake
pull_request_test
TFW_single_configure_support_scripts
TFW_single_configure_support_scripts@tmp
TFW_testing_single_configure_prototype
TFW_testing_single_configure_prototype@tmp
TribitsDumpDepsXmlScript.log
Trilinos
TrilinosPackageDependencies.xml
git remote exists, removing it.
error: RPC failed; curl 56 Proxy CONNECT aborted
fatal: The remote end hung up unexpectedly
Source remote fetch failed. The error code was: 128
Build step 'Execute shell' marked build as failure
Finished: FAILURE

Not sure what "error: RPC failed; curl 56 Proxy CONNECT aborted" means, but it seems to have killed the git fetch.

@trilinos/framework, could we consider a retry and wait loop for these git communication operations?
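For concreteness, here is a rough sketch of the kind of retry-and-wait wrapper I am suggesting, assuming it would be called from the Jenkins "Execute shell" step in place of a bare git fetch (the function name and retry counts below are purely illustrative, not part of any existing driver script):

# Hypothetical retry/wait wrapper around git fetch (illustrative only).
fetch_with_retries() {
  local remote="$1" refspec="$2"
  local max_attempts=5
  local wait_seconds=60
  local attempt=1
  while ! git fetch --tags "${remote}" "${refspec}"; do
    if [ "${attempt}" -ge "${max_attempts}" ]; then
      echo "ERROR: git fetch from ${remote} failed after ${max_attempts} attempts" >&2
      return 1
    fi
    echo "git fetch from ${remote} failed (attempt ${attempt}); retrying in ${wait_seconds}s ..." >&2
    sleep "${wait_seconds}"
    attempt=$((attempt + 1))
  done
}

# Example:
#   fetch_with_retries https://github.com/trilinos/Trilinos '+refs/heads/*:refs/remotes/origin/*'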

bartlettroscoe commented 6 years ago

FYI: The latest PR testing iteration for #3225 shown here failed due to bogus build failures on 'ascic142' as shown here.

jwillenbring commented 6 years ago

@bartlettroscoe Thank you for this information. I started to investigate this further by putting more than half of the executors on ascic142 to sleep for 10 days (we can always kill the job if we need to). This means that only 1 PR testing job can run at a time on that node for the next 10 days. The small, 1 executor instance jobs can run along with one PR testing job for now. We could clamp that down too if the issues persist. If we do that (allow only 1 PR job to run on the node and nothing else) and the failures persist, I think we need to consider taking the node down or asking someone to look into the issues. 142 runs a lot of jobs for us. It is possible that 143 for example typically does not have 2 jobs running at the same time, but if it did, it would fail more often too. We'll see what happens anyway.

bartlettroscoe commented 6 years ago

@jwillenbring, thanks for taking these steps. It is hard to monitor the stability of the PR testing process just by looking at CDash, since we expect failures in some PRs due to code changes, and if results don't show up at all, we never see them.

Is it possible to log cases where PR builds don't successfully submit results to CDash or don't post comments to GitHub for some reason? This might just be a global log file that the PR testing Jenkins jobs write to whenever an error is detected. For that matter, it would be good to also log every successful PR run (which just means nothing crashed and no communication failed). This type of data would be useful from a research perspective on the stability of the CI testing process, and it would provide a clear metric to see whether changes to the PR process are improving stability or not.
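Even something as simple as the following, appended at the end of each PR Jenkins job, would be enough to start gathering this data (this is purely hypothetical; the log path and field names are placeholders, not anything that exists today):

# Hypothetical helper for recording the outcome of every PR build job in one
# shared log file (path and field names are placeholders).
PR_STATUS_LOG=/scratch/trilinos/pr-tester-status.log

log_pr_run_status() {
  # Usage: log_pr_run_status <pr-number> <build-name> <PASSED|FAILED> <details>
  echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) PR=$1 BUILD=$2 STATUS=$3 DETAILS='$4'" \
    >> "${PR_STATUS_LOG}"
}

# Examples:
#   log_pr_run_status 3260 Trilinos_pullrequest_gcc_4.8.4 FAILED "CDash submit timed out"
#   log_pr_run_status 3260 Trilinos_pullrequest_intel_17.0.1 PASSED ""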

bartlettroscoe commented 6 years ago

@trilinos/framework,

More stability problems. Even after the merge of #3258 and the completion of #3133 (still in review but complete otherwise) so that no packages should be built or tested for changes to ATDM build scripts, we still got a failed PR iteration as shown for the PR #3309 here. The new Jenkins output in that comment showed that the build Trilinos_pullrequest_gcc_4.8.4 failed due to a git fetch error:

 > git fetch --tags --progress git@gitlab-ex.sandia.gov:trilinos-project/TFW_single_configure_support_scripts.git +refs/heads/*:refs/remotes/origin/*
ERROR: Error cloning remote repo 'origin'
hudson.plugins.git.GitException: Command "git fetch --tags --progress git@gitlab-ex.sandia.gov:trilinos-project/TFW_single_configure_support_scripts.git +refs/heads/*:refs/remotes/origin/*" returned status code 128:
stdout: 
stderr: ssh: connect to host gitlab-ex.sandia.gov port 22: Connection timed out
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.

What is the timeout for that git fetch? None is listed. Can we set a timeout of 30 minutes or so?

Also, do all of the git fetches occur at the same time for all three PR builds? If so, you might avoid these problems by staggering the start of the PR builds by one minute each. That will add little time to the overall PR time but may make the git fetch more robust.
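For example, something like the following at the top of each build's shell step would stagger the fetches (the BUILD_INDEX variable here is hypothetical; it would be 0, 1, or 2 for the three PR builds):

# Hypothetical stagger: each PR build sleeps a different number of minutes
# before its git fetch so the fetches do not all hit the proxy at once.
sleep $(( ${BUILD_INDEX:-0} * 60 ))
git fetch --tags https://github.com/trilinos/Trilinos '+refs/heads/*:refs/remotes/origin/*'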

If Jenkins continues to be this fragile for git operations, you might want to do the git operations in your own driver scripts and put in a loop of retries. I think that is how CTest gets robust git fetches.

bartlettroscoe commented 6 years ago

Another data-point ...

The PR #3312 iteration https://github.com/trilinos/Trilinos/pull/3312#issuecomment-413678084 showed the GitHub fetch failure:

 > git fetch --tags --progress https://github.com/trilinos/Trilinos +refs/heads/*:refs/remotes/origin/* # timeout=20
ERROR: Timeout after 20 minutes
ERROR: Error fetching remote repo 'origin'

Jenkins needs to be made more robust for git fetches with loops of retries, or someone needs to write scripts that do this manually with loops (that is what ctest -S does to robustly submit to CDash). I can almost guarantee that is what TravisCI and other mature CI systems do to robustly run against GitHub and other external sites.

jwillenbring commented 6 years ago

@bartlettroscoe

Jenkins needs to be made more robust for git fetches with loops of retries, or someone needs to write scripts that do this manually with loops

@william76 is looking into this with the pipeline capability and @allevin with the autotester. I cannot figure out why Jenkins does so many clones instead of updates once it has a repo. Aaron said it is not like this for SST.

jwillenbring commented 6 years ago

@bartlettroscoe

Another data-point ...

The PR #3312 iteration #3312 (comment) showed the GitHub fetch failure:

This communication failure happened on ascic157, so ascic142 is not the only node having issues with communication.

jwillenbring commented 6 years ago

What is the timeout for that git fetch? None is listed. Can we set a timeout of 30 minutes or so?

By default it is 10 minutes. We upped it to 20 and there seemed to be little effect.

Also, do all of the git fetches occur at the same time for all three PR builds? If so, you might avoid these problems by staggering the start of the PR builds by one minute each. That will add little time to the overall PR time but may make the git fetch more robust.

The individual PR testing builds get put in the queue and get assigned to a node. Sometimes this happens almost instantly, sometimes it takes a while.

bartlettroscoe commented 6 years ago

@jwillenbring, I think if you do a 'git fetch' and there is a network issue right then, it will just return with an error. I think what we need is a loop with waits and retries.

Also, could you turn on the Jenkins project option "Add timestamps to the Console Output"? That might help us see whether these commands are timing out or are crashing before the timeout.

bartlettroscoe commented 6 years ago

@jwillenbring said:

This communication failure happened on ascic157, so ascic142 is not the only node having issues with communication.

I think most of the failures on 'ascic142' that are reported above are build or test failures after the clones/updates are successful. The problem on 'ascic142' is not communication, it is overloading (or something related).

bartlettroscoe commented 6 years ago

And another git fetch failure in https://github.com/trilinos/Trilinos/pull/3316#issuecomment-414034806 showing:

 > git fetch --tags --progress git@gitlab-ex.sandia.gov:trilinos-project/TFW_single_configure_support_scripts.git +refs/heads/*:refs/remotes/origin/*
ERROR: Error fetching remote repo 'origin'

Note that is not the Trilinos repo but the tiny little TFW_single_configure_support_scripts.git git repo. That can't be a timeout. Why is Jenkins so non-robust with git?

bartlettroscoe commented 6 years ago

NOTE: The ctest -S script support built into CTest, which does clones and updates, runs on the exact same machines and networks as this PR testing system, and I don't think we see even a small fraction of this number of git fetch failures there. For example, just look at the 'Clean' builds that run on the same ascic-jenkins build farm machines over the last 5.5 months in this query. Out of those 475 builds, we see 1 build with a git update failure (on 6/14/2018). That is far more robust than what we are seeing from the PR tester. Therefore, one has to conclude that the problem is not the machines or the network; the problem must be the software doing the updating (and its lack of built-in robustness).

mhoemmen commented 6 years ago

In PR tests, we are seeing build breaks and test failures that have nothing to do with the PR in question. See this example:

https://github.com/trilinos/Trilinos/pull/3359#issuecomment-416840456

Sometimes the PR tests fail because of lack of communication with some server, but now they are failing because of MueLu build errors and Tempus test failures. The latter may pass or fail intermittently, but it’s not clear to me how those build errors could have gotten through PR testing.

After discussion with Jim Willenbring, it looks like these build errors may come from test machines being overloaded. Ditto for Tempus failures perhaps, though I haven't investigated those in depth.

bartlettroscoe commented 6 years ago

FYI: It looks like the problems with the PR tester are not just random crashes. It looks like it can also retest a PR branch even after PR testing passed and there was no trigger to force a new set of PR testing builds. See https://github.com/trilinos/Trilinos/pull/3356#issuecomment-417310605 for an example of this.

bartlettroscoe commented 6 years ago

FYI: My last PR #3369 required 6 auto PR testing iterations before it allowed the merge, over two days after the start.

bartlettroscoe commented 6 years ago

FYI: Stability problems continue https://github.com/trilinos/Trilinos/pull/3455#issuecomment-422186613. That one PR iteration had a git clone error for one build and a CDash submit failure for another build.

mhoemmen commented 6 years ago

See also https://github.com/trilinos/Trilinos/pull/3439#issuecomment-422232441

bartlettroscoe commented 5 years ago

@trilinos/framework,

FYI: More stability issues with the auto PR tester.

In https://github.com/trilinos/Trilinos/pull/3546#issuecomment-426380831 you see the Trilinos_pullrequest_intel_17.0.1 build failing due to a git pull failure:

fatal: unable to access 'https://github.com/hkthorn/Trilinos/': Proxy CONNECT aborted
Source remote fetch failed. The error code was: 128

and in the Trilinos_pullrequest_gcc_4.8.4 build you see a problem submitting results to CDash, showing:

Error when uploading file: /scratch/trilinos/workspace/trilinos-folder/Trilinos_pullrequest_gcc_4.8.4/pull_request_test/Testing/20181002-1822/Configure.xml
Error message was: Failed to connect to testing-vm.sandia.gov port 80: Connection timed out
Problems when submitting via HTTP
configure submit error = -1
CMake Error at /scratch/trilinos/workspace/trilinos-folder/Trilinos_pullrequest_gcc_4.8.4/TFW_testing_single_configure_prototype/simple_testing.cmake:172 (message):
Configure failed with error -1

And in another PR testing iteration shown in https://github.com/trilinos/Trilinos/pull/3549#issuecomment-426397742 you see the Trilinos_pullrequest_gcc_4.8.4 build also failing due to inability to submit to CDash showing:

Error when uploading file: /scratch/trilinos/workspace/trilinos-folder/Trilinos_pullrequest_gcc_4.8.4/pull_request_test/Testing/20181002-1915/Configure.xml
Error message was: Failed to connect to testing-vm.sandia.gov port 80: Connection timed out
Problems when submitting via HTTP
configure submit error = -1
CMake Error at /scratch/trilinos/workspace/trilinos-folder/Trilinos_pullrequest_gcc_4.8.4/TFW_testing_single_configure_prototype/simple_testing.cmake:172 (message):
Configure failed with error -1

And actually, we are also seeing configure failures for the Trilinos_pullrequest_gcc_4.8.4 build in these two unrelated PR iterations, showing:

Starting configure step.
Each . represents 1024 bytes of output
................. Size of output: 16K
Error(s) when configuring the project

It is impossible for those PR branches to trigger a configure failure because the PR builds don't currently use the ATDM configuration scripts (unless something has changed, but I don't think so).

What is going on with the auto tester with the Trilinos_pullrequest_gcc_4.8.4 build?

bartlettroscoe commented 5 years ago

CC: @trilinos/framework

And here is a new one: https://github.com/trilinos/Trilinos/pull/3546#issuecomment-426412703

This time the Trilinos_pullrequest_gcc_4.9.3 build crashed due to:

Checking out Revision bb78697920292c58562e47fbae13843a79c29e55 (refs/remotes/origin/develop)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f bb78697920292c58562e47fbae13843a79c29e55
hudson.plugins.git.GitException: Command "git checkout -f bb78697920292c58562e47fbae13843a79c29e55" returned status code 128:
stdout: 
stderr: fatal: Unable to create '/scratch/trilinos/workspace/trilinos-folder/Trilinos_pullrequest_gcc_4.9.3/Trilinos/.git/index.lock': File exists.
If no other git process is currently running, this probably means a
git process crashed in this repository earlier. Make sure no other git
process is running and remove the file manually to continue.

Does this mean that the local git repo is corrupted? Are two PR jobs running on top of each other?

prwolfe commented 5 years ago

We see this a bit on the Sierra clones. It means either two processes ran at once (not likely here, but possible as this is threaded) or that the NFS filesystem has not been fast enough for git and so a subsequent query is seeing the lock file from a previous one. Since this is scratch I'm betting against that one.

We tend to live with this one, and curse it.

prwolfe commented 5 years ago

@trilinos/framework,

FYI: More stability issues with the auto PR tester.

In #3546 (comment) you see the Trilinos_pullrequest_intel_17.0.1 build failing due to a git pull failure:

It is impossible for those PR branches to trigger a configure failure because the PR builds don't currently use the ATDM configuration scripts (unless something has changed, but I don't think so).

What is going on with the auto tester with the Trilinos_pullrequest_gcc_4.8.4 build?

The real failure here is not the configure itself, but the HTTP submission, which CTest interprets as a configure failure - go figure.

As to why the proxy fails sometimes like this I have no idea.

Paul

bartlettroscoe commented 5 years ago

FYI: Last week, PR #3559 had the Trilinos auto PR tester fail 8 consecutive times before it finally passed on the 9th try. On the bright side, since this PR only changed files under Trilinos/cmake/std/atdm/, which currently does not trigger any testing of any Trilinos package, the iterations were pretty fast. But even with that, it still took over two days to get a passing build to allow a merge.

bartlettroscoe commented 5 years ago

FYI: @jwillenbring informed me that the problem last week was that the network changed so that the machines could not communicate with each other, which basically brought down PR testing for 3 days.

But even now there are still stability problems. For example, we had to rebase the branch for PR #3559 yesterday to address a merge conflict, which kicked off a new round of PR testing, and the first PR iteration shown here failed, with the Trilinos_pullrequest_intel_17.0.1 build failing due to a git fetch error:

 > git fetch --tags --progress https://github.com/trilinos/Trilinos +refs/heads/*:refs/remotes/origin/* # timeout=20
ERROR: Error fetching remote repo 'origin'
hudson.plugins.git.GitException: Failed to fetch from https://github.com/trilinos/Trilinos
    at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:888)
    at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:1155)
    at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1186)
    at org.jenkinsci.plugins.multiplescms.MultiSCM.checkout(MultiSCM.java:143)
    at hudson.scm.SCM.checkout(SCM.java:504)
    at hudson.model.AbstractProject.checkout(AbstractProject.java:1208)
    at hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:574)
    at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
    at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:499)
    at hudson.model.Run.execute(Run.java:1794)
    at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
    at hudson.model.ResourceController.execute(ResourceController.java:97)
    at hudson.model.Executor.run(Executor.java:429)
Caused by: hudson.plugins.git.GitException: Command "git fetch --tags --progress https://github.com/trilinos/Trilinos +refs/heads/*:refs/remotes/origin/*" returned status code 128:
stdout: 
stderr: error: The requested URL returned error: 403 Forbidden while accessing https://github.com/trilinos/Trilinos/info/refs
fatal: HTTP request failed

We really need to make the PR tester more robust to git communication failures by putting in retries. There is no evidence that any retries are being attempted.

But the good news is that the next PR iteration passed which allowed the merge.

bartlettroscoe commented 5 years ago

FYI: Git fetch failed for Trilinos_pullrequest_intel_17.0.1 build on ascic158-trilinos just now shown at https://github.com/trilinos/Trilinos/pull/3612#issuecomment-429351755.

bartlettroscoe commented 5 years ago

FYI: Git fetch failed for Trilinos_pullrequest_gcc_4.8.4 build on ascic142-trilinos just now as shown at https://github.com/trilinos/Trilinos/pull/3642#issuecomment-430354188.

bartlettroscoe commented 5 years ago

@trilinos/framework

As shown in this query, as of today, out of 684 ATDM Trilinos builds using the TriBITS CTest -S driver since 9/30/2018, there was just a single git update error from the Trilinos GitHub site (shown with a red SHA1 under the "Update" column for the build 'Trilinos-atdm-hansen-shiller-cuda-8.0-opt' on 'hansen' today). Based on that data, that is a git update failure rate of less than 0.15%.

Just from my recent experience with the Trilinos PRs that I have been CCed on, the Trilinos PR tester is experiencing GitHub fetch failure rates much higher than that (but it would take a lot of manual work to compute a failure rate over the last two weeks). Why? The ATDM Trilinos builds are running on a bunch of clusters, including SNL HPC machines and Test Bed machines, which one would think would be less reliable than the ascic Jenkins cluster machines.

bartlettroscoe commented 5 years ago

FYI: Git fetch failure for github Trilinos for build Trilinos_pullrequest_gcc_4.8.4 on ascic144-trilinos shown just now at https://github.com/trilinos/Trilinos/pull/3657#issuecomment-430814760. I see the timeout for that op is 20 minutes. Can you increase this?

Again, why are the other automated builds on the SRN showing less than 0.5% git fetch failures but we are seeing so many of these failures in the PR builds?

prwolfe commented 5 years ago

FYI: Git fetch failure for github Trilinos for build Trilinos_pullrequest_gcc_4.8.4 on ascic144-trilinos shown just now at #3657 (comment). I see the timeout for that op is 20 minutes. Can you increase this?

Again, why are the other automated builds on the SRN showing less than 0.5% git fetch failures but we are seeing so many of these failures in the PR builds?

I don't think increasing the timeout will help, as that was not the failure. It's a 403 error:

The requested URL returned error: 403 Forbidden

We did increase these in the past as a full clone takes about 8-12 minutes and the default was 10.

bartlettroscoe commented 5 years ago

I don't think increasing the timeout will help as that was not the failure. It's a 403 error

@prwolfe, why are we not seeing more of these git communication failures in the other automated builds of Trilinos posting to CDash all across the SRN? What is different about the Trilinos PR builds that they are seeing so many of these? The Trilinos PR builds need to be the most robust of all the Trilinos automated builds, but they are the least robust.

prwolfe commented 5 years ago

The proxy errors and 403 errors are amenable to retry loops, but there is an issue with getting more than one executor from pipeline scripts. @william76 has been working on that side of this.

As for the need for robustness, we should talk about this and the overall goal next week.

jhux2 commented 5 years ago

In case it's helpful, the PR tester woofed last night on PR #3667 on the Intel build (comment).

bartlettroscoe commented 5 years ago

@jhux2 said:

In case it's helpful, the PR tester woofed last night on PR #3667 on the Intel build (comment).

Looks like the Trilinos_pullrequest_intel_17.0.1 build in that case was not able to submit final results to CDash.

@trilinos/framework, have you turned on the CTest retry feature yet to submit to CDash? This feature has been turned on and used with the TriBITS CTest -S driver for almost 10 years (and we paid Kitware to add that feature way back).

prwolfe commented 5 years ago

@jhux2 said:

In case it's helpful, the PR tester woofed last night on PR #3667 on the Intel build (comment).

Looks like the Trilinos_pullrequest_intel_17.0.1 build in that case was not able to submit final results to CDash.

@trilinos/framework, have you turned on the CTest retry feature yet to submit to CDash? This feature has been turned on and used with the TriBITS CTest -S driver for almost 10 years (and we paid Kitware to add that feature way back).

@bartlettroscoe - I am unaware of this feature but we should do that. Please get with me on the details.

Thanks!

jhux2 commented 5 years ago

More issues, PR #3718.

  1. Can't get a compiler license.
  2. I thought changes like those in the PR shouldn't even require compile testing.
jhux2 commented 5 years ago

Another, PR #3726.

bartlettroscoe commented 5 years ago

@jhux2, did you close this by accident? I am reopening.

jhux2 commented 5 years ago

Sorry, hit the wrong button.

bartlettroscoe commented 5 years ago

FYI: https://github.com/trilinos/Trilinos/pull/3735#issuecomment-433139142 failed testing the atdm/README.md file with the build Trilinos_pullrequest_gcc_4.9.3 crashing due to another git fetch failure:

[new branch]      atdm-load-env-bash-login -> source_remote/atdm-load-env-bash-login
fatal: unable to access 'https://github.com/trilinos/Trilinos/': Proxy CONNECT aborted
Origin target remote fetch failed. The error code was: 128
Build step 'Execute shell' marked build as failure
Finished: FAILURE
bartlettroscoe commented 5 years ago

FYI: https://github.com/trilinos/Trilinos/pull/3775#issuecomment-434455518 failed with the Jenkins git fetches for https://github.com/trilinos/Trilinos for the PR builds Trilinos_pullrequest_gcc_4.8.4 and Trilinos_pullrequest_gcc_4.9.3_SERIAL showing:

 > git fetch --tags --progress https://github.com/trilinos/Trilinos +refs/heads/*:refs/remotes/origin/* # timeout=20
ERROR: Timeout after 20 minutes
ERROR: Error cloning remote repo 'origin'
hudson.plugins.git.GitException: Command "git fetch --tags --progress https://github.com/trilinos/Trilinos +refs/heads/*:refs/remotes/origin/*" returned status code 128:
stdout: 
stderr: error: RPC failed; result=56, HTTP code = 0
error: fetch-pack died of signal 15

@trilinos/framework, an alternative to cloning from scratch that I use sometimes is to copy from a pre-cloned local copy of Trilinos and then just update from there. That is so robust that it should be the default.
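A rough sketch of that approach is below (the BASE_CLONE path and PR_BRANCH variable are just placeholders for whatever the PR driver would actually use; WORKSPACE is the usual Jenkins workspace variable):

# Hypothetical "copy from a pre-cloned local repo, then update" step.
BASE_CLONE=/scratch/trilinos/base-clone/Trilinos   # kept up to date separately
WORKSPACE_CLONE=${WORKSPACE}/Trilinos

if [ ! -d "${WORKSPACE_CLONE}/.git" ]; then
  # Copy the local pre-cloned repo instead of cloning over the network.
  cp -a "${BASE_CLONE}" "${WORKSPACE_CLONE}"
fi

cd "${WORKSPACE_CLONE}"
# Only a small incremental fetch has to cross the network now.
git fetch origin
git checkout -f "${PR_BRANCH}"   # branch/SHA1 provided by the PR driver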

bartlettroscoe commented 5 years ago

FYI: https://github.com/trilinos/Trilinos/pull/3775#issuecomment-434485412 failed the next PR (non-testing) testing iteration for this branch. This time the build Trilinos_pullrequest_gcc_4.9.3 crashed due to the non-Jenkins fetch failure:

fatal: unable to access 'https://github.com/trilinos/Trilinos/': Proxy CONNECT aborted
Origin target remote fetch failed. The error code was: 128
Build step 'Execute shell' marked build as failure
Finished: FAILURE
bartlettroscoe commented 5 years ago

@william76,

Following up from your question in https://github.com/trilinos/Trilinos/pull/3821#issuecomment-436803134, as a data point for CDash submits, the post-push CI build running on the CEE RHEL6 machine 'ceerws1113' has never failed to submit data to CDash in all recorded history, which is currently 6 months and 413 builds, as shown here. In hundreds of ATDM builds of Trilinos that submit to CDash from many different machines all over the SRN, I don't know that we have seen a single CDash submit failure. (The only cases where results do not get submitted are when the batch jobs crash before results could be submitted.)

There is something uniquely wrong with the Trilinos PR system, and it is having some bad side effects. For example, people are not following up on code reviews to fix things flagged by a reviewer because they are afraid that the Trilinos PR system will crash on subsequent builds, so they just skip it. As a result, code reviewers are approving PRs just to avoid the risk of running more Trilinos PR iterations.

bartlettroscoe commented 5 years ago

@trilinos/framework,

FYI: The PR iteration https://github.com/trilinos/Trilinos/pull/3998#issuecomment-444528708 says it failed but I can't seem to find anything that actually failed.

One strange thing that I see in the set of builds on CDash for PR #3998 is that the Intel 17.0.1 builds seem to be run up to three different times for each PR iteration! It looks like the first PR iteration ran the Intel 17.0.1 build three times:

| Site | Build Name | Config Err | Config Warn | Build Err | Build Warn | Test NR | Test Fail | Test Pass | Start Time |
|---|---|---|---|---|---|---|---|---|---|
| qscic158 | PR-3998-test-Trilinos_pullrequest_intel_17.0.1-1790 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | Dec 05, 2018 - 15:19 UTC |
| ascic143 | PR-3998-test-Trilinos_pullrequest_intel_17.0.1-1791 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | Dec 05, 2018 - 15:18 UTC |
| ascic158 | PR-3998-test-Trilinos_pullrequest_intel_17.0.1-1789 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | Dec 05, 2018 - 15:12 UTC |

Is that expected? Is this a defect in the Trilinos PR tester?

bartlettroscoe commented 5 years ago

CC: @fryeguy52

@trilinos/framework, more problems with the autotester failing, shown in https://github.com/trilinos/Trilinos/pull/4021. It failed both iterations it tried last night (after I took off the AT: WIP label).

In the first PR testing iteration shown in https://github.com/trilinos/Trilinos/pull/4021#issuecomment-446017319, the Trilinos_pullrequest_gcc_7.2.0 # 12 build crashed showing:

ERROR: Unable to find matching environment for job: Trilinos_pullrequest_gcc_7.2.0
Error code was: 42
Build step 'Execute shell' marked build as failure
Finished: FAILURE

Same deal for the 2nd PR testing iteration shown in https://github.com/trilinos/Trilinos/pull/4021#issuecomment-446051962 showing:

ERROR: Unable to find matching environment for job: Trilinos_pullrequest_gcc_7.2.0
Error code was: 42
Build step 'Execute shell' marked build as failure
Finished: FAILURE
bartlettroscoe commented 5 years ago

CC: @mhoemmen

@trilinos/framework, it looks like other PRs, like https://github.com/trilinos/Trilinos/pull/4024#issuecomment-446103995, are being impacted as well, with the Trilinos_pullrequest_gcc_7.2.0 build crashing and showing:

ERROR: Unable to find matching environment for job: Trilinos_pullrequest_gcc_7.2.0
Error code was: 42
Build step 'Execute shell' marked build as failure
Finished: FAILURE
bartlettroscoe commented 5 years ago

CC: @rppawlo

@trilinos/framework,

Help, it crashed again just now, as shown in https://github.com/trilinos/Trilinos/pull/4021#issuecomment-446272302, with:

ERROR: Unable to find matching environment for job: Trilinos_pullrequest_gcc_7.2.0
Error code was: 42
Build step 'Execute shell' marked build as failure
Finished: FAILURE

Same deal in https://github.com/trilinos/Trilinos/pull/4026#issuecomment-446232918 and https://github.com/trilinos/Trilinos/pull/4024#issuecomment-446223929.

Can you please remove the Trilinos_pullrequest_gcc_7.2.0 PR build until it can be fixed? No one is going to be able to merge a PR until this gets fixed (we are just burning up cycles on computers).

ZUUL42 commented 5 years ago

Have you pulled in the latest changes? We moved dev2master from GCC 7.3.0 to 7.2.0, and once that checked out we added 7.2.0 to the PR autotester. In doing so we added a couple of files and changed another. They are in dev & master, but I wonder if your build isn't seeing:

/cmake/std/PullRequestLinuxGCC7.2.0TestingSettings.cmake
/cmake/std/sems/PullRequestGCC7.2.0TestingEnv.sh

Or the most recent changes to:

/cmake/std/PullRequestLinuxDriver-Test.sh

If those are pulled in and it still doesn't build, I'll go ahead and remove the 7.2.0 build from the autotester until we can determine the solution.