trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

Many automated tests are not appearing on the dashboard #541

Closed jhux2 closed 8 years ago

jhux2 commented 8 years ago

@tawiesn @aprokop @bartlettroscoe The MueLu nightly dashboard is no longer reporting tests from our two main test machines, geminga and enigma. The tests from those two machines last appeared on August 1st. I don't know if it's related, but there was an automated tribits snapshot into Trilinos on that same day. Below is an error reported from tribits on geminga that appears to be related to this issue.

Traceback (most recent call last):
  File "/home/aprokop/code/trilinos-test/trilinos/cmake/ctest/drivers/geminga/../cron_driver.py", line 23, in <module>
    tdd_driver.run_driver(this_path, repo_path)
  File "/data/code/trilinos-test/trilinos/cmake/tribits/dashboard_driver/tdd_driver.py", line 271, in run_driver
    "CTEST_BINARY_DIRECTORY" : tddDashboardRootDir+"/TDD_BUILD",
  File "/data/code/trilinos-test/trilinos/cmake/tribits/dashboard_driver/tdd_driver.py", line 182, in invoke_ctest
    extraEnv = environment
  File "/data/code/trilinos-test/trilinos/cmake/tribits/dashboard_driver/../python_utils/GeneralScriptSupport.py", line 469, in echoRunSysCmnd
    print("  Appending environment:" + extraEnv + "\n")
TypeError: cannot concatenate 'str' and 'dict' objects

Ending nightly Trilinos development testing on geminga: Fri Aug  5 03:00:02 MDT 2016
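
For readers following the traceback: the last frame fails because `extraEnv` is a dict, and Python does not allow concatenating a str with a dict. The sketch below reproduces that failure in isolation and shows one possible remedy (converting the dict with `str()` first). The function names `describe_env_broken` and `describe_env_fixed` and the path value are hypothetical, chosen only for illustration; this is not the TriBITS source, and the fix actually committed may differ.

```python
# Hypothetical stand-ins for the failing expression in echoRunSysCmnd;
# this is an illustration, not the TriBITS source.

def describe_env_broken(extra_env):
    # Mirrors the failing line: str + dict raises TypeError.
    return "  Appending environment:" + extra_env + "\n"


def describe_env_fixed(extra_env):
    # One possible remedy: convert the dict to text before concatenating.
    return "  Appending environment:" + str(extra_env) + "\n"


# Illustrative value only; the real driver builds this dict itself.
env = {"CTEST_BINARY_DIRECTORY": "/tmp/TDD_BUILD"}

try:
    describe_env_broken(env)
    broken_raised = False
except TypeError:
    # Raised on both Python 2 and Python 3, with different message text.
    broken_raised = True

message = describe_env_fixed(env)
```

The exact wording of the `TypeError` message depends on the Python version; the message in the traceback above is the Python 2 form.
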
jhux2 commented 8 years ago

I forgot to mention that tribits fails in the same way on geminga and enigma.

bartlettroscoe commented 8 years ago

This is related to TriBITSPub/TriBITS#119

@wfspotz, any idea what the issue might be? Hopefully this is an easy fix.

jhux2 commented 8 years ago

@trilinos/framework Any word on when this might be fixed? As I said, geminga and enigma are the main MueLu test machines -- they exercise a lot of custom configure options that aren't tested elsewhere.

bartlettroscoe commented 8 years ago

@jhux2,

Oh boy, it looks like this took out all of the Nightly builds for Trilinos as shown here. I will look into getting this fixed ASAP.

How did no one notice this until just Friday?

The problem is that there are no automated tests for this Python script, so this defect slipped through.

bartlettroscoe commented 8 years ago

NOTE: This did not affect the CI build of Trilinos because that CI build does not use the tdd_driver.py script.

If the TDD system is going to continue to be maintained, we are going to need to figure out a way to add automated testing for it.

jhux2 commented 8 years ago

How did no one notice this until just Friday?

We only get emails when there are failing tests; I just happened to look at the dashboard on Friday.

aprokop commented 8 years ago

I noticed it a few days earlier, but assumed that the dashboard was misbehaving again. I guess the recent months' troubles with the dashboard taught us not to expect its presence day to day.

bartlettroscoe commented 8 years ago

I have added an automated test to expose the problem and have fixed it. I will snapshot the updated TriBITS into Trilinos and then report back here.
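
A regression test for this kind of defect could look roughly like the sketch below. It is an assumption-laden illustration, not the test that was added to TriBITS: `echo_env` is a hypothetical helper standing in for the printing step, and the dict contents are made up. The idea is simply to call the printing code with a dict and check the captured output.

```python
import contextlib
import io


def echo_env(extra_env=None):
    """Hypothetical helper: print the extra environment, if one is given."""
    if extra_env:
        # str() makes this safe for a dict argument.
        print("  Appending environment: " + str(extra_env) + "\n")


def run_echo_env_captured(extra_env):
    """Call echo_env and return whatever it wrote to stdout."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        echo_env(extra_env)
    return buffer.getvalue()


def test_echo_env_accepts_dict():
    # Passing a dict must not raise, and its contents must be printed.
    output = run_echo_env_captured({"CTEST_BINARY_DIRECTORY": "/tmp/TDD_BUILD"})
    assert "Appending environment" in output
    assert "CTEST_BINARY_DIRECTORY" in output


def test_echo_env_without_env_prints_nothing():
    # With no extra environment there is nothing to report.
    assert run_echo_env_captured(None) == ""


test_echo_env_accepts_dict()
test_echo_env_without_env_prints_nothing()
```

A test like this would fail with a `TypeError` against code that concatenates the dict directly, which is what makes it useful as a guard.
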

bartlettroscoe commented 8 years ago

The following commits that I just pushed should hopefully fix this problem:

26c2244 "Merge branch 'tribits_github_snapshot' into develop"
Author: Roscoe A. Bartlett <rabartl@sandia.gov>
Date:   Mon Aug 8 17:52:16 2016 -0400 (15 minutes ago)

841e66d "Automatic snapshot commit from tribits at 1205b3f"
Author: Roscoe A. Bartlett <rabartl@sandia.gov>
Date:   Mon Aug 8 17:50:46 2016 -0400 (17 minutes ago)

M       cmake/tribits/python_utils/GeneralScriptSupport.py

Someone needs to be "on call" at any point in time and keep an eye on the Trilinos CDash site on a daily basis to make sure things look reasonable. This is what we do with CASL VERA with an "on call" team on a rotating basis (1 week at a time).

jhux2 commented 8 years ago

@bartlettroscoe Thanks for working this issue so quickly! Hopefully this fixes the problem.

bartlettroscoe commented 8 years ago

It looks like some "Nightly" results are showing up again, as shown here:

However, I fear that someone may have to manually pull the updated outer Trilinos repo that is used by the TDD system for the various builds or the broken tdd_driver.py script may not get updated. I am not sure but I fear that might be the case.

tawiesn commented 8 years ago

Manually pulled the Trilinos repository on enigma. Let's see whether it helps.

tawiesn commented 8 years ago

Started the nightly tests on my machine and they seem to run. We should see some test results from enigma soon...

bartlettroscoe commented 8 years ago

Looks like my fears may be unfounded. For example, we can see builds coming in from muir today here:

but not for the last few days as shown here:

How this update occurred is not clear to me. If Jenkins is driving the outer build and is pulling from Trilinos, then it would not be affected by a broken tdd_driver.py script.

We will wait and see what happens tomorrow.

tawiesn commented 8 years ago

Someone needs to be "on call" at any point in time and keep an eye on the Trilinos CDash site on a daily basis to make sure things look reasonable. This is what we do with CASL VERA with an "on call" team on a rotating basis (1 week at a time).

I think it is a good idea to have someone check the dashboard for problems every day. On the other hand, I would expect all of us to check the dashboard in the days after making a commit (just to make sure that nothing is broken), since the checkin script can never catch all problems. Furthermore, it's always good to know whether commits made it into the master branch (or not). If we all follow this principle we wouldn't even need a special "on call" team.

bartlettroscoe commented 8 years ago

If we all follow this principle we wouldn't even need a special "on call" team.

My own experience with this, over several projects and many years, is that if something is everyone's responsibility then it is no one's responsibility. The active "on call" member just has to keep watch, do the most basic triage, and pass the issue on to someone else. The CASL project has been an interesting experimental test bed for these processes.

tawiesn commented 8 years ago

Enigma is back again as a test machine (the tests show up on Monday, since I ran them early yesterday).

bartlettroscoe commented 8 years ago

It is hard to tell if all of the Nightly builds are back (because none of the builds are marked as "expected builds") but there are a whole bunch of builds showing up today on the main CDash site:

and builds are also showing up on the my.cdash.org site:

Therefore, I am going to assume this is fixed. I am putting this into review and asking Brent to review the set of builds showing up on CDash (since he should know that better than me).

bmpersc commented 8 years ago

It looks like all the builds I would expect are reporting.

bartlettroscoe commented 8 years ago

It looks like all the builds I would expect are reporting.

Thanks Brent!

Closing as complete.

jhux2 commented 8 years ago

Thank you, Ross.