pmodels / mpich

Official MPICH Repository
http://www.mpich.org
Other
534 stars 280 forks source link

test: Invalid test results files when testing is terminated. #6105

Closed tonyzinger closed 2 years ago

tonyzinger commented 2 years ago

I have a Jenkins job that is executing the mpich tests. When the Jenkins job hits its time limit and kills the mpich testing jobs, the summary.xml and summary.junit.xml files are invalid and incomplete.

The summary.xml file begins with :

<?xml version='1.0' ?>
<?xml-stylesheet href="TestResults.xsl" type="text/xsl" ?>
<MPITESTRESULTS>
<DATE>2022-08-03-21-03</DATE>
<MPITEST>
<NAME>attrt</NAME>
<ARGS></ARGS>
<NP>2</NP>
<WORKDIR>./attr</WORKDIR>
<STATUS>pass</STATUS>
<TIME>0.325302839279175</TIME>
</MPITEST>

and ends with (depending on when it gets terminated):

</MPITEST>
<MPITEST>
<NAME>allred3</NAME>
<ARGS></ARGS>
<NP>10</NP>
<WORKDIR>./coll</WORKDIR>

The last entry is not completed, i.e. ends with a . And it does not have (even with a successful completion of the tests): </MPITESTRESULTS> For the summary.juinit.xml file (in a successful test run) it begins with:

<testsuites>
  <testsuite failures="350"
             errors="0"
             skipped="26"
             tests="2464"
             date="2022-07-30-12-45"
             name="summary_junit_xml">

and ends with:

  </testsuite>
</testsuites>

However, if the testing is aborted, i.e. terminated before it completes successfully, these lines are not in the file.

hzhou commented 2 years ago

When a Jenkins job hits its time limit, the script can't wrap the XML file since the script is killed by Jenkins. We only support complete XML report if the runtests script runs to completion.

Maybe you can add a post processing job step to cleanup/wrap the partial summary.xml when the job is killed by timeout?

hzhou commented 2 years ago

@tonyzinger Did you find the above answers helpful?

tonyzinger commented 2 years ago

@hzhou No because I do not know what XML entry is being written or partially written. So I do not know what I need to add to close out the last entry. Also, I do not know what the test results information to put into the <testsuite ... entry.

Would it be possible to add a try block round the test execution script to catch various errors, i.e. timed out? This is probably not related but I noticed that the last couple of test runs that most of the tests are timing out and the script exits with "Terminating test because of too many timeout failures" and the Junit xml file looks correct.

hzhou commented 2 years ago

The time-out is set by Jenkins,and the test script is killed by Jenkins, thus, there is nothing the test script can do once it is killed. Whatever the solution should come from the Jenkins side -- either increase the timeout setting or modify the job parameters so that the tests can finish within the timeout.

The mpich test script runs each test with some timeout setting as well. When the test fails with that timeout, the test script is still running, thus is able to write the report entry. This is separate from Jenkins time-out.

tonyzinger commented 2 years ago

The stage in the jenkins job that I am executing the Mpich tests in has a timeout value of 15 hours. This was a sufficient amount of time in most test sessions even with many Mpich tests timing out. For example, "350 tests failed out of 2464 (total runtime: 813 min 46 sec)". But more often than not this is not enough time. Since I do not understand the Mpich tests it is very difficult to determine what is causing a significant number of test cases to hang until they timeout. That is why I was trying to process the xml file to understand better why the tests are failing and to have a history of passing and failing test cases.

hzhou commented 2 years ago

I see. Could you try add an additional Jenkins job step with following shell code --

if ! grep '</testsuites>' test/mpi/summary.junit.xml ; then
    echo '</testsuite></testsuites>' >> test/mpi/summary.junit.xml
fi

Hopefully, when Jenkins kills the job, it is in the middle of running a test rather than in the middle of writing summary.junit.xml. If your summary.junit.xml is in a different relative path, adjust accordingly.

tonyzinger commented 2 years ago

If the test run is terminated, then the , , and entries are not in the xml file. So all 4 of these lines would need to be added to the xml file. 2 of them at the beginning of the xml file and 2 of them at the end of the xml file.

hzhou commented 2 years ago

I see. Maybe try --

if ! grep '</testsuites>' test/mpi/summary.junit.xml ; then
    mv test/mpi/summary.junit.xml incomplete.xml
    echo '<testsuites><testuite>'  > test/mpi/summary.junit.xml
    cat incomplete.xml >> test/mpi/summary.junit.xml
    echo '</testsuite></testsuites>' >> test/mpi/summary.junit.xml
fi

Hopefully Jenkins will be able to parse the XML and report something.

Usually when large amount to tests timeout, something severe is wrong and you should be able to catch that with a subset of tests. How about create a separate Jenkins tests where you run a customized testlist, for example

echo 'attr' > test/mpi/testlist

before make testing.

You should only run the whole test suite after the subset tests are passed.

tonyzinger commented 2 years ago

When I try to use the summary.junit.xml file to display the test results in a Jenkins job using xinit, it returns that the xml file is invalid and does not display any test results. I have made a couple of changes to the Junit XML output file to get it to display correctly.

These changes are from the https://llg.cubic.org/docs/junit/ documenting the "JUnit XML reporting file format for Jenkins".

The changes are:

diff --git a/test/mpi/runtests b/test/mpi/runtests
index b700285bf..7a9443db5 100755
--- a/test/mpi/runtests
+++ b/test/mpi/runtests
@@ -345,14 +345,15 @@ if ($junitoutput) {
     # Note: the field "errors" is not used now, but reserved for future uses.
     open my $JUNITIN,  '<',  $g_opt{junitfile}      or die "Can't read old file: $!";
     open my $JUNITOUTNEW, '>', "$g_opt{junitfile}.new" or die "Can't write new file: $!";
-    my $date = `date "+%Y-%m-%d-%H-%M"`;
+    my $date = `date "+%Y-%m-%dT%H:%M:%S"`;
     $date =~ s/\r?\n//;
+    print $JUNITOUTNEW "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
     print $JUNITOUTNEW "<testsuites>\n";
     print $JUNITOUTNEW "  <testsuite failures=\"$g_err_count\"\n";
     print $JUNITOUTNEW "             errors=\"0\"\n";
     print $JUNITOUTNEW "             skipped=\"$g_skip_count\"\n";
     print $JUNITOUTNEW "             tests=\"$g_total_run\"\n";
-    print $JUNITOUTNEW "             date=\"${date}\"\n";
+    print $JUNITOUTNEW "             timestamp=\"${date}\"\n";
     print $JUNITOUTNEW "             name=\"summary_junit_xml\">\n";
     while( <$JUNITIN> ) {
         print $JUNITOUTNEW $_;

Would it be possible to make these changes? Thanks.

hzhou commented 2 years ago

Would it be possible to make these changes?

Sure. I can make a PR with these changes. You are also welcome to contribute the PR yourself. For new authors, we just require signing contribution agreement

hzhou commented 2 years ago

On a second thought, maybe it is possible for runtests to trap the kill signal and wrap XML report. I'll give it a try later.

hzhou commented 2 years ago

@tonyzinger Could you try https://github.com/pmodels/mpich/pull/6111?

tonyzinger commented 2 years ago

@hzhou I will give it a try. It will probably be Monday before I can verify it. Thanks.

tonyzinger commented 2 years ago

I copied you runtest script from your PR into my Jenkins directory for test execution. I set the Jenkins Mpich Testing Stage to 15 minutes. The Jenkins job reached the timeout value and terminated.

10:21:55  Cancelling nested steps due to timeout
10:21:55  Sending interrupt signal to process

The resulting summary.junit.xml file did NOT have the following lines at the beginning of the file:

<?xml version="1.0" encoding="UTF-8"?>
<testsuites>
  <testsuite failures="xx"
             errors="0"
             skipped="xx"
             tests="xxx"
             timestamp="2022-08-15T09:56:52"
             name="summary_junit_xml">

And it did NOT have the following lines at the end of the file:

  </testsuite>
</testsuites>

The file just contains the "testcase" information.

hzhou commented 2 years ago

I tested using Jenkins' "Run with timeout" job step, and the console log are -

...
Build timed out (after 45 minutes). Marking the build as failed.
Build was aborted
Archiving artifacts
Unexpected output in mt_improbe_sendrecv_huge: 
Unexpected output in mt_improbe_sendrecv_huge: ===================================================================================
Unexpected output in mt_improbe_sendrecv_huge: =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
Unexpected output in mt_improbe_sendrecv_huge: =   PID 28362 RUNNING AT pmrs-gpu-240-02.cels.anl.gov
Unexpected output in mt_improbe_sendrecv_huge: =   EXIT CODE: 15
Unexpected output in mt_improbe_sendrecv_huge: =   CLEANING UP REMAINING PROCESSES
Unexpected output in mt_improbe_sendrecv_huge: =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
Unexpected output in mt_improbe_sendrecv_huge: ===================================================================================
Unexpected output in mt_improbe_sendrecv_huge: YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
Unexpected output in mt_improbe_sendrecv_huge: This typically refers to a problem with your application.
Unexpected output in mt_improbe_sendrecv_huge: Please see the FAQ page for debugging suggestions
Program mt_improbe_sendrecv_huge exited without No Errors
Killed: 
make: *** [testing] Error 255
Recording test results
...

What timeout mechanism do you use? Do you see Killed: in the console log? That is something printed by runtests from the signal handler.

tonyzinger commented 2 years ago

No. I did not see any messages written by runtests. The only lines that I saw were:

10:21:00  Unexpected output in bcast: base: struct of MPI_INT:4+MPI_DOUBLE:8
10:21:00  Unexpected output in bcast: ]
10:21:55  Cancelling nested steps due to timeout
10:21:55  Sending interrupt signal to process
[Pipeline] }
[Pipeline] // script
Post stage
[Pipeline] echo
10:21:59  *** Test: currentBuild.currentResult: SUCCESS ***

The Last line is from the "Post Always" from the Mpich Testing Stage.

And I did verify that your version of the runtest was installed in the test/mpi directory. It contains the lines:

sub on_SIGINT {
    chdir $g_cwd or die;
    RunPostMsg();
    CloseOutputs();
    die "Killed: $!\n";
}

I also looked at the summary.xml file to see if the RunPostMsg() was called and it appears that it was not. The summary.xml file did not have a </MPITEST> at the end of the file. Also, should there be a </MPITESTRESULTS> at the end of this file.

hzhou commented 2 years ago

If the script can't catch the signal (e.g. SIGKILL), then there is nothing the script can do. Currently -- https://github.com/pmodels/mpich/blob/108b611d6fdef3f6268be42e93f4b0320f96a6d5/test/mpi/runtests#L1133-L1135

If you can find out which signal Jenkins' used to kill processes, then we can add it to the list.

tonyzinger commented 2 years ago

I have not been able to determine which signal is being sent on a timeout.

I missed your question about the timeout mechanism that I use. This is a simplified version of the Mpich testing stage:

 stage('Testing MPICH') {
      options {
                timeout (time: 8, unit: 'MINUTES')
       }
       steps {
            script {
                try {
                    sh "MPIR_CVAR_NUM_CLIQUES=2 make V=1 testing"
                } catch (x) {
                    echo "test failed"
                }
            }
       }
       post {
           alway {
               echo "post always processing"
           }
       }
  }
hzhou commented 2 years ago

I don't have much clue, but could you try the step without try-catch?

tonyzinger commented 2 years ago

As you suggested by removing the try-catch block, it works. When a timeout occurs the summary.junit.xml file has the correct entries at the beginning and the end of the file.

One other suggestion, when you print out the the testcase test number in the junit format, could you make this a 5 digit number with leading zeros. I ask this because when the test cases are displayed via the junit command it does a string sort. For example, it looks like: 1 - test 10 - test 11 - test 100 - test 2 - test 20 - test 200 - test 3 - test 4 -test

When tests: 1, 2, 3, 4, could be tests from one group, tests 10, 11 and 20 from a second group, test 100 from a third group and 200 from a fourth group. So this would allow all tests from a particular group to be in sequence together versus interspersed with all of the tests. The code would be something like:

        $str_total_run = sprintf('%05d', $g_total_run);
        print JUNITOUT "    <testcase name=\"$str_total_run - $workdir/$programname ${np} ${progArgs} ${progEnv}\" time=\"$runtime\">\n";

This would be needed whenever you would print out the g_total_run value in the junit output.

One last question, when I execute the Mpich tests the other day, I got the following test results:

  <testsuite failures="336"
             errors="0"
             skipped="26"
             tests="2464"

From my calculation this means 2464 - 336 = 2128 tests passed. Are these numbers accurate? What test results numbers should I be seeing? Thanks.

hzhou commented 2 years ago

One other suggestion, when you print out the the testcase test number in the junit format, could you make this a 5 digit number with leading zeros.

Sure. I'll add the patch.

From my calculation this means 2464 - 336 = 2128 tests passed.

2464 - 336 - 26 = 2102 tests passed

Are these numbers accurate? What test results numbers should I be seeing?

The numbers should be accurate. The numbers depend on the number of tests loaded from all testlist files, and it may depend configure checks and options. 2464 seems to be a reasonable number.

tonyzinger commented 2 years ago

The g_total_run variable is not incremented in the SkippedTest function. So I believe that my calculation is correct.

In the junit plugin display in Jenkins shows:

Duration  ↑ | Fail | (diff) | Skip | (diff) | Pass | (diff) | Total | (diff)
         336                 26                2128                2490
hzhou commented 2 years ago

Yes, you are right :) https://github.com/pmodels/mpich/blob/108b611d6fdef3f6268be42e93f4b0320f96a6d5/test/mpi/runtests#L120-L125