Closed Max-Serra closed 8 years ago
@Max-Serra I think we can publish v1.2.0
this weekend.
All major issues in this release have been fixed.
Let me know if anything unexpected shows up on your side; for the time being, the branch for the relevant bug fixes is release/1.2.0.
Hi Ace,
As anticipated in the first comment in this thread, I analyzed in depth an issue seen during this test phase. It was present even in the previous commit, but I updated the plugin version to the HEAD of the release/1.2.0 branch.
Description of the issue: using my working Execution Plan in the first picture, configured according to the new non-breaking features, when the whole "Test Suite" node hierarchy is enabled, four serial nodes are executed in parallel on a Jenkins slave. During the execution one of the leaves fails, and this aborts all the other jobs still running, even though they were running under a parent node set as parallel non-breaking.
To make the issue easy for you to reproduce, I created a simpler plan as follows:
The plan looks like the screenshot below:
Result with plugin version 1.2.0-SNAPSHOT:
As shown in the picture, the job step4 fails after 2 minutes and breaks the execution of the others still running under the other sequences. This is wrong, since the parent node is set as parallel non-breaking.
Result with plugin version 1.1.1 (of course, in this case only parallel non-breaking and serial breaking exist):
Now, even though step4 still fails, the execution of the other jobs is not broken, which is correct.
Below I attached the jobs to reproduce the issue in a Linux environment. The coordinator job triggering the others is Breaking_Non_Breaking, valid only for version 1.2.0-SNAPSHOT (to test under 1.1.1, I recreated the same coordinator job from scratch in another Jenkins instance, reusing the atomic jobs previously created). I hope this works in your environment as well.
@Max-Serra Thanks for raising this.
Silly me, I took it for granted that a failing atomic job only needs to check its direct parent's breaking option to decide the whole build's status...
I will look into it soon. Relevant code, if you are interested.
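Roughly, the fix should be to walk the whole ancestor chain instead of checking only the direct parent. Here is a throwaway sketch of the idea (hypothetical Python, not the plugin's actual Java code; the function name and representation are mine):

```python
# Hypothetical sketch: decide the coordinator build result by walking
# the failed leaf's ancestor chain, not just its direct parent.
# ancestors_leaf_first: breaking flags from the failed leaf's parent
# up to the root node.

def coordinator_result(ancestors_leaf_first):
    """Return 'FAILURE' if the failure propagates all the way to the
    root, 'UNSTABLE' if some non-breaking ancestor absorbs it."""
    for breaking in ancestors_leaf_first:
        if not breaking:
            # A non-breaking ancestor absorbs the failure: the build
            # is marked unstable but not failed.
            return "UNSTABLE"
    return "FAILURE"

# 121_L_Failure under 12_S_non_breaking > 1_S_breaking > Root_P_non_breaking:
print(coordinator_result([False, True, False]))  # UNSTABLE
# 321_L_Failure under 32_P_breaking > 3_S_breaking > Root_P_breaking:
print(coordinator_result([True, True, True]))    # FAILURE
```

This matches the expected results quoted later in the thread: unstable when the root is parallel non-breaking, failure when the whole chain is breaking.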
@Max-Serra
I set up my test case as below:
Root_P_non_breaking
|-- 1_S_breaking
| |-- 11_L_2s
| |__ 12_S_non_breaking
| | |-- 121_L_Failure
| | |__ 122_L_4s
| |__ 13_L_2s
|-- 2_S_breaking
| |-- 21_L_2s
| |__ 22_L_2s
|__ 3_S_breaking
|-- 31_L_2s
|-- 32_P_breaking
| |-- 321_L_Failure
| |__ 322_L_2s
|__ 33_L_2s
triggered ( AbstractProject.createExecutable() )
122_L_4s, 13_L_2s,
21_L_2s, 22_L_2s,
31_L_2s, 322_L_2s
not triggered:
33_L_2s
not aborted:
22_L_2s, 322_L_2s
coordinator build should be unstable
and the latest code on release/1.2.0
should be able to nail it
Let me know if anything unexpected in your scenarios.
Hi Ace,
I've seen you are still pushing code after your latest comment. Should I wait before testing the issue I discovered?
Hi,
Sorry for the confusion.
There should be no more code pushes unless you raise another issue.
Hi again,
Now the tree is working as expected, but the master coordinator job (Breaking_Non_Breaking) is no longer stopping: it runs indefinitely even after all the triggered jobs have completed. I had to stop it manually.
Started by user Max Serra
[EnvInject] - Loading node environment variables.
Building on master in workspace /jenkins/jobs/Breaking_Non_Breaking/workspace
Atomic Job ( step1 ) Triggered
Atomic Job ( step21 ) Triggered
Atomic Job ( step31 ) Triggered
Atomic Job ( step4 ) Triggered
Atomic Job: step1 # 5 Completed, Result: SUCCESS
Atomic Job ( step11 ) Triggered
Atomic Job ( step12 ) Triggered
Atomic Job ( step13 ) Triggered
Atomic Job: step31 # 6 Completed, Result: SUCCESS
Atomic Job ( step32 ) Triggered
Atomic Job: step21 # 6 Completed, Result: SUCCESS
Atomic Job ( step22 ) Triggered
Atomic Job: step4 # 6 Completed, Result: FAILURE
Atomic Job: step32 # 5 Completed, Result: SUCCESS
Atomic Job: step22 # 5 Completed, Result: SUCCESS
Atomic Job: step11 # 6 Completed, Result: SUCCESS
Atomic Job: step12 # 6 Completed, Result: SUCCESS
Atomic Job: step13 # 6 Completed, Result: SUCCESS
Unexpected Interruption: java.lang.InterruptedException: sleep interrupted
Build step 'Coordinator' changed build result to UNSTABLE
Build step 'Coordinator' marked build as failure
Started calculate disk usage of build
Finished Calculation of disk usage of build in 0 seconds
Started calculate disk usage of workspace
Finished Calculation of disk usage of workspace in 0 seconds
Finished: FAILURE
Hi, I should already have this problem fixed in https://github.com/jenkinsci/coordinator-plugin/commit/ef0568c850d801a9091c0e0315d690bc37d5c239
Please test with the latest code in release/1.2.0.
Hi again,
I packaged the hpi from commit f53f62ad0d3484dfaa40944e135cb25a1f5439f4, and I confirm that the master job still runs endlessly. Could you try running the reproducible plan I attached here?
Below is what I see (I forced the job step12 to fail as well) :
Note that I locally changed the pom to point to Jenkins 1.631, which is the version running in my production environment. Maybe it isn't relevant, maybe it is.
```xml
<artifactId>plugin</artifactId>
<version>1.631</version>
```
@Max-Serra I will hold off on upgrading the Jenkins version.
jenkins-core does its job of making sure that every backend function in any version stays compatible with earlier ones, except for UI changes or the breaking changes listed in the release notes.
In addition, this plugin is supposed to work with Jenkins since version 1.596.1.
If you find something wrong on a higher Jenkins version, I'd be happy to make a fix for you.
@Max-Serra Regarding the endless-run problem, I have added another test case to ensure the coordinator job ends normally.
Please test with the latest code; I do hope we can release 1.2.0 by the coming Monday :sunglasses:
Latest execution snapshot
Hi again,
Now the previous plan is working as expected and the endless-run issue is solved. Anyway, to test more thoroughly, I checked further configurations and slightly changed the Execution Plan. I switched the root to Parallel Breaking, and it isn't working as expected, because in this case the failure didn't break the other parallel executions:
@Max-Serra Does "didn't work as expected" mean the coordinator job fell into an endless loop again?
Otherwise, it's working as designed.
Let's abstract here
Root_P_breaking
|--S_1_breaking
| |-- ...
|--S_2_breaking
| |-- ...
|--S_3_breaking
| |-- ...
|__S_4_breaking
|-- ...
With the above configuration, a failing atomic job under some S_X_breaking will stop the jobs next to it (executed after it) in the same branch, but not those in different branches S_Y_breaking.
That's the pattern of Parallel or Concurrency, don't you think?
That said, I've already added some logic to cater for the scenario below:
Root_P_breaking
|--S_1_breaking
| |-- ... (1 or 2 jobs)
|--S_2_breaking
| |-- ... (100 jobs)
|--S_3_breaking
| |-- ...
|__S_4_breaking
|-- ...
Let's say that S_1_breaking fails on its second atomic job, and at that moment S_2_breaking has already finished its first 50 atomic jobs with 50 more to go. Those remaining 50 (or 48, 49, not exactly 50) jobs will not be triggered.
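The scenario above can be modeled as the failure climbing the ancestor chain as long as each ancestor is breaking, with pending (not-yet-triggered) jobs under any ancestor it reaches being skipped. A rough sketch under that assumption (the node names and helper are illustrative, not the plugin's API):

```python
# Rough sketch: a failure propagates upward while ancestors are
# breaking; pending jobs under any ancestor the failure reaches are
# skipped. Node names here are illustrative only.

def skipped_on_failure(parents, breaking, failed):
    """parents: child -> parent map; breaking: node -> bool;
    failed: name of the failed leaf.
    Returns the set of ancestors whose pending children get skipped."""
    reached = set()
    node = parents.get(failed)
    while node is not None and breaking[node]:
        reached.add(node)          # pending jobs under this node are skipped
        node = parents.get(node)   # keep climbing while ancestors break
    return reached

parents = {"S_1": "Root", "S_2": "Root", "job_2": "S_1"}
breaking = {"Root": True, "S_1": True, "S_2": True}
# S_1's second job fails; Root is breaking too, so S_2's remaining
# jobs (e.g. the 50 still to go) will not be triggered either.
print(skipped_on_failure(parents, breaking, "job_2"))
```

With a non-breaking root the climb stops at S_1, which is why, in the earlier test case, only the failing branch's pending jobs were skipped.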
...let's go back to one of the previous results:
If your statement is correct, it means the behavior above is wrong. Originally, Parallel Breaking meant that when one of the nested jobs failed, those still running would be aborted and those not yet started would not be executed at all.
Now it seems that only the jobs not yet started are affected.
On my side this isn't a problem, but it appears to be a new behavior: in the scenario above, the jobs will no longer be aborted.
What matters is that it is consistent across all branch structures. I ran the plan in the picture above again, and indeed jobs that have already started are no longer aborted.
Regarding the endless-run issue, I confirm it has been solved in the latest code.
@Max-Serra
I hope the two test cases below make the underlying abortion logic clear.
In a word: jobs that are already running and share a breaking ancestor with the failed atomic job will be aborted.
Root_P_breaking
|-- 1_S_breaking
| |__ 12_S_non_breaking
| | |-- 121_L_Failure
| | |__ 122_L_8s
| |__ 13_L_2s
|-- 2_S_breaking
| |-- 21_L_2s
| |__ 22_L_8s
|__ 3_S_breaking
|-- 31_L_2s
|-- 32_P_breaking
| |-- 321_L_Failure
| |__ 322_L_2s
|__ 33_L_2s
triggered ( AbstractProject.createExecutable() )
122_L_8s,
21_L_2s, 22_L_8s,
31_L_2s, 322_L_2s
not triggered:
33_L_2s
aborted:
122_L_8s, 22_L_8s, 322_L_2s
coordinator build should be failure
Root_P_non_breaking
|-- 1_S_breaking
| |-- 11_L_2s
| |__ 12_S_non_breaking
| | |-- 121_L_Failure
| | |__ 122_L_4s
| |__ 13_L_2s
|-- 2_S_breaking
| |-- 21_L_2s
| |__ 22_L_2s
|__ 3_S_breaking
|-- 31_L_2s
|-- 32_P_breaking
| |-- 321_L_Failure
| |__ 322_L_2s
|__ 33_L_2s
triggered ( AbstractProject.createExecutable() )
122_L_4s, 13_L_2s,
21_L_2s, 22_L_2s,
31_L_2s, 322_L_2s
not triggered:
33_L_2s
not aborted:
22_L_2s
aborted:
322_L_2s
coordinator build should be unstable
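Put mechanically, the rule reads: a running job is aborted iff it shares at least one breaking ancestor with the failed atomic job. A small sketch under that reading (hypothetical code, not the plugin's), checked against both test cases above:

```python
# Sketch of the abortion rule from the two test cases above: a running
# job is aborted iff it shares at least one *breaking* ancestor with
# the failed atomic job. Ancestor paths are listed root-first.

def should_abort(failed_path, running_path, breaking):
    """failed_path / running_path: ancestor names, root first;
    breaking: node -> bool. Abort iff some common ancestor breaks."""
    shared = set(failed_path) & set(running_path)
    return any(breaking[node] for node in shared)

breaking = {
    "Root": True,            # Root_P_breaking (first test case)
    "2_S": True, "3_S": True, "32_P": True,
}
fail_321 = ["Root", "3_S", "32_P"]       # ancestors of 321_L_Failure
# 22_L_8s shares only Root with the failure; Root breaks, so abort:
print(should_abort(fail_321, ["Root", "2_S"], breaking))          # True

breaking["Root"] = False                 # Root_P_non_breaking (second case)
print(should_abort(fail_321, ["Root", "2_S"], breaking))          # False
# 322_L_2s shares the breaking 32_P with the failure, so still abort:
print(should_abort(fail_321, ["Root", "3_S", "32_P"], breaking))  # True
```

This reproduces the expected sets: with a breaking root everything still running is aborted, while with a non-breaking root only jobs under the failing subtree (like 322_L_2s) are.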
Anyway, please test with the latest code. Looking forward to your reply.
Very good Ace,
I tried tens of combinations, changing the timings and switching the patterns. The logic now works as expected, with the correct abortion or non-triggering behavior depending on the setup.
The latest code at commit 27f6b1f6f0831ea6225dbf44176e9c05b2604ab0 looks good to me.
Good to know :+1:
Hi Ace,
I used the plugin with the latest code, modifying the complex setup, which looks like the following (in the picture it is still set according to the previous 1.1.1 version):
I switched the node "Inventory Test Suite" to each of the available patterns, running only the branch within the red rectangle, and it works as expected:
With the latest code I realized you also solved further issues present in the 1.1.1 version.
The message "Server side error. Please checkout the server log" was really misleading from the user's perspective, suggesting an error in the execution even though the plugin was behaving normally in the case of a job failure. Now the following jobs are simply not executed. The way the plugin works and shows the results when a parallel breaking pattern is set is good as well:
The two jobs in the same branch have been correctly stopped as a consequence of the failure, and are correctly marked as aborted in the plan result.
I'm still testing the case in which all the nodes under "Test Suite" are launched together with parallel non-breaking, because I'm seeing incorrect behavior, which seems present in release 1.1.1 as well.
I need to investigate further, and if there is an issue, I will set up a plan you can reproduce, since I'm working with a complex master-slave configuration.
Anyway, so far we are headed in the right direction. Very good work.