Closed Max-Serra closed 8 years ago
@Max-Serra I think we can publish v1.2.0
this weekend.
All major issues in this release have been fixed.
Let me know if anything unexpected shows up on your side; for the time being, the branch for the relevant bug fixes is release/1.2.0.
Hi Ace,
As anticipated in the first comment in this thread, I analyzed in depth an issue seen during this test phase. It was present even in the previous commit, but I updated the plugin version to the HEAD of the release/1.2.0 branch.
Description of the issue: using my working Execution Plan in the first picture, configured according to the new non-breaking features, when the whole "Test Suite" node hierarchy is enabled, four serial nodes are executed in parallel on a Jenkins slave. During the execution one of the leaves fails, and this aborts all the other jobs still running, even though they were running under a parent node set as parallel non-breaking.
To make the issue easy for you to reproduce, I created a simpler plan as follows:
The plan looks like the screenshot below:
Result with plugin version 1.2.0-SNAPSHOT:
As shown in the picture, the job step4 fails after 2 minutes and breaks the execution of the others still running under the other sequences. This is wrong, since the parent node is set as parallel non-breaking.
Result with plugin version 1.1.1 (of course, in this case only parallel non-breaking and serial breaking exist):
Now, even though step4 still fails, the execution of the other jobs is not broken, which is correct.
Below I attached the jobs to reproduce the issue in a Linux environment. The coordinator job triggering the others is Breaking_Non_Breaking, valid only for version 1.2.0-SNAPSHOT (to test under 1.1.1, I recreated the same coordinator job from scratch in another Jenkins instance, reusing the atomic jobs previously created). I hope this works in your environment as well.
@Max-Serra Thanks for raising this.
Silly me, I took it for granted that a failing atomic job only needs to check its direct parent's breaking option to decide the whole build's status...
I will look into it soon. Relevant code, if you are interested.
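Roughly, the fix should be to walk the whole ancestor chain instead of checking only the direct parent. Here is a throwaway sketch of the idea (hypothetical Python, not the plugin's actual Java code; the function name and representation are mine):

```python
# Hypothetical sketch: decide the coordinator build result by walking
# the failed leaf's ancestor chain, not just its direct parent.
# ancestors_leaf_first: breaking flags from the failed leaf's parent
# up to the root node.

def coordinator_result(ancestors_leaf_first):
    """Return 'FAILURE' if the failure propagates all the way to the
    root, 'UNSTABLE' if some non-breaking ancestor absorbs it."""
    for breaking in ancestors_leaf_first:
        if not breaking:
            # A non-breaking ancestor absorbs the failure: the build
            # is marked unstable but not failed.
            return "UNSTABLE"
    return "FAILURE"

# 121_L_Failure under 12_S_non_breaking > 1_S_breaking > Root_P_non_breaking:
print(coordinator_result([False, True, False]))  # UNSTABLE
# 321_L_Failure under 32_P_breaking > 3_S_breaking > Root_P_breaking:
print(coordinator_result([True, True, True]))    # FAILURE
```

This matches the expected results quoted later in the thread: unstable when the root is parallel non-breaking, failure when the whole chain is breaking.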
@Max-Serra
I set up my test case as below:
Root_P_non_breaking
|-- 1_S_breaking
| |-- 11_L_2s
| |__ 12_S_non_breaking
| | |-- 121_L_Failure
| | |__ 122_L_4s
| |__ 13_L_2s
|-- 2_S_breaking
| |-- 21_L_2s
| |__ 22_L_2s
|__ 3_S_breaking
|-- 31_L_2s
|-- 32_P_breaking
| |-- 321_L_Failure
| |__ 322_L_2s
|__ 33_L_2s
triggered ( AbstractProject.createExecutable() )
122_L_4s, 13_L_2s,
21_L_2s, 22_L_2s,
31_L_2s, 322_L_2s
not triggered:
33_L_2s
not aborted:
22_L_2s, 322_L_2s
coordinator build should be unstable
and the latest code on release/1.2.0
should be able to nail it
Let me know if anything unexpected in your scenarios.
Hi Ace,
I've seen you are still pushing code after your latest comment. Should I wait before testing the issue I discovered?
Hi,
Sorry for the confusion.
There should be no more code pushes unless you raise another issue.
Hi again,
Now the tree is working as expected, but the master coordinator job (Breaking_Non_Breaking) is no longer stopping: it runs indefinitely even after all the triggered jobs have completed. I had to stop it manually.
Started by user Max Serra
[EnvInject] - Loading node environment variables.
Building on master in workspace /jenkins/jobs/Breaking_Non_Breaking/workspace
Atomic Job ( step1 ) Triggered
Atomic Job ( step21 ) Triggered
Atomic Job ( step31 ) Triggered
Atomic Job ( step4 ) Triggered
Atomic Job: step1 # 5 Completed, Result: SUCCESS
Atomic Job ( step11 ) Triggered
Atomic Job ( step12 ) Triggered
Atomic Job ( step13 ) Triggered
Atomic Job: step31 # 6 Completed, Result: SUCCESS
Atomic Job ( step32 ) Triggered
Atomic Job: step21 # 6 Completed, Result: SUCCESS
Atomic Job ( step22 ) Triggered
Atomic Job: step4 # 6 Completed, Result: FAILURE
Atomic Job: step32 # 5 Completed, Result: SUCCESS
Atomic Job: step22 # 5 Completed, Result: SUCCESS
Atomic Job: step11 # 6 Completed, Result: SUCCESS
Atomic Job: step12 # 6 Completed, Result: SUCCESS
Atomic Job: step13 # 6 Completed, Result: SUCCESS
Unexpected Interruption: java.lang.InterruptedException: sleep interrupted
Build step 'Coordinator' changed build result to UNSTABLE
Build step 'Coordinator' marked build as failure
Started calculate disk usage of build
Finished Calculation of disk usage of build in 0 seconds
Started calculate disk usage of workspace
Finished Calculation of disk usage of workspace in 0 seconds
Finished: FAILURE
Hi, I should already have this problem fixed in https://github.com/jenkinsci/coordinator-plugin/commit/ef0568c850d801a9091c0e0315d690bc37d5c239
Please test with the latest code in release/1.2.0.
Hi again,
I packaged the hpi from commit f53f62ad0d3484dfaa40944e135cb25a1f5439f4, and I confirm that the master job still runs endlessly. Could you try running the reproducible plan I attached here?
Below is what I see (I forced the job step12 to fail as well) :
Note that I locally changed the pom to point to Jenkins 1.631, which is the version running in my production environment. Maybe it isn't relevant, maybe it is.
```xml
<artifactId>plugin</artifactId>
<version>1.631</version>
```
@Max-Serra I will hold off on upgrading the Jenkins version.
jenkins-core does its job of making sure that every backend function in any version stays compatible with earlier ones, except for UI changes or the breaking changes listed in the release notes.
In addition, this plugin is supposed to work with Jenkins since version 1.596.1.
If you find something wrong on a higher Jenkins version, I'd be happy to make a fix for you.
@Max-Serra Regarding the endless-run problem, I have added another test case to ensure the coordinator job ends normally.
Please test with the latest code; I do hope we can release 1.2.0 by the coming Monday :sunglasses:
Latest execution snapshot
Hi again,
Now the previous plan is working as expected and the endless-run issue is solved. Anyway, to test more thoroughly, I checked further configurations and slightly changed the Execution Plan. I switched the root to Parallel Breaking, and it isn't working as expected, because in this case the failure didn't break the other parallel executions:
@Max-Serra Does "didn't work as expected" mean the coordinator job fell into an endless loop again?
Otherwise, it's working as designed.
Let's abstract here
Root_P_breaking
|--S_1_breaking
| |-- ...
|--S_2_breaking
| |-- ...
|--S_3_breaking
| |-- ...
|__S_4_breaking
|-- ...
With the above configuration, a failing atomic job under some S_X_breaking will stop the jobs next to it (executed after it) in the same branch, but not those in different branches S_Y_breaking.
That's the pattern of Parallel or Concurrency, don't you think?
That said, I've already added some logic to cater for the scenario below:
Root_P_breaking
|--S_1_breaking
| |-- ... (1 or 2 jobs)
|--S_2_breaking
| |-- ... (100 jobs)
|--S_3_breaking
| |-- ...
|__S_4_breaking
|-- ...
Let's say that S_1_breaking fails on its second atomic job, and at that moment S_2_breaking has already finished its first 50 atomic jobs with 50 more to go. Those remaining 50 (or 48, 49, not exactly 50) jobs will not be triggered.
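The scenario above can be modeled as the failure climbing the ancestor chain as long as each ancestor is breaking, with pending (not-yet-triggered) jobs under any ancestor it reaches being skipped. A rough sketch under that assumption (the node names and helper are illustrative, not the plugin's API):

```python
# Rough sketch: a failure propagates upward while ancestors are
# breaking; pending jobs under any ancestor the failure reaches are
# skipped. Node names here are illustrative only.

def skipped_on_failure(parents, breaking, failed):
    """parents: child -> parent map; breaking: node -> bool;
    failed: name of the failed leaf.
    Returns the set of ancestors whose pending children get skipped."""
    reached = set()
    node = parents.get(failed)
    while node is not None and breaking[node]:
        reached.add(node)          # pending jobs under this node are skipped
        node = parents.get(node)   # keep climbing while ancestors break
    return reached

parents = {"S_1": "Root", "S_2": "Root", "job_2": "S_1"}
breaking = {"Root": True, "S_1": True, "S_2": True}
# S_1's second job fails; Root is breaking too, so S_2's remaining
# jobs (e.g. the 50 still to go) will not be triggered either.
print(skipped_on_failure(parents, breaking, "job_2"))
```

With a non-breaking root the climb stops at S_1, which is why, in the earlier test case, only the failing branch's pending jobs were skipped.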
...let's go back to one of the previous results:
If your statement is correct, it means the behavior above is wrong. Originally, Parallel Breaking meant that when one of the nested jobs failed, those still running would be aborted and those not yet started would not be executed at all.
Now it seems that only the jobs not yet started are affected.
On my side this isn't a problem, but it appears to be a new behavior: in the scenario above, the jobs will no longer be aborted.
What matters is that it is consistent across all branch structures. I ran the plan in the picture above again, and indeed jobs that have already started are no longer aborted.
Regarding the endless-run issue, I confirm it has been solved in the latest code.
@Max-Serra
I hope the two test cases below make the underlying abortion logic clear.
In a word: jobs that are already running and share a breaking ancestor with the failed atomic job will be aborted.
Root_P_breaking
|-- 1_S_breaking
| |__ 12_S_non_breaking
| | |-- 121_L_Failure
| | |__ 122_L_8s
| |__ 13_L_2s
|-- 2_S_breaking
| |-- 21_L_2s
| |__ 22_L_8s
|__ 3_S_breaking
|-- 31_L_2s
|-- 32_P_breaking
| |-- 321_L_Failure
| |__ 322_L_2s
|__ 33_L_2s
triggered ( AbstractProject.createExecutable() )
122_L_8s,
21_L_2s, 22_L_8s,
31_L_2s, 322_L_2s
not triggered:
33_L_2s
aborted:
122_L_8s, 22_L_8s, 322_L_2s
coordinator build should be failure
Root_P_non_breaking
|-- 1_S_breaking
| |-- 11_L_2s
| |__ 12_S_non_breaking
| | |-- 121_L_Failure
| | |__ 122_L_4s
| |__ 13_L_2s
|-- 2_S_breaking
| |-- 21_L_2s
| |__ 22_L_2s
|__ 3_S_breaking
|-- 31_L_2s
|-- 32_P_breaking
| |-- 321_L_Failure
| |__ 322_L_2s
|__ 33_L_2s
triggered ( AbstractProject.createExecutable() )
122_L_4s, 13_L_2s,
21_L_2s, 22_L_2s,
31_L_2s, 322_L_2s
not triggered:
33_L_2s
not aborted:
22_L_2s
aborted:
322_L_2s
coordinator build should be unstable
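Put mechanically, the rule reads: a running job is aborted iff it shares at least one breaking ancestor with the failed atomic job. A small sketch under that reading (hypothetical code, not the plugin's), checked against both test cases above:

```python
# Sketch of the abortion rule from the two test cases above: a running
# job is aborted iff it shares at least one *breaking* ancestor with
# the failed atomic job. Ancestor paths are listed root-first.

def should_abort(failed_path, running_path, breaking):
    """failed_path / running_path: ancestor names, root first;
    breaking: node -> bool. Abort iff some common ancestor breaks."""
    shared = set(failed_path) & set(running_path)
    return any(breaking[node] for node in shared)

breaking = {
    "Root": True,            # Root_P_breaking (first test case)
    "2_S": True, "3_S": True, "32_P": True,
}
fail_321 = ["Root", "3_S", "32_P"]       # ancestors of 321_L_Failure
# 22_L_8s shares only Root with the failure; Root breaks, so abort:
print(should_abort(fail_321, ["Root", "2_S"], breaking))          # True

breaking["Root"] = False                 # Root_P_non_breaking (second case)
print(should_abort(fail_321, ["Root", "2_S"], breaking))          # False
# 322_L_2s shares the breaking 32_P with the failure, so still abort:
print(should_abort(fail_321, ["Root", "3_S", "32_P"], breaking))  # True
```

This reproduces the expected sets: with a breaking root everything still running is aborted, while with a non-breaking root only jobs under the failing subtree (like 322_L_2s) are.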
Anyway, please test with the latest code. Looking forward to your reply.
Very good Ace,
I tried tens of combinations, changing the timings and switching the patterns. The logic now works as expected, with the correct abortion or non-triggering behavior depending on the setup.
The latest code at commit 27f6b1f6f0831ea6225dbf44176e9c05b2604ab0 looks good to me.
Good to know :+1:
Hi Ace,
I used the plugin with the latest code, modifying the complex setup, which looks like the following (in the picture it is still set according to the previous 1.1.1 version):
I switched the node "Inventory Test Suite" to each of the available patterns, running only the branch within the red rectangle, and it works as expected:
With the latest code I realized you also solved further issues present in the 1.1.1 version.
The message "Server side error. Please checkout the server log" was really misleading from the user's perspective, suggesting an error in the execution even though the plugin was behaving normally in the case of a job failure. Now the following jobs are simply not executed. The way the plugin works and shows the results when a parallel breaking pattern is set is good as well:
The two jobs in the same branch have been correctly stopped as a consequence of the failure, and are correctly marked as aborted in the plan result.
I'm still testing the case in which all the nodes under "Test Suite" are launched together with parallel non-breaking, because I'm seeing incorrect behavior, which seems present in release 1.1.1 as well.
I need to investigate further, and if there is an issue, I will set up a plan you can reproduce, since I'm working with a complex master-slave configuration.
Anyway, so far we are headed in the right direction. Very good work.