jenkinsci / amazon-ecs-plugin

Amazon EC2 Container Service Plugin for Jenkins
https://plugins.jenkins.io/amazon-ecs
MIT License

ECS Agents Re-used unexpectedly #343

Status: Open. melbit-michaelw opened this issue 6 months ago.

melbit-michaelw commented 6 months ago

Jenkins and plugins versions report

Environment

```text
Jenkins: 2.426.3
OS: Linux - 4.14.336-178.554.amzn1.x86_64
Java: 17.0.9 - Eclipse Adoptium (OpenJDK 64-Bit Server VM)
---
Office-365-Connector:4.21.0
amazon-ecr:1.114.vfd22430621f5
amazon-ecs:1.49
ansicolor:1.0.4
antisamy-markup-formatter:162.v0e6ec0fcfcf6
apache-httpcomponents-client-4-api:4.5.14-208.v438351942757
audit-trail:361.v82cde86c784e
authentication-tokens:1.53.v1c90fd9191a_b_
aws-credentials:218.v1b_e9466ec5da_
aws-java-sdk-ec2:1.12.633-430.vf9a_e567a_244f
aws-java-sdk-ecr:1.12.633-430.vf9a_e567a_244f
aws-java-sdk-ecs:1.12.633-430.vf9a_e567a_244f
aws-java-sdk-efs:1.12.633-430.vf9a_e567a_244f
aws-java-sdk-minimal:1.12.633-430.vf9a_e567a_244f
blueocean:1.27.10
blueocean-bitbucket-pipeline:1.27.10
blueocean-commons:1.27.10
blueocean-config:1.27.10
blueocean-core-js:1.27.10
blueocean-dashboard:1.27.10
blueocean-display-url:2.4.2
blueocean-events:1.27.10
blueocean-executor-info:1.27.10
blueocean-git-pipeline:1.27.10
blueocean-github-pipeline:1.27.10
blueocean-i18n:1.27.10
blueocean-jira:1.27.10
blueocean-jwt:1.27.10
blueocean-personalization:1.27.10
blueocean-pipeline-api-impl:1.27.10
blueocean-pipeline-editor:1.27.10
blueocean-pipeline-scm-api:1.27.10
blueocean-rest:1.27.10
blueocean-rest-impl:1.27.10
blueocean-web:1.27.10
bootstrap5-api:5.3.2-3
bouncycastle-api:2.30.1.77-225.v26ea_c9455fd9
branch-api:2.1144.v1425d1c3d5a_7
build-timeout:1.32
caffeine-api:3.1.8-133.v17b_1ff2e0599
checks-api:2.0.2
cloudbees-bitbucket-branch-source:866.vdea_7dcd3008e
cloudbees-folder:6.858.v898218f3609d
command-launcher:100.v2f6722292ee8
commons-lang3-api:3.13.0-62.v7d18e55f51e2
commons-text-api:1.11.0-95.v22a_d30ee5d36
concurrent-step:1.0.0
conditional-buildstep:1.4.3
config-file-provider:968.ve1ca_eb_913f8c
configuration-as-code:1775.v810dc950b_514
copyartifact:722.v0662a_9b_e22a_c
credentials:1319.v7eb_51b_3a_c97b_
credentials-binding:657.v2b_19db_7d6e6d
display-url-api:2.200.vb_9327d658781
docker-build-publish:1.4.0
docker-commons:439.va_3cb_0a_6a_fb_29
docker-workflow:572.v950f58993843
durable-task:547.vd1ea_007d100c
echarts-api:5.4.3-2
email-ext:2.104
envinject:2.908.v66a_774b_31d93
envinject-api:1.199.v3ce31253ed13
favorite:2.208.v91d65b_7792a_c
font-awesome-api:6.5.1-2
git:5.2.1
git-client:4.6.0
github:1.38.0
github-api:1.318-461.v7a_c09c9fa_d63
github-branch-source:1772.va_69eda_d018d4
gson-api:2.10.1-15.v0d99f670e0a_7
handy-uri-templates-2-api:2.1.8-30.v7e777411b_148
hashicorp-vault-plugin:364.vf5d54b_3dc313
htmlpublisher:1.32
http_request:1.18
icon-shim:3.0.0
instance-identity:185.v303dc7c645f9
ionicons-api:56.v1b_1c8c49374e
jackson2-api:2.16.1-373.ve709c6871598
jakarta-activation-api:2.0.1-3
jakarta-mail-api:2.0.1-3
javax-activation-api:1.2.0-6
javax-mail-api:1.6.2-9
jaxb:2.3.9-1
jdk-tool:66.vd8fa_64ee91b_d
jenkins-design-language:1.27.10
jersey2-api:2.41-133.va_03323b_a_1396
jira:3.12
jjwt-api:0.11.5-77.v646c772fddb_0
job-import-plugin:3.6
jobConfigHistory:1229.v3039470161a_d
joda-time-api:2.12.6-21.vca_fd74418fb_7
jquery3-api:3.7.1-1
json-api:20231013-17.v1c97069404b_e
json-path-api:2.9.0-33.v2527142f2e1d
junit:1259.v65ffcef24a_88
mailer:463.vedf8358e006b_
matrix-auth:3.2.1
matrix-project:822.824.v14451b_c0fd42
metrics:4.2.21-449.v6960d7c54c69
mina-sshd-api-common:2.12.0-90.v9f7fb_9fa_3d3b_
mina-sshd-api-core:2.12.0-90.v9f7fb_9fa_3d3b_
monitoring:1.95.0
okhttp-api:4.11.0-172.vda_da_1feeb_c6e
pipeline-build-step:540.vb_e8849e1a_b_d8
pipeline-github-lib:42.v0739460cda_c4
pipeline-graph-analysis:202.va_d268e64deb_3
pipeline-groovy-lib:704.vc58b_8890a_384
pipeline-input-step:477.v339683a_8d55e
pipeline-milestone-step:111.v449306f708b_7
pipeline-model-api:2.2175.v76a_fff0a_2618
pipeline-model-definition:2.2175.v76a_fff0a_2618
pipeline-model-extensions:2.2175.v76a_fff0a_2618
pipeline-rest-api:2.34
pipeline-stage-step:305.ve96d0205c1c6
pipeline-stage-tags-metadata:2.2175.v76a_fff0a_2618
pipeline-stage-view:2.34
pipeline-utility-steps:2.16.1
plain-credentials:143.v1b_df8b_d3b_e48
plugin-util-api:3.8.0
prism-api:1.29.0-10
pubsub-light:1.18
rebuild:330.v645b_7df10e2a_
resource-disposer:0.23
run-condition:1.7
saml:4.464.vea_cb_75d7f5e0
scm-api:683.vb_16722fb_b_80b_
script-security:1313.v7a_6067dc7087
slack:684.v833089650554
snakeyaml-api:2.2-111.vc6598e30cc65
sse-gateway:1.26
ssh-agent:346.vda_a_c4f2c8e50
ssh-credentials:308.ve4497b_ccd8f4
sshd:3.303.vefc7119b_ec23
structs:337.v1b_04ea_4df7c8
timestamper:1.26
token-macro:400.v35420b_922dcb_
trilead-api:2.133.vfb_8a_7b_9c5dd1
uno-choice:2.8.1
variant:60.v7290fc0eb_b_cd
workflow-aggregator:596.v8c21c963d92d
workflow-api:1291.v51fd2a_625da_7
workflow-basic-steps:1042.ve7b_140c4a_e0c
workflow-cps:3853.vb_a_490d892963
workflow-durable-task-step:1322.v63864b_7a_e384
workflow-job:1385.vb_58b_86ea_fff1
workflow-multibranch:773.vc4fe1378f1d5
workflow-scm-step:415.v434365564324
workflow-step-api:657.v03b_e8115821b_
workflow-support:865.v43e78cc44e0d
ws-cleanup:0.45
```

What operating system are you using (both the controller and any agents involved in the problem)?

AWS ECS hosting both the Jenkins instance and various agents.

Reproduction steps

Sorry, I don't have a simple reproduction at this stage. We recently upgraded from Jenkins 2.346.3 with ECS plugin 1.48 to Jenkins 2.426.3 with ECS plugin 1.49.

We have a scripted pipeline that uses `parallel` to run multiple ECS nodes concurrently. For capacity reasons, semaphores limit it to 3 concurrent nodes.

Since the upgrade, our ECS containers are being reused across these parallel jobs, whereas previously each node ran a single job and then terminated.

I'm not sure whether it's relevant, but the agent containers were upgraded at the same time to a newer agent.jar (required by the remoting version in newer Jenkins releases).

Is there a configuration option I can set to ensure that our ECS nodes run only a single job and then terminate?

Expected Results

The ECS node runs a single job and then terminates.

Actual Results

The ECS node runs further jobs after the initial job completes.

Anything else?

No response

Are you interested in contributing a fix?

No response

melbit-michaelw commented 6 months ago

I've just tested our Jenkins instance with the ECS plugin downgraded to 1.48, and the behaviour does not occur. This suggests that a change in the ECS plugin between 1.48 and 1.49 caused it.

What I'm not sure about, and don't have a test case for, is whether the ECS tasks are reused only within a single pipeline or across other pipelines as well.

Either way, this breaks our use case. We use scripted pipelines and rely on each container running only for its specific node block. For example, our pipelines create a 'results' directory; when a container is reused, that directory already exists and the pipeline fails.

Is there a workaround to force each container to be used only once?

melbit-michaelw commented 6 months ago

Here's a reasonably minimal script that can be used to reproduce the issue:

```groovy
node('ecs-agent-name') {
  stage("First") {
    sh(script:"""mkdir results""")
  }
}

node('ecs-agent-name') {
  stage("Second") {
    sh(script:"""mkdir results""")
  }
}
```

The 'First' stage succeeds, and the 'Second' stage then fails because, with the unexpected container reuse, the results directory already exists.
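The failure in 'Second' comes down to plain `mkdir` refusing to create a directory that already exists. The minimal shell sketch below assumes both stages land in the same workspace directory, simulated here with a throwaway temporary directory (the `WORKSPACE` variable is a stand-in, not the real agent path). It shows the first `mkdir results` succeeding, the second exiting non-zero, and `mkdir -p results` succeeding either way.

```shell
#!/bin/sh
# Hypothetical stand-in for a reused agent workspace: a temporary directory.
WORKSPACE=$(mktemp -d)
cd "$WORKSPACE" || exit 1

# Stage "First": results/ does not exist yet, so mkdir succeeds.
mkdir results
first=$?

# Stage "Second": in a reused workspace results/ already exists,
# so plain mkdir fails with a non-zero exit status.
mkdir results 2>/dev/null
second=$?

# mkdir -p succeeds whether or not the directory already exists.
mkdir -p results
idempotent=$?

echo "first=$first second=$second idempotent=$idempotent"

# Clean up the simulated workspace.
cd / && rm -rf "$WORKSPACE"
```

Making the step idempotent with `mkdir -p` only masks this particular symptom; it does not restore the one-job-per-container behaviour the pipeline relies on.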

Stericson commented 6 months ago

@melbit-michaelw, I don't think this is a bug; rather, it's a side effect of the fix for a different bug.

https://github.com/jenkinsci/amazon-ecs-plugin/issues/326

How many executors per agent have you configured?

melbit-michaelw commented 5 months ago

Hi @Stericson,

Sorry for the delayed response.

We don't set it explicitly anywhere (we configure Jenkins with config-as-code), so I believe it defaults to 1.

I ran the Script Console test code from the issue you linked, and it reported 1 executor.