jenkinsci / opentelemetry-plugin

Monitor and observe Jenkins with OpenTelemetry.
https://plugins.jenkins.io/opentelemetry/
Apache License 2.0

jenkins.pipeline.step.interruption.causes doesn't set for "skipped due to earlier failure(s)" #731

Closed ipleten closed 11 months ago

ipleten commented 1 year ago

Jenkins and plugins versions report

Environment

```text
Jenkins: 2.414.2
OS: Linux - 5.15.0-1037-aws
Java: 11.0.20.1 - Eclipse Adoptium (OpenJDK 64-Bit Server VM)
---
Parameterized-Remote-Trigger:3.1.6.3
analysis-model-api:11.6.0
ansicolor:1.0.2
ant:487.vd79d090d4ea_e
antisamy-markup-formatter:159.v25b_c67cd35fb_
apache-httpcomponents-client-4-api:4.5.14-208.v438351942757
artifactory:3.17.1
audit-trail:3.11
authentication-tokens:1.53.v1c90fd9191a_b_
aws-credentials:191.vcb_f183ce58b_9
aws-java-sdk:1.12.287-357.vf82d85a_6eefd
aws-java-sdk-cloudformation:1.12.287-357.vf82d85a_6eefd
aws-java-sdk-codebuild:1.12.287-357.vf82d85a_6eefd
aws-java-sdk-ec2:1.12.287-357.vf82d85a_6eefd
aws-java-sdk-ecr:1.12.287-357.vf82d85a_6eefd
aws-java-sdk-ecs:1.12.287-357.vf82d85a_6eefd
aws-java-sdk-efs:1.12.287-357.vf82d85a_6eefd
aws-java-sdk-elasticbeanstalk:1.12.287-357.vf82d85a_6eefd
aws-java-sdk-iam:1.12.287-357.vf82d85a_6eefd
aws-java-sdk-logs:1.12.287-357.vf82d85a_6eefd
aws-java-sdk-minimal:1.12.287-357.vf82d85a_6eefd
aws-java-sdk-sns:1.12.287-357.vf82d85a_6eefd
aws-java-sdk-sqs:1.12.287-357.vf82d85a_6eefd
aws-java-sdk-ssm:1.12.287-357.vf82d85a_6eefd
aws-lambda:0.5.10
basic-branch-build-strategies:1.3.2
blueocean:1.27.7
blueocean-autofavorite:1.2.5
blueocean-bitbucket-pipeline:1.27.7
blueocean-commons:1.27.7
blueocean-config:1.27.7
blueocean-core-js:1.27.7
blueocean-dashboard:1.27.7
blueocean-display-url:2.4.1
blueocean-events:1.27.7
blueocean-git-pipeline:1.27.7
blueocean-github-pipeline:1.27.7
blueocean-i18n:1.27.7
blueocean-jira:1.25.8
blueocean-jwt:1.27.7
blueocean-personalization:1.27.7
blueocean-pipeline-api-impl:1.27.7
blueocean-pipeline-editor:1.27.7
blueocean-pipeline-scm-api:1.27.7
blueocean-rest:1.27.7
blueocean-rest-impl:1.27.7
blueocean-web:1.27.7
bootstrap5-api:5.3.0-1
bouncycastle-api:2.29
branch-api:2.1122.v09cb_8ea_8a_724
build-name-setter:2.2.0
build-timeout:1.24
caffeine-api:3.1.8-133.v17b_1ff2e0599
checks-api:2.0.0
cloudbees-bitbucket-branch-source:832.v43175a_425ea_6
cloudbees-disk-usage-simple:178.v1a_4d2f6359a_8
cloudbees-folder:6.848.ve3b_fd7839a_81
command-launcher:100.v2f6722292ee8
commons-lang3-api:3.13.0-62.v7d18e55f51e2
commons-text-api:1.10.0-36.vc008c8fcda_7b_
compress-artifacts:1.10
conditional-buildstep:1.4.2
config-file-provider:959.vcff671a_4518b_
configuration-as-code:1647.ve39ca_b_829b_42
copyartifact:1.46.3
credentials:1271.v54b_1c2c6388a_
credentials-binding:636.v55f1275c7b_27
data-tables-api:1.13.3-4
display-url-api:2.3.9
docker-commons:439.va_3cb_0a_6a_fb_29
docker-workflow:521.v1a_a_dd2073b_2e
durable-task:523.va_a_22cf15d5e0
dynamic-axis:1.0.3
echarts-api:5.4.0-5
email-ext:2.102
embeddable-build-status:255.va_d2370ee8fde
extended-read-permission:3.2
favorite:2.4.3
file-parameters:316.va_83a_1221db_a_7
font-awesome-api:6.3.0-2
forensics-api:2.2.0
git:5.2.0
git-client:4.4.0
git-server:99.va_0826a_b_cdfa_d
github:1.37.1
github-api:1.314-431.v78d72a_3fe4c3
github-branch-source:1732.v3f1889a_c475b_
gitlab-api:5.2.0-86.v1ed41a_9cf486
gitlab-branch-source:640.v7101b_1c0def9
gitlab-oauth:1.18
gitlab-plugin:1.7.8
google-container-registry-auth:0.3
google-oauth-plugin:1.0.9
google-play-android-publisher:4.2
gradle:1.40
h2-api:1.4.199
handy-uri-templates-2-api:2.1.8-22.v77d5b_75e6953
hashicorp-vault-plugin:361.v44fea_4fc08d9
htmlpublisher:1.32
instance-identity:173.va_37c494ec4e5
ionicons-api:56.v1b_1c8c49374e
ivy:2.5
jackson2-api:2.15.2-350.v0c2f3f8fc595
jakarta-activation-api:2.0.1-3
jakarta-mail-api:2.0.1-3
javadoc:226.v71211feb_e7e9
javax-activation-api:1.2.0-6
javax-mail-api:1.6.2-9
jaxb:2.3.8-1
jdk-tool:66.vd8fa_64ee91b_d
jenkins-design-language:1.27.7
jersey2-api:2.39.1-1
jira:3.8
jira-steps:2.0.165.v8846cf59f3db
jjwt-api:0.11.5-77.v646c772fddb_0
job-dsl:1.82
job-import-plugin:3.6
jobConfigHistory:1229.v3039470161a_d
jquery3-api:3.7.0-1
jsch:0.2.8-65.v052c39de79b_2
junit:1240.vf9529b_881428
kubernetes:4029.v5712230ccb_f8
kubernetes-cli:1.12.1
kubernetes-client-api:6.8.1-224.vd388fca_4db_3b_
kubernetes-credentials:0.11
lockable-resources:2.18
mailer:463.vedf8358e006b_
matrix-auth:3.1.8
matrix-project:808.v5a_b_5f56d6966
maven-plugin:3.22
metrics:4.2.18-442.v02e107157925
mina-sshd-api-common:2.10.0-69.v28e3e36d18eb_
mina-sshd-api-core:2.10.0-69.v28e3e36d18eb_
monitoring:1.91.0
new-relic:1.0.4
nodejs:1.6.1
nodelabelparameter:1.11.0
oauth-credentials:0.5
okhttp-api:4.11.0-157.v6852a_a_fa_ec11
opentelemetry:2.17.0
parameterized-scheduler:1.1
parameterized-trigger:2.45
pipeline-aws:1.43
pipeline-build-step:505.v5f0844d8d126
pipeline-graph-analysis:202.va_d268e64deb_3
pipeline-groovy-lib:656.va_a_ceeb_6ffb_f7
pipeline-input-step:477.v339683a_8d55e
pipeline-maven:1342.vfc697b_789147
pipeline-maven-api:1342.vfc697b_789147
pipeline-milestone-step:111.v449306f708b_7
pipeline-model-api:2.2144.v077a_d1928a_40
pipeline-model-definition:2.2144.v077a_d1928a_40
pipeline-model-extensions:2.2144.v077a_d1928a_40
pipeline-rest-api:2.31
pipeline-stage-step:305.ve96d0205c1c6
pipeline-stage-tags-metadata:2.2144.v077a_d1928a_40
pipeline-stage-view:2.31
pipeline-utility-steps:2.16.0
plain-credentials:143.v1b_df8b_d3b_e48
plugin-util-api:3.3.0
prism-api:1.29.0-7
pubsub-light:1.17
rebuild:320.v5a_0933a_e7d61
resource-disposer:0.20
run-condition:1.5
saml:4.354.vdc8c005cda_34
sauce-ondemand:1.207
scm-api:676.v886669a_199a_a_
script-security:1275.v23895f409fb_d
selenium-axis:0.0.6
slack:625.va_eeb_b_168ffb_0
snakeyaml-api:1.33-95.va_b_a_e3e47b_fa_4
sonar:2.14
splunk-devops:1.10.1
splunk-devops-extend:1.10.1
sse-gateway:1.26
ssh-agent:295.v9ca_a_1c7cc3a_a_
ssh-credentials:308.ve4497b_ccd8f4
ssh-slaves:2.846.v1b_70190624f5
sshd:3.303.vefc7119b_ec23
structs:325.vcb_307d2a_2782
throttle-concurrents:2.9
timestamper:1.20
token-macro:384.vf35b_f26814ec
trilead-api:2.84.v72119de229b_7
uno-choice:2.7.2
variant:59.vf075fe829ccb
warnings-ng:10.4.0
workflow-aggregator:590.v6a_d052e5a_a_b_5
workflow-api:1281.vca_5fddb_3fceb_
workflow-basic-steps:1017.vb_45b_302f0cea_
workflow-cps:3774.v4a_d648d409ce
workflow-durable-task-step:1284.v4fcd365b_75b_e
workflow-job:1346.v180a_63f40267
workflow-multibranch:756.v891d88f2cd46
workflow-scm-step:415.v434365564324
workflow-step-api:639.v6eca_cd8c04a_a_
workflow-support:848.v5a_383b_d14921
ws-cleanup:0.43
```

What Operating System are you using (both controller, and any agents involved in the problem)?

official docker image

Reproduction steps

We use New Relic. The following query, run against all pipelines over the last 2 days, is supposed to return all unique causes:

from Span select uniques(jenkins.pipeline.step.interruption.causes) where jenkins.pipeline.step.interruption.causes is NOT NULL LIMIT 2000 since 2 days ago

As you can see, "skipped due to earlier failure(s)" is not among the returned values:

UserInterruption: Aborted by <REDACTED>
ExceededTimeout: Timeout has been exceeded
DownstreamFailureCause: <REDACTED> completed with status FAILURE (propagate: false to ignore)
DownstreamFailureCause: <REDACTED> completed with status FAILURE (propagate: false to ignore)
Rejection: Rejected by <REDACTED>
Rejection: Rejected by SYSTEM
DownstreamFailureCause: <REDACTED> completed with status FAILURE (propagate: false to ignore)
UserInterruption: Aborted by <REDACTED>
DownstreamFailureCause: <REDACTED> completed with status UNSTABLE (propagate: false to ignore)
UserInterruption: Aborted by <REDACTED>
DownstreamFailureCause: <REDACTED> completed with status UNSTABLE (propagate: false to ignore)
QueueTaskCancelled: Queue task was cancelled
Rejection: Rejected by  <REDACTED>
DownstreamFailureCause:  <REDACTED> completed with status FAILURE (propagate: false to ignore)

Expected Results

It should be easy to filter out skipped stages/steps by some field, for example a cause such as 'skipped due to earlier failure(s)'.

Actual Results

'skipped due to earlier failure(s)' is not in jenkins.pipeline.step.interruption.causes (multibranch pipelines)

Anything else?

Our goal is to find which steps actually failed. This is a bit hard because all subsequent stages are also marked as "ERROR". Example: given the nested stages "compile-code-and-docker" -> "build-java" -> "build-docker":

stage('compile-and-build-docker-service') {
    stages {
        stage('compile-code') {
            parallel {
                stage('build-java') { }
                stage('build-java-client') { }
            }
        }
        ...
    }
}
<other stages>

If build-java fails, both compile-and-build-docker-service and compile-code are marked as 'ERROR', as are all <other stages>. It is therefore really hard to write a query that shows exactly which stage caused the failure. Currently we use the 'earliest' failed stage, which returns compile-and-build-docker-service (the root of the nested stages); this is technically true but not precise. Another solution is to take the 'latest' failed stage and keep only stages with duration.ms > 1000, since all <other stages> run for less than 1 second.

It would be great if we could filter them out with jenkins.pipeline.step.interruption.causes not like '%skipped due to earlier failure(s)%'
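The duration-based workaround described above can be sketched in Python over exported span records. This is only an illustration: the dictionary keys `name`, `status`, and `start`, the sample durations, and the `deploy` stage are hypothetical, `duration.ms` is the field mentioned above, and the 1000 ms threshold is the heuristic from this comment, not something the plugin guarantees.

```python
from typing import Optional


def latest_real_failure(spans: list, min_duration_ms: float = 1000.0) -> Optional[dict]:
    """Return the latest ERROR stage span that ran longer than the threshold.

    Skipped stages are also marked ERROR but finish almost instantly,
    so the duration filter is used to drop them (a heuristic only).
    """
    candidates = [
        s for s in spans
        if s["status"] == "ERROR" and s["duration.ms"] > min_duration_ms
    ]
    # "start" is a hypothetical ordering key (e.g. the span start time).
    return max(candidates, key=lambda s: s["start"], default=None)


# Hypothetical sample data shaped like the nested example above.
spans = [
    {"name": "compile-and-build-docker-service", "status": "ERROR", "start": 0, "duration.ms": 95000},
    {"name": "compile-code", "status": "ERROR", "start": 1, "duration.ms": 94000},
    {"name": "build-java", "status": "ERROR", "start": 2, "duration.ms": 93000},
    {"name": "deploy", "status": "ERROR", "start": 100, "duration.ms": 12},  # skipped stage
]
result = latest_real_failure(spans)
```

The heuristic breaks down if a stage that really failed did so in under a second, which is why a dedicated attribute would be more reliable.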

[image]

cyrille-leclerc commented 1 year ago

Thanks @ipleten, I'm sorry but I don't fully understand the problem. I have not worked on this area for a while. Can you please share a simplified test case to reproduce the problem, and indicate which span attributes are contributing to it?

ipleten commented 1 year ago

Here is the pipeline:

pipeline {
    agent {
        kubernetes {
            yaml '''
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: shell
    image: ubuntu
    command:
    - sleep
    args:
    - infinity
'''
            defaultContainer 'shell'
        }
    }
    stages {
        stage('first') {
            steps {
                echo "passed"
            }
        }
        stage('second') {
            steps {
                echo "passed"
            }
        }
        stage('failed') {
            steps {
                sh "exit 1" // <- LET'S FAIL HERE
            }
        }
        stage('it_also_ failed_but_it_skipped') {
            steps {
                echo "passed"
            }
        }
        stage('it_also_failed_but_it_skipped_2') {
            steps {
                echo "passed"
            }
        }
    }
}

Here is the log output:

[Pipeline] {
[Pipeline] container
[Pipeline] {
[Pipeline] stage
[Pipeline] { (first)
[Pipeline] echo
passed
[Pipeline] }
[Pipeline] // stage
[Pipeline] stage
[Pipeline] { (second)
[Pipeline] echo
passed
[Pipeline] }
[Pipeline] // stage
[Pipeline] stage
[Pipeline] { (failed)
[Pipeline] sh
+ exit 1
[Pipeline] }
[Pipeline] // stage
[Pipeline] stage
[Pipeline] { (it_also_ failed_but_it_skipped)
Stage "it_also_ failed_but_it_skipped" skipped due to earlier failure(s)
[Pipeline] }
[Pipeline] // stage
[Pipeline] stage
[Pipeline] { (it_also_failed_but_it_skipped_2)
Stage "it_also_failed_but_it_skipped_2" skipped due to earlier failure(s)
[Pipeline] }
[Pipeline] // stage
[Pipeline] }
[Pipeline] // container
[Pipeline] }
[Pipeline] // node
[Pipeline] }
[Pipeline] // podTemplate
[Pipeline] End of Pipeline
ERROR: script returned exit code 1
Finished: FAILURE

Here is the data we have for the cause:

[image]

As you can see, it is impossible to tell which stage actually failed, because all three stages have the same status and description. (We currently filter them by the fact that duration.ms is less than 1 second.) We want to be able to exclude 'skipped' stages, or to have some way of finding the stage that really failed, even if it is nested. Populating jenkins.pipeline.step.interruption.causes with 'skipped due to earlier failure(s)' might help with this.

kuisathaverat commented 11 months ago

If you filter by event.outcome==failure and labels.jenkins_pipeline_step_type!=stage, you get the steps that failed. If you order them by labels.jenkins_pipeline_step_id, you will know which one failed first; if I am not wrong, there will be only one. To find the stage that failed you can follow a similar approach: filter by event.outcome==failure and labels.jenkins_pipeline_step_type==stage to get the stages that failed, then order by labels.jenkins_pipeline_step_id to see which one failed first (taking the minimum value works too).
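The filter-and-order approach described in this comment could look roughly like the following Python sketch, applied to span records exported as dictionaries. The field names are the ones quoted above; the sample records, the `name` key, and the assumption that the step id is a numeric string (compared as an integer) are illustrative guesses, not confirmed plugin behavior.

```python
from typing import Optional

OUTCOME = "event.outcome"
STEP_TYPE = "labels.jenkins_pipeline_step_type"
STEP_ID = "labels.jenkins_pipeline_step_id"


def first_failure(spans: list, want_stage: bool) -> Optional[dict]:
    """Return the failed span with the smallest step id.

    want_stage=True looks only at spans whose step type is "stage";
    want_stage=False looks at every other step type.
    """
    failed = [
        s for s in spans
        if s.get(OUTCOME) == "failure"
        and (s.get(STEP_TYPE) == "stage") == want_stage
    ]
    # Assumption: step ids are numeric strings, so compare them as integers.
    return min(failed, key=lambda s: int(s[STEP_ID]), default=None)


# Hypothetical records loosely modelled on the reproduction pipeline.
spans = [
    {"name": "echo", OUTCOME: "success", STEP_TYPE: "echo", STEP_ID: "9"},
    {"name": "failed", OUTCOME: "failure", STEP_TYPE: "stage", STEP_ID: "20"},
    {"name": "sh", OUTCOME: "failure", STEP_TYPE: "sh", STEP_ID: "22"},
    {"name": "it_also_failed_but_it_skipped_2", OUTCOME: "failure", STEP_TYPE: "stage", STEP_ID: "35"},
]

failed_step = first_failure(spans, want_stage=False)
failed_stage = first_failure(spans, want_stage=True)
```

With nested stages the minimum id would presumably select the outermost failing stage, so the failed step may be the more precise of the two results.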

ipleten commented 11 months ago

I am not working with Jenkins anymore, so I can't test this. The main goal of the ticket was to find the reason why a stage was not run or failed (including skipped due to conditions).

kuisathaverat commented 11 months ago

> The main goal of the ticket was to find the reason why a stage was not run or failed (including skipped due to conditions).

This is how Jenkins pipelines manage stages; it has nothing to do with the plugin. In the latest version of the plugin we have included a label, jenkins.pipeline.step.result, to mark the step that failed.