jenkinsci / junit-plugin

Allows JUnit-format test results to be published
https://plugins.jenkins.io/junit
MIT License
76 stars 337 forks source link

Pipeline is stuck on `Recording test results` #653

Open hiddenshadow21 opened 1 month ago

hiddenshadow21 commented 1 month ago

Jenkins and plugins versions report

Environment ```text Jenkins: 2.462.2 OS: Windows Server 2022 - 10.0 Java: 21.0.3 - Oracle Corporation (Java HotSpot(TM) 64-Bit Server VM) --- ace-editor:1.1 ant:511.v0a_a_1a_334f41b_ antisamy-markup-formatter:162.v0e6ec0fcfcf6 apache-httpcomponents-client-4-api:4.5.14-208.v438351942757 asm-api:9.7-33.v4d23ef79fcc8 authentication-tokens:1.119.v50285141b_7e1 aws-credentials:231.v08a_59f17d742 aws-java-sdk-ec2:1.12.767-467.vb_e93f0c614b_6 aws-java-sdk-minimal:1.12.767-467.vb_e93f0c614b_6 bootstrap4-api:4.6.0-6 bootstrap5-api:5.3.3-1 bouncycastle-api:2.30.1.78.1-248.ve27176eb_46cb_ branch-api:2.1178.v969d9eb_c728e build-timeout:1.33 caffeine-api:3.1.8-133.v17b_1ff2e0599 checks-api:2.2.1 cloudbees-bitbucket-branch-source:888.v8e6d479a_1730 cloudbees-folder:6.955.v81e2a_35c08d3 command-launcher:115.vd8b_301cc15d0 commons-compress-api:1.26.1-2 commons-lang3-api:3.17.0-84.vb_b_938040b_078 commons-text-api:1.12.0-129.v99a_50df237f7 config-file-provider:978.v8e85886ffdc4 configurationslicing:548.ve92d48e66b_f8 copyartifact:749.vfb_dca_a_9b_6549 credentials:1380.va_435002fa_924 credentials-binding:681.vf91669a_32e45 data-tables-api:2.1.6-1 display-url-api:2.204.vf6fddd8a_8b_e9 docker-commons:443.v921729d5611d docker-workflow:580.vc0c340686b_54 dtkit-api:3.0.2 durable-task:577.v2a_8a_4b_7c0247 ec2:1688.v8c07e01d657f echarts-api:5.5.1-1 eddsa-api:0.3.0-4.v84c6f0f4969e email-ext:1844.v3ea_a_b_842374a_ emailext-template:1.5 envinject:2.919.v009a_a_1067cd0 envinject-api:1.199.v3ce31253ed13 extended-read-permission:53.v6499940139e5 external-monitor-job:215.v2e88e894db_f8 favorite:2.221.v19ca_666b_62f5 file-operations:266.v9d4e1eb_235b_a_ font-awesome-api:6.6.0-2 ghprb:1.42.2 git:5.5.1 git-client:5.0.0 git-server:126.v0d945d8d2b_39 github:1.40.0 github-api:1.321-468.v6a_9f5f2d5a_7e github-branch-source:1797.v86fdb_4d57d43 gradle:2.13 greenballs:1.15.1 gson-api:2.11.0-41.v019fcf6125dc handlebars:3.0.8 handy-uri-templates-2-api:2.1.8-30.v7e777411b_148 htmlpublisher:1.36 http_request:1.19 instance-identity:201.vd2a_b_5a_468a_a_6 ionicons-api:74.v93d5eb_813d5f jackson2-api:2.17.0-379.v02de8ec9f64c jakarta-activation-api:2.1.3-1 jakarta-mail-api:2.1.3-1 javadoc:280.v050b_5c849f69 javax-activation-api:1.2.0-7 javax-mail-api:1.6.2-10 jaxb:2.3.9-1 jdk-tool:80.v8a_dee33ed6f0 jersey2-api:2.44-151.v6df377fff741 jira:3.13 jjwt-api:0.11.5-112.ve82dfb_224b_a_d jnr-posix-api:3.1.19-2 jobConfigHistory:1268.v75ce751da_911 joda-time-api:2.13.0-85.vb_64d1c2921f1 jquery:1.12.4-1 jquery-detached:1.2.1 jquery3-api:3.7.1-2 jsch:0.2.16-86.v42e010d9484b_ json-api:20240303-41.v94e11e6de726 json-path-api:2.9.0-58.v62e3e85b_a_655 junit:1302.va_b_878c32eb_b_5 ldap:725.v3cb_b_711b_1a_ef locale:519.v4e20f313cfa_f lockable-resources:1315.v4ea_8e5159ec8 mailer:488.v0c9639c1a_eb_3 mapdb-api:1.0.9-40.v58107308b_7a_7 matrix-auth:3.2.2 matrix-project:838.v4d7b_7b_f9b_d4b_ maven-plugin:3.23 mercurial:1260.vdfb_723cdcc81 metrics:4.2.21-451.vd51df8df52ec mina-sshd-api-common:2.13.2-125.v200281b_61d59 mina-sshd-api-core:2.13.2-125.v200281b_61d59 momentjs:1.1.1 monitoring:1.99.0 msbuild:1.35 mstest:1.0.5 node-iterator-api:55.v3b_77d4032326 nodejs:1.6.2 nodelabelparameter:1.12.0 nunit:485.ve8a_85357320d okhttp-api:4.11.0-172.vda_da_1feeb_c6e pam-auth:1.11 parameterized-scheduler:277.v61a_4b_a_49a_c5c performance:962.v95a_4913d332e perfpublisher:8.09 pipeline-build-step:540.vb_e8849e1a_b_d8 pipeline-github-lib:61.v629f2cc41d83 pipeline-graph-analysis:216.vfd8b_ece330ca_ pipeline-groovy-lib:730.ve57b_34648c63 pipeline-input-step:495.ve9c153f6067b_ pipeline-milestone-step:119.vdfdc43fc3b_9a_ pipeline-model-api:2.2214.vb_b_34b_2ea_9b_83 pipeline-model-definition:2.2214.vb_b_34b_2ea_9b_83 pipeline-model-extensions:2.2214.vb_b_34b_2ea_9b_83 pipeline-rest-api:2.34 pipeline-stage-step:312.v8cd10304c27a_ pipeline-stage-tags-metadata:2.2214.vb_b_34b_2ea_9b_83 pipeline-stage-view:2.34 pipeline-utility-steps:2.17.0 plain-credentials:183.va_de8f1dd5a_2b_ plugin-util-api:5.1.0 popper-api:1.16.1-3 popper2-api:2.11.6-5 powershell:2.1 pubsub-light:1.18 resource-disposer:0.24 role-strategy:743.v142ea_b_d5f1d3 scm-api:696.v778d637b_a_762 script-security:1362.v67dc1f0e1b_b_3 shelve-project-plugin:3.2 simple-queue:1.4.4 simple-theme-plugin:196.v96d9592f4efa_ slack:741.v00f9591c586d snakeyaml-api:2.3-123.v13484c65210a_ sse-gateway:1.27 ssh-agent:376.v8933585c69d3 ssh-credentials:343.v884f71d78167 ssh-slaves:2.973.v0fa_8c0dea_f9f sshd:3.330.vc866a_8389b_58 structs:338.v848422169819 subversion:1275.va_7b_014f3fc2c test-stability:2.4 thinBackup:2.1.1 timestamper:1.27 token-macro:400.v35420b_922dcb_ trilead-api:2.147.vb_73cc728a_32e uno-choice:2.8.3 variant:60.v7290fc0eb_b_cd windows-slaves:1.8.1 workflow-aggregator:600.vb_57cdd26fdd7 workflow-api:1336.vee415d95c521 workflow-basic-steps:1058.vcb_fc1e3a_21a_9 workflow-cps:3969.vdc9d3a_efcc6a_ workflow-cps-global-lib:612.v55f2f80781ef workflow-durable-task-step:1371.vb_7cec8f3b_95e workflow-job:1436.vfa_244484591f workflow-multibranch:795.ve0cb_1f45ca_9a_ workflow-scm-step:427.v4ca_6512e7df1 workflow-step-api:678.v3ee58b_469476 workflow-support:926.v9f4f9b_b_98c19 ws-cleanup:0.46 xunit:3.1.5 ```

What Operating System are you using (both controller, and any agents involved in the problem)?

Both on Windows Server 2022

Reproduction steps

  1. Setup jenkins controller with junit plugin nad global env vars:
  2. Setup jenkins agent on windows server 2022 with:
    • ~50GB free of disk space
    • latest .Net 8 sdk
    • git
    • trx2junit on agent:
      dotnet tool install -g trx2junit
    • label windows-agent
  3. Run pipeline on agent:

    
    def runTestSuite(String testSuiteName) {
    script {
        try {
            command = "dotnet test"
            command += " --configuration Release"
            command += " --no-build"
            command += " --blame-hang"
            command += " --blame-hang-dump-type ${env.BLAME_HANG_DUMP_TYPE}"
            command += " --blame-hang-timeout ${env.TEST_BLAME_HANG_TIMEOUT}"
            command += " --logger \"trx;verbosity=detailed\""
            command += " --logger \"console;verbosity=detailed\""
            command += " --results-directory \"./\""
    
            bat command
        }
        catch (Exception e) {
            echo e.toString()
        }
        finally {
            bat 'trx2junit .\\*.trx'
            try {
                junit testResults: '*.xml', testDataPublishers: [[$class: 'StabilityTestDataPublisher']]
            }
            catch(Exception e){
                echo e.toString()
                currentBuild.result = "FAILURE"
            }
        }
    }
    }

pipeline { agent { label 'windows-agent' }

options {
    timeout(time: 4, unit: 'HOURS')
    timestamps()
}

environment {
    repoUrl = 'https://github.com/ravendb/ravendb.git'
    sourceCodeDir = 's'
}

stages {
    stage ('Clone') {
        steps {
            dir (env.sourceCodeDir) {
                script {
                    bat "git clone ${env.repoUrl} ."
                }
            }
        }
    }

    stage ('Tests') {
        steps {
            script {
                testsFailed = false
            }

            dir (env.sourceCodeDir) {
                script {
                    bat "git clean -f -x -d"
                    bat "dotnet restore"
                    bat "dotnet build -c Release --no-restore"

                    dir("test\\SlowTests") {
                        runTestSuite("SlowTests")
                    }
                }
            }
        }
    }
}

}


### Expected Results

Pipeline goes through and reports tests.

### Actual Results

Pipeline is stuck on `Recording Test Results`:

16:23:20 [Pipeline] bat 16:23:20
16:23:20 c:\Jenkins\workspace\PR_Tests\s\test\SlowTests>trx2junit .*.trx 16:23:20 trx2junit (c) gfoidl -- v2.0.4 16:23:20 https://github.com/gfoidl/trx2junit 16:23:20
16:23:20 Converting 1 trx file(s) to JUnit-xml... 16:23:20 Converting '.\test-user_test-host_2024-10-02_05_58_22.trx' to '.\test-user_test-host_2024-10-02_05_58_22.xml' 16:23:23 done in 2.3058151 seconds. bye. 16:23:23 [Pipeline] junit 16:23:23 Recording test results 18:38:14 Cancelling nested steps due to timeout



### Anything else?

Attaching problematic xml file.
Just a side note - disk space is reclaimed after `dotnet test` run is finished.

[test-user_test-host_2024-10-02_05_58_22.zip](https://github.com/user-attachments/files/17256876/test-user_test-host_2024-10-02_05_58_22.zip)

### Are you interested in contributing a fix?

_No response_
timja commented 1 month ago

Strange the test file is tiny and it works fine for me with no delay (on Mac).

Are you able to get a few thread dumps while this is ongoing? Both on the controller and the agent its running on please? https://www.jenkins.io/doc/book/troubleshooting/obtaining-a-thread-dump/

hiddenshadow21 commented 1 month ago

The issue just happened again on 3 windows agents, attaching thread dumps from controller and agents.

jenkins-thread-dumps.zip

mdealer commented 1 month ago

I got curious, so on my side I tried this

def c = { target ->
    node('WINDOWS') {
        def r = junit(keepTestNames: true, testResults: "test-user_test-host_2024-10-02_05_58_22.xml", testDataPublishers: [attachments(), [$class: 'StabilityTestDataPublisher']])
        echo "${r.totalCount} ${r.failCount}"
    }
}
def p1c = { c('p1') }
def p2c = { c('p2') }
def m = [ 'p1': p1c, 'p2': p2c ]
parallel m

And got this in one of the builds

java.util.ConcurrentModificationException
    at java.base/java.util.TreeMap$PrivateEntryIterator.nextEntry(TreeMap.java:1486)
    at java.base/java.util.TreeMap$ValueIterator.next(TreeMap.java:1531)
    at java.base/java.util.AbstractCollection.addAll(AbstractCollection.java:335)
    at java.base/java.util.HashSet.<init>(HashSet.java:121)
    at hudson.tasks.test.TestObject.uniquifyName(TestObject.java:395)
    at hudson.tasks.junit.PackageResult.getSafeName(PackageResult.java:87)
    at hudson.tasks.test.TestObject.getId(TestObject.java:87)
    at hudson.tasks.test.TestObject.getId(TestObject.java:91)
    at hudson.tasks.test.TestObject.getId(TestObject.java:91)
    at de.esailors.jenkins.teststability.StabilityTestData.getTestAction(StabilityTestData.java:61)
    at hudson.tasks.junit.TestResultAction.getActions(TestResultAction.java:378)
    at hudson.tasks.test.TestObject.getTestActions(TestObject.java:248)
    at hudson.tasks.test.TestObject.getTestAction(TestObject.java:261)
    at de.esailors.jenkins.teststability.StabilityTestDataPublisher.getPreviousHistory(StabilityTestDataPublisher.java:133)
    at de.esailors.jenkins.teststability.StabilityTestDataPublisher.contributeTestData(StabilityTestDataPublisher.java:82)
    at hudson.tasks.junit.JUnitResultArchiver.parseAndSummarize(JUnitResultArchiver.java:325)
    at hudson.tasks.junit.pipeline.JUnitResultsStepExecution.run(JUnitResultsStepExecution.java:62)
    at hudson.tasks.junit.pipeline.JUnitResultsStepExecution.run(JUnitResultsStepExecution.java:27)
    at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution.lambda$start$0(SynchronousNonBlockingStepExecution.java:47)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)

I checked the code and the existing operations around accessing the history while populating it are unsafe but also very slow the way StabilityTestDataPublisher is doing it. I then ran even more parallel parallel builds and they do appear stuck eventually, although they do finish after a long time.

Try adding a lock('result-recording-lock') { timeout(15) { ... } } around the junit step. The bottleneck is around the Map UNIQUIFIED_NAMES, which is synchronized anyway, being my initial assumption after a quick glance-over.

hiddenshadow21 commented 1 month ago

Thanks for the analysis. I will try with lock and timeout there. Hope it will resolve this issue.