sclorg / testing-farm-as-github-action

GitHub Action to execute tests by Testing Farm and update Pull Request status
MIT License

CI job passed and test script exit 0, but failed by timeout #186

Open yih-redhat opened 4 months ago

yih-redhat commented 4 months ago

Type of issue

None

Description

This bug is the same as https://github.com/sclorg/testing-farm-as-github-action/issues/166. That issue was closed and I cannot reopen it, so I created a new one to track this.

Description:

  1. I have a pull request https://github.com/yih-redhat/tmt-demo/pull/42 that runs all test cases in Testing Farm with v2.
  2. In this pull request, the sub job "Testing Farm - edge-9to9-9.4" behaves strangely: the test script passes and exits with 0, but the testing-farm plugin always reports a timeout error. Job link: https://artifacts.osci.redhat.com/testing-farm/befe8230-0cca-4417-816c-af13e20f564f/
  3. The sub job "Testing Farm - edge-8to9-9.4" has the same issue. In this job, I checked for leftover processes in the VM that might cause the timeout and printed them in the log. Job link: https://artifacts.osci.redhat.com/testing-farm/24f28bf8-1c7d-47d5-9779-63723ecfb222/
  4. All sub jobs in this pull request have the same configuration, but only "Testing Farm - edge-9to9-9.4" and "Testing Farm - edge-8to9-9.4" have this timeout issue. This suggests the cause is something in the test scripts rather than the configuration. The test scripts for these two sub jobs are https://github.com/yih-redhat/tmt-demo/blob/main/ostree-9-to-9.sh and https://github.com/yih-redhat/tmt-demo/blob/main/ostree-8-to-9.sh, but I cannot see anything special in them; they are just normal shell scripts, like the other test scripts in my repo.

Reproducer

No response

jamacku commented 4 months ago

I would suggest increasing the timeout. The test ran for 9000s ~ 150min:

Maximum test time '150m' exceeded. Adjust the test 'duration' attribute if necessary. https://tmt.readthedocs.io/en/stable/spec/tests.html#duration

yih-redhat commented 4 months ago

If you look at the log, you can see the test script actually passed and exited with 0, but it looks like some child process blocked the job from completing until the timeout. I have tried setting the timeout to a very long time and still got this issue. With the same timeout value, other sub jobs that take much longer than this script pass.
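To illustrate the kind of behaviour suspected here, below is a minimal Python sketch of how a script can exit 0 while a background child keeps the runner waiting. It is not taken from the failing jobs and does not confirm the cause; it assumes the runner reads the script's output through a pipe, and `run_and_time` and the two one-line scripts are hypothetical names made up for the demonstration.

```python
import subprocess
import time


def run_and_time(script: str) -> tuple[int, float]:
    """Run a shell snippet the way a pipe-reading runner might (assumption).

    Returns the script's exit code and how long the caller had to wait
    before the output pipe reached end-of-file.
    """
    start = time.monotonic()
    proc = subprocess.Popen(
        ["sh", "-c", script],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )
    # read() only returns once every process holding the write end of the
    # pipe has closed it, including background children of the script.
    proc.stdout.read()
    code = proc.wait()
    return code, time.monotonic() - start


if __name__ == "__main__":
    # Background child inherits stdout/stderr: script exits 0 at once,
    # but the reader stays blocked until the child finishes.
    print(run_and_time("sleep 2 & exit 0"))
    # Background child has its output redirected: the pipe closes as soon
    # as the script itself exits.
    print(run_and_time("sleep 2 >/dev/null 2>&1 & exit 0"))
```

If something like this is happening in the test scripts, redirecting the output of any long-running background process (or stopping it before the script exits) would be one thing to try.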

yih-redhat commented 3 months ago

@jamacku Could you please take a look at this bug? Because of it, I cannot get a green CI job and need to check manually whether it passed. This bug only happens on these two sub jobs: no matter how long I set the timeout, the script always exits 0 and then the job times out.

jamacku commented 1 month ago

Is this a duplicate of #209 ?

mcattamoredhat commented 1 month ago

I believe they are slightly different. The failed test mentioned above (edge-9to9-94) took 2h 35m 21s, whereas the #209 issue occurs after 6h (canceled request).

jamacku commented 1 month ago

I see. But I believe that this is not our bug. We are just requesting job runs on TF and not blocking anything.