Testing farm job canceled before configured timeout

mcattamoredhat commented 1 month ago

Type of issue

Bug Report

Description

We have seen several testing-farm jobs in our downstream CI canceled after 6h 0m, although the default value of the timeout action input is 480m.

The error log doesn't provide any details, just the message Request was canceled on user request.

This is an example of the issue: https://github.com/virt-s1/rhel-edge/actions/runs/9963311207/job/27529080681. The edge-rhel-94-x86 job is using the default timeout value of 480m.

API request output is https://api.testing-farm.io/v0.1/requests/ee761663-f05f-43c2-84d9-673545b0f037

pipeline.log shows some tests failing:

| RHEL-9.4.0-Nightly:x86_64:/tmt/plans/edge-test/edge-x86-simplified-installer | ERROR | guest-setup.pre-artifact-installation | guest setup | https://artifacts.osci.redhat.com/testing-farm/ee761663-f05f-43c2-84d9-673545b0f037/guest-setup-e58d3804-fbd3-4214-aff4-7e12debd843d/guest-setup-output-pre-artifact-installation.txt |
| RHEL-9.4.0-Nightly:x86_64:/tmt/plans/edge-test/edge-x86-simplified-installer | ERROR | guest-setup.post-artifact-installation | guest setup | https://artifacts.osci.redhat.com/testing-farm/ee761663-f05f-43c2-84d9-673545b0f037/guest-setup-e58d3804-fbd3-4214-aff4-7e12debd843d/guest-setup-output-post-artifact-installation.txt |

Nevertheless, the guest-setup pre/post-artifact-installation logs don't show any failing playbook tasks.

Could you please provide some help?

Reproducer

No response

jamacku commented 1 month ago

This is very weird. @mcattamoredhat, could you please reproduce the issue with debug logging enabled?

And I agree the current log message could be better. I'll try to extend it with more information.

jamacku commented 1 month ago

So, this is a limitation of GitHub-hosted runners. From the GitHub docs:

Job execution time - Each job in a workflow can run for up to 6 hours of execution time. If a job reaches this limit, the job is terminated and fails to complete.

Also, see this Discussion: https://github.com/orgs/community/discussions/25700#discussioncomment-3248791
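
To make the mismatch concrete, here is a minimal sketch of the arithmetic. It only uses the two numbers from this thread (the 480m default timeout and the 6 hour limit on GitHub-hosted runners); the variable names are made up for illustration.

```python
# Values taken from this thread; the names are illustrative only.
ACTION_TIMEOUT_MINUTES = 480     # default `timeout` input of the action
RUNNER_LIMIT_MINUTES = 6 * 60    # job limit on GitHub-hosted runners (6 hours)

# Whichever limit is smaller ends the job first.
effective_limit = min(ACTION_TIMEOUT_MINUTES, RUNNER_LIMIT_MINUTES)

# Portion of the configured timeout that the job can never reach.
unreachable_minutes = ACTION_TIMEOUT_MINUTES - effective_limit

print(f"effective limit: {effective_limit} min")
print(f"unreachable part of the timeout: {unreachable_minutes} min")
```

So with the default input on a GitHub-hosted runner, the runner limit is always hit before the action's own timeout.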

jamacku commented 1 month ago

We can check if the execution time is greater than the timeout input and only then cancel the TF request.
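
A rough sketch of that check, in Python for brevity. This is not the action's actual code: `should_cancel_request` and its parameters are hypothetical names, and it assumes both the elapsed time and the timeout input are available in minutes.

```python
def should_cancel_request(elapsed_minutes: float, timeout_minutes: float) -> bool:
    """Return True only when the job outlived the configured timeout.

    Hypothetical helper: if the job stopped earlier for another reason
    (for example the runner's own 6 hour limit), the TF request is left alone.
    """
    return elapsed_minutes > timeout_minutes


# Job stopped at 6h with a 480m timeout: do not cancel the TF request.
stopped_by_runner = should_cancel_request(360, 480)

# Job ran past the configured timeout: cancel the TF request.
timed_out = should_cancel_request(481, 480)
```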

mcattamoredhat commented 2 weeks ago

Hi @jamacku, although I've switched to sclorg/testing-farm-as-github-action v3.1.0, I still see this issue in a few tests, such as https://github.com/virt-s1/rhel-edge/actions/runs/10553424096 (iot-f39-x86). Is there something I am missing? Could you please provide some guidance? Thanks!

jamacku commented 2 weeks ago

@mcattamoredhat, I may have missed something. I'll have a look. It should work without any additional configuration from your side.

jamacku commented 2 weeks ago

The problem might be that the job ran for 5h 59min 56s and was then killed by the runner, but we are expecting 6h.

I'll adjust the value.
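
For illustration, this is the kind of comparison that misses by a few seconds, and one possible way to loosen it. The function name and the 60 second margin are assumptions made for this sketch, not the value used in the action.

```python
RUNNER_LIMIT_SECONDS = 6 * 60 * 60  # 6 hours


def hit_runner_limit(elapsed_seconds: int, margin_seconds: int = 60) -> bool:
    """Treat a job as killed by the runner if it ended within
    `margin_seconds` of the 6 hour limit (the margin is a hypothetical value)."""
    return elapsed_seconds >= RUNNER_LIMIT_SECONDS - margin_seconds


# The run from this thread: 5h 59min 56s.
elapsed = 5 * 3600 + 59 * 60 + 56

# An exact comparison against 6h misses this run by 4 seconds.
strict = elapsed >= RUNNER_LIMIT_SECONDS

# With a small margin the run is recognized as hitting the runner limit.
with_margin = hit_runner_limit(elapsed)
```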
