System testing CI workflow gets triggered twice and one of them (tends to) fail

ManavalanG commented 1 year ago

GitHub actions triggers two system tests when I push a commit to PR - one due to commit action and the other due to PR activity. For some reason, one of them goes to completion while the other gets cancelled (not failed) by github. In this screenshot, runs #22 and #23 are at the same commit and the same is true for #24 and #25

And which workflow gets cancelled appears to be random. In this screenshot, failed run #22 was triggered by commit wheread failed run #25 by pull request

For the cancelled step, logs dont show up in their UI and instead it just says ` Error: The log was not found. It may have been deleted based on retention settings.

However if you download and check the logs (settings gear button at top right -> download logs archive), it shows that github cancelled the run

I briefly looked up why they get cancelled by github, but I didn’t get far unfortunately

ManavalanG commented 1 year ago

After discussion with @wilkb777 , we decided to turn off synchronize PR trigger. Thinking here is that push trigger will run the CI workflow anyway and therefore turning off synchronize PR trigger will not miss anything.

ManavalanG commented 1 year ago

System testing CI 34 got triggered when PR was opened, but it failed when running third (of four) quac task. Note that another CI run 33 that was on the same commit 34a6683fc6652a76b366f2abf0b35f91bc916faf passed successfully.

Note that these two CI runs were not run at the same time, and therefore we can't say they failed for some reason due to them both running together (which makes sense).

ManavalanG commented 1 year ago

I removed PR trigger from sys testing CI anyway as it didn't much of value. In the edge case where commit trigger was not sufficient, they can be triggered manually.

ManavalanG commented 1 year ago

When looking at the error logs, all failed jobs were Java based tools.

$ grep -i killed "logs_251/System testing - QuaC/12_Run QuaC system testing - WGS mode AND uses prior QC data.txt"
2023-06-26T22:21:06.0115870Z /usr/local/bin/picard: line 66: 122248 Killed                  /usr/local/bin/java -Xms512m -Xmx2g -jar /usr/local/share/picard-2.23.0-0/picard.jar CollectMultipleMetrics "I=/home/runner/work/quac/quac/.test/ngs-data/test_project/analysis/A/bam/A.bam" "O=/home/runner/work/quac/quac/data/quac/results/test_project_2samples_wgs-include_priorQC/analysis/A/qc/picard-stats/A" "R=data/external/dependency_datasets/reference_genome/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna" "PROGRAM=null" "PROGRAM=CollectAlignmentSummaryMetrics" "PROGRAM=CollectQualityYieldMetrics"
2023-06-26T22:21:58.2552721Z /usr/local/bin/picard: line 66: 122257 Killed                  /usr/local/bin/java -Xms512m -Xmx2g -jar /usr/local/share/picard-2.23.0-0/picard.jar CollectMultipleMetrics "I=/home/runner/work/quac/quac/.test/ngs-data/test_project/analysis/B/bam/B.bam" "O=/home/runner/work/quac/quac/data/quac/results/test_project_2samples_wgs-include_priorQC/analysis/B/qc/picard-stats/B" "R=data/external/dependency_datasets/reference_genome/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna" "PROGRAM=null" "PROGRAM=CollectAlignmentSummaryMetrics" "PROGRAM=CollectQualityYieldMetrics"
2023-06-26T22:22:26.8175429Z /usr/local/bin/qualimap: line 81: 122423 Killed                  java $java_options -classpath '/usr/local/share/qualimap-2.2.2d-2/*' org.bioinfo.ngs.qc.qualimap.main.NgsSmartMain "${ARGS[@]}"
2023-06-26T22:23:44.8710104Z /usr/local/bin/qualimap: line 81: 122941 Killed                  java $java_options -classpath '/usr/local/share/qualimap-2.2.2d-2/*' org.bioinfo.ngs.qc.qualimap.main.NgsSmartMain "${ARGS[@]}"

$grep -i "Error in rule" "logs_251/System testing - QuaC/12_Run QuaC system testing - WGS mode AND uses prior QC data.txt"
2023-06-26T22:21:07.7678923Z Error in rule picard_collect_multiple_metrics:
2023-06-26T22:21:59.8089053Z Error in rule picard_collect_multiple_metrics:
2023-06-26T22:22:29.8295227Z Error in rule qualimap_bamqc:
2023-06-26T22:23:46.7406383Z Error in rule qualimap_bamqc:
$ grep -i "restart" "logs_251/System testing - QuaC/12_Run QuaC system testing - WGS mode AND uses prior QC data.txt"
2023-06-26T22:21:07.7748926Z Trying to restart job 9.
2023-06-26T22:21:59.8144395Z Trying to restart job 11.
2023-06-26T22:22:29.8399851Z Trying to restart job 8.

grep -i "finished job" 12_Run\ QuaC\ system\ testing\ -\ WGS\ mode\ AND\ uses\ prior\ QC\ data.txt
2023-06-26T22:18:41.7424093Z Finished job 5.
2023-06-26T22:18:44.0120971Z Finished job 6.
2023-06-26T22:18:44.3112870Z Finished job 13.
2023-06-26T22:18:44.6479469Z Finished job 14.
2023-06-26T22:18:55.4055393Z Finished job 32.
2023-06-26T22:18:55.8008180Z Finished job 27.
2023-06-26T22:18:55.8010104Z Finished job 16.
2023-06-26T22:18:55.8013140Z Finished job 20.
2023-06-26T22:19:01.2470505Z Finished job 23.
2023-06-26T22:19:01.5295433Z Finished job 22.
2023-06-26T22:19:01.6080686Z Finished job 21.
2023-06-26T22:19:14.4623540Z Finished job 15.
2023-06-26T22:19:14.5106487Z Finished job 19.
2023-06-26T22:19:15.2058228Z Finished job 18.
2023-06-26T22:20:39.8515808Z Finished job 2.
2023-06-26T22:20:39.9238333Z Finished job 3.
2023-06-26T22:20:43.4485161Z Finished job 17.
2023-06-26T22:20:55.1716856Z Finished job 1.
2023-06-26T22:22:48.0085416Z Finished job 7.
2023-06-26T22:23:34.2232968Z Finished job 9.
2023-06-26T22:23:42.5153298Z Finished job 11.
2023-06-26T22:26:10.4718141Z Finished job 4.
2023-06-26T22:32:16.6725180Z Finished job 12.
2023-06-26T22:32:29.2849830Z Finished job 10.
2023-06-26T22:32:34.9323170Z Finished job 26.
2023-06-26T22:32:37.0417160Z Finished job 25.
2023-06-26T22:32:41.6685750Z Finished job 24.

Here is what happened. Three jobs (two of picard_collect_multiple_metrics and one of qualimap_bamqc) failed, snakemake reran them again, and two jobs corresponding to picard_collect_multiple_metrics passed while job qualimap_bamqc failed again.

I have looked at the successful sys testing CI logs earlier, and I noticed that even they had failed runs but they were successful when they were reran by snakemake again. For example, in the sample CI, see the log file 11_Run QuaC system testing - Exome mode AND no prior QC data.txt. It had four jobs fail (two picard_collect_multiple_metrics and two picard_collect_wgs_metrics) but they passed on the rerun. Note that they are all Java tools.

So my hypothesis now is that something with Java results in job failure and their rescue (when rerun) may fail or pass, randomly. Perhaps due to memory allocated to those tools and how many Java tools run at the same time?

ManavalanG commented 1 year ago

I wonder if we are better off not running sys testing at all via github actions. We spent too much time investigating issues than what we gained from them :(

ManavalanG commented 1 year ago

Recent sys testing CI 35 got cancelled by github (##[error]The operation was canceled). There was no redundant CI run for that commit. Another log point for the pain caused by sys testing CI :(

ManavalanG commented 1 year ago

Update: Push trigger was removed as well along with the PR trigger (#80). As the github action for system testing kept getting failed/cancelled for random reasons, we decided it was not worth the ROI. We learned a lot with this workflow, but it will need more effort and time to get the workflow behave in an expected manner and not getting cancelled by github actions for unknown reasons.

uab-cgds-worthey / quac

System testing CI workflow gets triggered twice and one of them (tends to) fail #79