Occasionally, we observe that tests running on GitHub Actions fail by hitting the timeout limit. The failures show a pattern but are not deterministic; in most cases "valgrind-tvorm" is the job that fails, and the problem is too difficult to debug. For the moment, it seems that the maintainers should try pressing the "Re-run failed jobs" button when some jobs fail. If a failing job fails again and again, then it is indeed a bug (or the test is too time-consuming).
See also https://github.com/actions/runner/issues/1326 and issues linked there (but the cause may be different from our issue).
Some thoughts and hypotheses:
Maybe some jobs are just slow. The performance of test programs running on virtual machines may fluctuate, and in unlucky cases a run would exceed the time limit. This looks true for some "valgrind-parform" cases, but when I increased the time limit from 5 minutes to 10 minutes, there was still a timeout in "valgrind-tvorm"; it looked like a real hang.
Maybe out-of-memory issues. FORM now requires over 2GB of memory allocation. In addition, Valgrind may consume some more memory. According to GitHub, however, jobs are running on Standard_DS2_v2 virtual machines in Microsoft Azure, which have 7 GiB RAM + 14 GiB disk, and it looks enough for our jobs.
Maybe some real logical problems in threaded parallelization. Indeed, Helgrind tells us there are a lot of possible misuses of the Pthreads API in the TFORM code. For example, valgrind --tool=helgrind sources/tvorm -w4 test.frm with an empty FORM program
.end
would give something like this log. Another example log by valgrind --tool=helgrind sources/tvorm -w4 -D TEST=Issue137_4 check/features.frm is here. Hopefully, many of them are indeed not problematic (by some implicit assumptions not expressed in the C language and generated machine code), but some of them may appear as real problems and cause deadlocks, with different I/O characteristics on virtual machines.
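To tell the "just slow" and "out of memory" hypotheses apart from a real hang, one could repeat a failing command several times under a generous time limit and record the wall time and peak memory of each run. The following Python script is only a rough sketch of that idea and is not part of the FORM test suite: the names run_once and summarize are made up for illustration, it assumes a Linux machine (where ru_maxrss is reported in KiB), and the 10-minute limit and 5 repetitions are arbitrary choices.

```python
import resource
import subprocess
import sys
import time


def run_once(cmd, timeout_s):
    """Run cmd once and return (status, wall_seconds, peak_rss_kib)."""
    start = time.monotonic()
    try:
        proc = subprocess.run(
            cmd,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        status = "ok" if proc.returncode == 0 else f"exit {proc.returncode}"
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child and waits for it before raising.
        status = "timeout"
    wall = time.monotonic() - start
    # High-water mark over all children waited for so far by this process,
    # so the value never decreases from one run to the next.
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return status, wall, peak


def summarize(results):
    """Condense a list of run_once() results into a small report."""
    finished = [wall for status, wall, _ in results if status != "timeout"]
    return {
        "runs": len(results),
        "timeouts": sum(1 for status, _, _ in results if status == "timeout"),
        "slowest_finished_s": max(finished, default=0.0),
        "peak_rss_kib": max((peak for _, _, peak in results), default=0),
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    command = sys.argv[1:]
    # 600 s mirrors the 10-minute limit tried above; 5 runs is arbitrary.
    print(summarize([run_once(command, 600.0) for _ in range(5)]))
```

Saved under a hypothetical name such as survey.py, it could be invoked as python3 survey.py valgrind --tool=helgrind sources/tvorm -w4 test.frm. If the slowest finished run is far below the limit while some runs still time out, that would point toward a hang rather than slowness. A peak RSS well below 7 GiB would make the out-of-memory hypothesis less likely. Neither observation is proof, because a local machine may behave differently from the GitHub-hosted runners.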