vermaseren / form

The FORM project for symbolic manipulation of very big expressions
GNU General Public License v3.0

Hangs on GitHub Actions #417

Open tueda opened 1 year ago

Occasionally, tests running on GitHub Actions fail because they hit the timeout limit. The failures show a pattern but are not deterministic: in most cases the "valgrind-tvorm" job is the one that fails, and it is too difficult to debug. For the moment, the maintainers should press the "Re-run failed jobs" button when some jobs fail. If a job keeps failing on repeated re-runs, then it is indeed a bug (or the test is too time-consuming).
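The re-run rule above can be written down as a small triage function. This is only a sketch in Python: the name `triage`, the `run_job` callable and the default of three attempts are assumptions made for illustration, and nothing here calls the GitHub Actions API.

```python
def triage(run_job, max_attempts=3):
    """Classify a job by re-running it until it passes or attempts run out.

    run_job is a hypothetical callable returning True on success.
    max_attempts=3 is an arbitrary stand-in for "again and again".
    """
    for attempt in range(1, max_attempts + 1):
        if run_job():
            # A pass on the first try is a normal pass; a pass only after
            # a re-run suggests a non-deterministic failure.
            return "passed" if attempt == 1 else "flaky"
    # Failing on every attempt points to a real bug or a too-slow test.
    return "persistent"


if __name__ == "__main__":
    outcomes = iter([False, True])  # fails once, then passes on the re-run
    print(triage(lambda: next(outcomes)))
```

A job classified as "persistent" is the case that deserves a proper investigation; a "flaky" one matches the occasional timeouts described in this issue.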

See also https://github.com/actions/runner/issues/1326 and issues linked there (but the cause may be different from our issue).

Some thoughts and hypotheses:

  1. Maybe some jobs are just slow. The performance of test programs running on virtual machines may fluctuate, and unlucky runs could exceed the time limit. This seems to be true for some "valgrind-parform" cases, but when I increased the time limit from 5 minutes to 10 minutes, "valgrind-tvorm" still timed out; it looked like a real hang.

  2. Maybe out-of-memory issues. FORM now needs to allocate over 2 GB of memory. In addition, Valgrind may consume some more memory. According to GitHub, however, jobs run on Standard_DS2_v2 virtual machines in Microsoft Azure, which have 7 GiB of RAM and 14 GiB of disk, and that looks sufficient for our jobs.

  3. Maybe some real logical problems in the threaded parallelization. Indeed, Helgrind reports a lot of possible misuses of the Pthreads API in the TFORM code. For example, running valgrind --tool=helgrind sources/tvorm -w4 test.frm on an empty FORM program

    .end

    would give something like this log. Another example log, produced by valgrind --tool=helgrind sources/tvorm -w4 -D TEST=Issue137_4 check/features.frm, is here. Hopefully, many of the reports are harmless (thanks to implicit assumptions that are not expressed in the C language or in the generated machine code), but some of them may be real problems that cause deadlocks under the different I/O characteristics of virtual machines.
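Because the hang is not deterministic, one way to tell "slow" from "stuck" outside GitHub Actions is to run the same command many times under a time limit and count how often it has to be killed. The following Python sketch does that with the standard `subprocess` module; the helper name `count_hangs` and the stand-in commands are assumptions for illustration, and the time limit has to be chosen well above the normal duration of the real command.

```python
import subprocess
import sys


def count_hangs(cmd, runs, time_limit):
    """Run cmd `runs` times; return (finished, timed_out) counts.

    A run counts as finished whenever the process exits within
    time_limit seconds, regardless of its exit status.
    """
    finished = timed_out = 0
    for _ in range(runs):
        try:
            # On timeout, subprocess.run kills the child, waits for it,
            # and then raises TimeoutExpired.
            subprocess.run(cmd, timeout=time_limit,
                           stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL)
            finished += 1
        except subprocess.TimeoutExpired:
            timed_out += 1
    return finished, timed_out


if __name__ == "__main__":
    # Stand-in commands so that the sketch runs anywhere. Locally one would
    # pass something like
    # ["valgrind", "--tool=helgrind", "sources/tvorm", "-w4", "test.frm"].
    quick = [sys.executable, "-c", "pass"]
    stuck = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(count_hangs(quick, runs=3, time_limit=10))
    print(count_hangs(stuck, runs=2, time_limit=1))
```

If the timed-out count stays above zero even after the limit is raised generously, as happened when going from 5 to 10 minutes for "valgrind-tvorm", a real hang (hypothesis 3) becomes more plausible than plain slowness (hypothesis 1).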