vermaseren / form

The FORM project for symbolic manipulation of very big expressions
GNU General Public License v3.0

Draft: Add color.h test #524

Open jodavies opened 1 month ago

jodavies commented 1 month ago

This test seems trickier than expected. Currently:

mpirun -np {1,2} parform: OK
mpirun -np {3,4} parform: hang?
mpirun -np {5,6} parform: crash

valgrind vorm: OK
valgrind tvorm -w2: mostly hangs (takes 100-200s), sometimes finishes in 2s
valgrind tvorm -w4: OK, finishes in 2s

The CI sees valgrind errors in vorm and tvorm that I can't reproduce locally. Edit: I can reproduce them on Ubuntu 20.04 (which the runners are using) but not on 22.04.

jodavies commented 1 month ago

I added some print statements to PF_UnpackRedefinedPreVars to try to work out what happens. There are many successful redefines, and then we have:

0 i = 0
0 trying to redefine ik1 (35) to 1
0    loop j = 0; j < 2
0       AC.pfirstnum[0] (ik1c) (37), index 35
0       AC.pfirstnum[1] (adj) (38), index 35

So it fails to find the variable it is trying to redefine in AC.pfirstnum. It then evaluates

if ( AC.inputnumbers[j] < inputnumber ) {

for j=2, which causes the Valgrind error "Conditional jump or move depends on uninitialised value(s)".
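To make the failure mode concrete, here is a minimal sketch of a lookup that handles the "not found" case before touching the second array. It is not the FORM source: `RedefTable`, `find_redef` and `update_if_newer` are hypothetical names, and the fixed array size is an assumption made only for illustration.

```c
#include <stddef.h>

/* Hypothetical stand-in for the two parallel arrays in the log:
 * firstnum[j] identifies a redefined variable and inputnumbers[j]
 * is the input number recorded for it. Only `count` slots are valid. */
typedef struct {
    int    firstnum[8];
    long   inputnumbers[8];
    size_t count;
} RedefTable;

/* Return the slot whose firstnum matches `index`, or -1 if none does. */
static int find_redef(const RedefTable *t, int index)
{
    for (size_t j = 0; j < t->count; j++) {
        if (t->firstnum[j] == index) return (int)j;
    }
    return -1;
}

/* Return 1 if the entry was updated, 0 if it was already up to date,
 * and -1 if `index` is absent. The absent case returns early, so
 * inputnumbers[count] (an unset slot) is never read. */
static int update_if_newer(RedefTable *t, int index, long inputnumber)
{
    int j = find_redef(t, index);
    if (j < 0) return -1;
    if (t->inputnumbers[j] < inputnumber) {
        t->inputnumbers[j] = inputnumber;
        return 1;
    }
    return 0;
}
```

With a table holding 37 and 38, as in the log above, looking up 35 returns -1 instead of falling through to slot 2. This only avoids the uninitialised read; it does not explain why the entry is missing in the first place.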

So, why is ik1 no longer in the AC.pfirstnum array? Earlier in the program it was there (and had index 35).

tueda commented 1 month ago

Thanks for the investigation. I will look into the ParFORM issue. (You know, in programming, the person you were a month ago is a stranger. Then, the person you were more than 10 years ago is...)

By the way, maybe this (the code that triggers the Valgrind error) should be broken up into small unit tests.

jodavies commented 2 weeks ago

It seems the problem with valgrind and tvorm -w2 is due to the load balancing. The same issue happens with -w3. Running this test under callgrind shows that ThreadsProcessor makes hundreds of millions of calls to LoadReadjusted (which steals terms from the working thread and distributes them among the idle threads), and these calls also involve locks.

With -w4, there are only ~400 calls to LoadReadjusted.

Edit: it does seem to be some kind of race condition, though: if I add a MesPrint to LoadReadjusted, it prints only ~400 times, even when running under valgrind.
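Since adding a print changes the behaviour, one lighter way to count calls is a relaxed atomic counter, which perturbs timing less than formatted output usually does. The sketch below is standalone and assumes nothing about FORM internals: `load_readjusted_stub`, `worker` and `count_calls` are hypothetical names, and the stub stands in for the real function only to show the counting pattern.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Shared call counter; static storage, so it starts at zero. */
static atomic_ulong readjust_calls;

/* Hypothetical stand-in for the instrumented function: it only
 * bumps the counter, with no I/O and no lock. */
static void load_readjusted_stub(void)
{
    atomic_fetch_add_explicit(&readjust_calls, 1, memory_order_relaxed);
}

/* Each worker calls the stub a fixed number of times. */
static void *worker(void *arg)
{
    unsigned long n = *(const unsigned long *)arg;
    for (unsigned long i = 0; i < n; i++) load_readjusted_stub();
    return NULL;
}

/* Run up to 16 workers and return the total number of stub calls. */
static unsigned long count_calls(int nthreads, unsigned long per_thread)
{
    pthread_t tid[16];
    int started = 0;
    if (nthreads > 16) nthreads = 16;
    atomic_store(&readjust_calls, 0);
    for (int i = 0; i < nthreads; i++) {
        /* Stop creating threads on the first failure. */
        if (pthread_create(&tid[started], NULL, worker, &per_thread) != 0) break;
        started++;
    }
    for (int i = 0; i < started; i++) pthread_join(tid[i], NULL);
    return atomic_load(&readjust_calls);
}
```

The total is read only after all workers have been joined, so it is exact even though the increments themselves are relaxed. Build with `-pthread` where the platform requires it.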

The easiest solution is to just disable valgrind for this test...

jodavies commented 2 weeks ago

Once #525 is merged, we can rebase this on top of it and the parform tests will also run successfully.