vermaseren / form

The FORM project for symbolic manipulation of very big expressions
GNU General Public License v3.0
982 stars 118 forks source link

Log file writing failure with threadstats #470

Closed ahamaline closed 3 months ago

ahamaline commented 4 months ago

When I run Tform on my Windows laptop using WSL, the log file suddenly stops being updated at some point during the calculation, at a point when the workers are running. This has happened for me on all of the 4 occasions when Threadstats was turned on, but not otherwise. The computation itself continues without a problem.

My guess is that there is a problem with the way WSL handles the the race conditions when several threads are trying to write to the file at the same time.

tueda commented 4 months ago

Thank you for your report. To better understand and address the issue, could you provide more specific details? For example,

ahamaline commented 4 months ago

The top line on my terminal reads Welcome to Ubuntu 22.04.3 LTS (GNU/Linux 5.15.133.1-microsoft-standard-WSL2 x86_64) The Windows version is Windows 11 version 22H2.

My FORM version is 5.0.0-beta.

I'm afraid I won't be able to work on producing a minimal example right now, life is busy...

ahamaline commented 4 months ago

Reporting that I tried swapping the FORM version to the one in the Git branch "4.3"; version 4.3.1. I got the same behavior as I described above.

ahamaline commented 3 months ago

So it turns out that I am having similar problems running my job on a native Linux system, an HPC cluster running Rocky Linux 8. I changed the title on the issue accordingly. The log file stops being updated at a "difficult" point in the run, while the computation itself continues. In these cases the computation itself stopped a few hours later, but without a logfile I have no idea why.

I don't think the logfile issue is likely to be due to a bug in my code, since it has happened at several different points in the computation. The only thing that seems potentially non-standard about my job is that I call TFORM from within Python, using subprocess.run, rather than from the terminal.

One clue: the last entry in the log file, in each of these incidents, was a "terms left" statistic reported by one of the workers, indicating that its small buffer was filled. I think that in each case it was the first time this happened in the job. Conversely, StageSort had been called before several times by each of the workers without a problem.

I am interested in trying to produce a minimal example to help you debug this, but the truth is I don't really know how to go about doing that; I would just be trying random things blindly. Perhaps you can provide some guidance about what sort of things to try?

tueda commented 3 months ago

Thank you for the update. I think one thing we need to confirm is whether the problem occurs in the absence of Python.

One clue: the last entry in the log file, in each of these incidents, was a "terms left" statistic reported by one of the workers, indicating that its small buffer was filled. I think that in each case it was the first time this happened in the job. Conversely, StageSort had been called before several times by each of the workers without a problem.

I am interested in trying to produce a minimal example to help you debug this

Is it possible for you to provide

  1. some of the last lines, say 10 lines, of the stopped log files, so we can see what you exactly see,
  2. the code that reproduces the issue, if you can share it. Size is no problem, though smaller is better. You can use "Attach files" or other services like GitHub Gists if the files are big.
jodavies commented 3 months ago

Just to check: is this a log file generated with -l or by redirecting stdout to a file?

ahamaline commented 3 months ago

It is generated with the -f flag in the TFORM call.

Right now I am trying it without Python; I will upload the files when I see the outcome.

ahamaline commented 3 months ago

It turns out that the issue does not occur when I call TFORM directly from the command line, nor when I do so through Python, by calling python3 and then >>>import subprocess >>>p=subprocess.run(("tform -w4 -f A_A_A_A.frm").split(),stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=True)

which is the same syntax as the call in my program. It only occurs when I run the program itself. I have no idea what the difference could be.

The program is MatchMakerEFT by Jose Santiago's group at the University of Granada, with some minor changes of my own. It computes the Wilson Coefficients for the EFT in terms of the parameters of a UV model. It uses Feynrules to get the Feynman Rules for two models (UV and EFT), calls QGRAF to get the relevant diagrams, computes their amplitudes in FORM, and does the matching calculation in Mathematica, with all of these being called by Python.

One of my changes was calling TFORM instead of FORM. I don't know why the authors did not do this; perhaps they tried and had the same trouble I am having...

I doubt you want to install the dependencies and run the full program, so I am sending only the files that are involved in the TFORM call. In this example we are computing the one-loop four-photon amplitude in QED, to fourth (lowest) order in the momenta. A_A_A_A_form.zip

The call is on line 150 of run_form.py, which calls the FORM program A_A_A_A.frm. The formula to be computed is given in A_A_A_A.aux, and the calculation itself is in procedures.h. The main part of the amplitude is printed in the output file A_A_A_A.out, but there is a numerical coefficient that is supposed to appear as the last line of the log file A_A_A_A.log. So failure to complete the log file ruins the result, even though the .out file is correct. I am also including the good log file that I got when calling TFORM from Python outside the program, which I renamed to A_A_A_A_good.log.

The only change that is needed in order to run the computation outside of MatchMakerEFT is to change the path for the procedures.h files, on line 3 of A_A_A_A.frm.

jodavies commented 3 months ago

I am confused by -F actually: the manual states that output should only be to the log, and no statistics should be printed. Rather I get an output on screen, without statistics, and output with statistics in the log.

For your files: is A_A_A_A.frm supposed to be present in the directory, or is it generated? So far I did not get the python code to run. I had to copy a few extra files from the libs dir of the matchmaker repo.

ahamaline commented 3 months ago

Oh, I'm so sorry, the .frm file got left out of the zip by accident. I fixed it now, as an edit to my previous reply.

I indeed do not get any output to the console with the -f flag, but I do get statistics in the log file. This is probably because I have On Statistics; explicitly, in generaldefs.h.

I'm sorry if I gave the impression that you would be able to run the Python code as is! I only included run_form.py so that you could see the TFORM call in its context, in case there is some clue there. If you want to actually run the Python code you will probably need to install MatchMakerEFT and its dependencies and create the models from my .fr files. Let me know if you want to do that and I'll provide some instructions and the files.

tueda commented 3 months ago

Actually, I didn't know the exact behaviour of -f and -F options, niether...

The following code

L F = 1;
P;
.end

prints F to stdout even if -f or -F is used.

On the other hand,

L F = 1;
P +f;
.end

doesn't print F to stdout when -f or -F option is used.

I combined the provided files into one file. This file uses print +f at the end, so if I put -F then it must not give any output.

Then the problem is that somehow v5.0.0-beta.1-30-ge99543a prints, with the -F option (tform -w24 -F test.frm), in most cases,

   amplAAAA =
       - 1/315*esfull(1);

to stdout and then test.log stopped at

             Thread 22 reporting
Time =       4.13 sec    Generated terms =     861185
           expr1    5017 Terms left      =       5488
                         Bytes used      =     345224

To me, somehow 5.0beta with -f or 4.3 (v4.3.1-16-g86e5def) with -f and -F don't print amplAAAA to stdout (it's written into test.log). Weird...

Edit: by the way, thank you for providing your files, @ahamaline!

ahamaline commented 3 months ago

Oh dear, I see that the .aux file I provided only contains the formula for one of the six diagrams. I apologize. I've fixed it now in another edit to that comment. That explains why the numerical coefficient you get is different from mine in A_A_A_A_good.log. Actually, I'd expect that if you run it without some of the diagrams you should be left at the end with unused terms expr2+...+expr6 ; I'm not sure why that doesn't appear in your output.

I'm glad to see you were able to reproduce the bug of the logfile stopping! So apparently it can happen without Python as well. Perhaps it is simply a matter of chance that I didn't get the problem when running outside of MME? But if it's random, why does it have the regularity of happening at the first "terms left" statistic?

Note by the way that in my case that happened on the second diagram; I guess the result of the first diagram was still taking up space in the small buffer. Also note that the issue does not occur with the default setting for the small buffer size. My form.set (computed by Formset) makes the buffers rather small, in accordance with the available memory on my laptop.

As for outputs to stdout - I did not see any such outputs, neither when running directly, nor through Python.

jodavies commented 3 months ago

Running @tueda 's file on my machine, I get the same behaviour on the master branch with -F. Not with tvorm though, there I have nothing on stdout, and no thread statistics in the log (only the master statistics).

ahamaline commented 3 months ago

Did the log file stop for you as well? And if so, did it happen immediately after the first "terms left"?

jodavies commented 3 months ago

Yes, the log stops. It is not the "first" terms left, though. If you run with -l, you get threadstats printed immediately after entering this module. For me it takes a minute, until the strange threadstats print appears and the log stops.

I will try to run this overnight with valgrind, but it is very slow...

ahamaline commented 3 months ago

What do you mean by "this module"?

I also do get threadstats earlier in the logfile, but for me they were all "terms in thread" rather than "terms left". If you did get "terms left" before the stop, did you also get "terms active"?

And what was "the strange threadstats print" that you got before the stop? For me the last entry looks like a normal threadstats; again except for the point that it's the first "terms left" to appear.

ahamaline commented 3 months ago

BTW, I just noticed another strange thing to add to the confusion: in my file A_A_A_A_good.log, from the time I called TFORM from the Python REPL, there are many. many "terms left" statistics and some "terms active", as if the buffers were truly tiny - although I had the same form.set file in the directory for all the runs.

On the other hand, in the run (from MME) where the logfile stopped, "terms left" does not appear at all for the first diagram (which makes sense because all of the "bytes used" statistics are smaller than smallsize in form.set), but then it does appear for the second diagram, with a smaller "bytes used". And that's when the log stops.

jodavies commented 3 months ago

By "this module" I mean the code up to the sort in the linearize procedure, when it is called from L848 of tueda's file. This is most of the runtime of the script.

In my log there are no threadstats at all, until the one that is printed when it stops:

 Time =       0.09 sec    Generated terms =          1
            expr6         Terms in output =          1
                          Bytes used      =         36
              Thread 1 reporting
 Time =      65.25 sec    Generated terms =    6925252
            expr1   36575 Terms left      =      33932
                          Bytes used      =    2001160

After compiling tform with -g in preparation for valgrind, I actually don't seem to be able to reproduce the issue anymore.

ahamaline commented 3 months ago

So I think you are saying that the fact that a threadstat appeared at all is anomalous, since there were no others? But I think that threadstats are on by default. Do we know that there were other threadstats that could have appeared earlier? Perhaps this is simply the first threadstats of the run, and it took a minute because the thread was working on it without clearing the small buffer?

You wrote that with the -l flag, you did get many threadstats. In that run, were the "bits used" in those "terms left" threadstats much smaller than this 2Mb? If for some reason there is a difference in behavior regarding when the small buffer gets cleared, that could explain why you got more threadstats in that case, and also why they appeared after a shorter time.

tueda commented 3 months ago

I've fixed it now in another edit to that comment.

I think this means https://github.com/vermaseren/form/issues/470#issuecomment-1989329018, but now opening the link "A_A_A_A_form.zip" gives "Error 404 Not Found".

ahamaline commented 3 months ago

That's weird; for me it downloads nicely. Do you thing it will help if I upload it again?

ahamaline commented 3 months ago

Well, doesn't hurt to try. Here it is again: A_A_A_A_form.zip

tueda commented 3 months ago

Sorry, maybe it was some cache problem of the web browser in my side. Thanks for your help.

jodavies commented 3 months ago

I did not manage to get valgrind to complete overnight. It runs too slowly for multithreaded processes.

If you add in WriteString in tools.c something like:

printf("AC.LogHandle = %d: %.*s\n", AC.LogHandle, num, str);

then you see all of the threadstats arrive in this function, in -F mode. But AC.LogHandle is -1, so they don't end up in the file. In -F mode, when the strange lone print of the threadstats appears in the log, for some reason AC.LogHandle is 2. After that you continue you see everything arriving in this function, including the final print of the result, but AC.LogHandle is always -1, so nothing ever ends up in the log file again.

There are many places in the code where AC.LogHandle is stored in something like oldLogHandle and then put back after some other operations... maybe something is missed in some case somewhere, so that AC.LogHandle remains -1 for the rest of the run?

WriteStats contains locks to prevent threads from writing at the same time, but the updating of LogHandle is not protected by locks. I wonder if this is just a race condition, that no one noticed because -F or -ll are not used much?

ahamaline commented 3 months ago

In all of my runs I did see threadstats in the log. But in the cases where it eventually stopped, there were only "terms in thread", and not "terms left" or "terms active", while in the runs without the problem I got all the types of threadstats.

ahamaline commented 3 months ago

Damn. I just realized that in all of the runs I did where everything worked, I used -f and not -F as in the call in run_form.py. I tried again, calling TFORM directly and with -F, and sure enough it stopped, just like you guys are seeing. So anything about Python or MME was a red herring.

In this case the stop was at the first diagram, like in the example you showed https://github.com/vermaseren/form/issues/470#issuecomment-1992156038, although in my other runs it only stopped at the second diagram. But like before, there were "terms in thread" threadstats earlier in the log.

Oh, and I also got the final line output to the terminal, like you guys saw and unlike what happened with -f.

ahamaline commented 3 months ago

Looking at the manual, I see that the flags -F, -L, and -ll are meant to suppress the "intermediate statistics". I guess that must mean the "terms left" and "terms active" statistics, whether or not multithreading is being used. And conversely, "terms in thread" statistics, when a thread finishes a module, must not be considered as "intermediate".

ahamaline commented 3 months ago

I tried another test: I added Off threadstats; in the code, and ran it with -F. In this case everything worked properly - there were no threadstats of any kind, the logfile was completed properly, and the final line was not output to the terminal.

So it seems that the output to stdout is another symptom of the same bug that causes the logfile to stop.

jodavies commented 3 months ago

I can't reproduce this with https://github.com/jodavies/form/commit/9708e234980ec55207989559dfa2b9faad834361 , can you test it?

tueda commented 3 months ago

Here is a small program that has the problems with tform -w2 -F:

#:TermsInSmall 5
#define N1 "1"
#define N2 "5"
S x1,...,x9;
L F = (x1+...+x9)^`N1';
.sort
multiply (x1+...+x9)^`N2';
P +f;
.end

which leads to the logfile stopping and prints F to stdout.

jodavies commented 3 months ago

This test looks good on my branch also. I am not sure how to include it in the test suite though, on the master branch for me it does not have the issue 100% of the time.

tueda commented 3 months ago

I have confirmed that v5.0.0-beta.1-32-g0da393c (tform -w2 -F) works well with the small test.

These ErrorMessageLock make sense to me: Helgrind log and drd log for the small test shows that lines around sort.c:848 on the master branch:

https://github.com/vermaseren/form/blob/e99543ad939ef823f5e8db0678aa18028e7de707/sources/sort.c#L847-L853

are dangerous in the sense of data race:

$ grep EndSort helgrind.log | tr -s ' ' | cut -d ' ' -f 4,5 | sort | uniq
EndSort (sort.c:1005)
EndSort (sort.c:1006)
EndSort (sort.c:1007)
EndSort (sort.c:1076)
EndSort (sort.c:1101)
EndSort (sort.c:1102)
EndSort (sort.c:726)
EndSort (sort.c:848)
EndSort (sort.c:850)
EndSort (sort.c:851)
EndSort (sort.c:852)
EndSort (sort.c:986)
$ grep EndSort drd.log | tr -s ' ' | cut -d ' ' -f 4,5 | sort | uniq
EndSort (sort.c:1005)
EndSort (sort.c:1006)
EndSort (sort.c:1007)
EndSort (sort.c:1088)
EndSort (sort.c:1101)
EndSort (sort.c:1102)
EndSort (sort.c:726)
EndSort (sort.c:848)
EndSort (sort.c:850)
EndSort (sort.c:851)
EndSort (sort.c:852)
EndSort (sort.c:986)
EndSort (sort.c:999)

Example:

==453604== Possible data race during read of size 4 at 0x2F4370 by thread #3
==453604== Locks held: 1, at address 0x532EF98
==453604==    at 0x222B72: EndSort (sort.c:848)
==453604==    by 0x2541C7: RunThread (threads.c:1472)
==453604==    by 0x4855B3A: mythread_wrapper (in /home/linuxbrew/.linuxbrew/Cellar/valgrind/3.22.0/libexec/valgrind/vgpreload_helgrind-amd64-linux.so)
==453604==    by 0x4E08C1D: start_thread (in /home/linuxbrew/.linuxbrew/Cellar/glibc/2.35_1/lib/libc.so.6)
==453604==    by 0x4E88B4F: clone (in /home/linuxbrew/.linuxbrew/Cellar/glibc/2.35_1/lib/libc.so.6)
==453604== 
==453604== This conflicts with a previous write of size 4 by thread #2
==453604== Locks held: 1, at address 0x532ED98
==453604==    at 0x222BBC: EndSort (sort.c:852)
==453604==    by 0x2541C7: RunThread (threads.c:1472)
==453604==    by 0x4855B3A: mythread_wrapper (in /home/linuxbrew/.linuxbrew/Cellar/valgrind/3.22.0/libexec/valgrind/vgpreload_helgrind-amd64-linux.so)
==453604==    by 0x4E08C1D: start_thread (in /home/linuxbrew/.linuxbrew/Cellar/glibc/2.35_1/lib/libc.so.6)
==453604==    by 0x4E88B4F: clone (in /home/linuxbrew/.linuxbrew/Cellar/glibc/2.35_1/lib/libc.so.6)
==453604==  Address 0x2f4370 is 6928 bytes inside data symbol "A"
tueda commented 3 months ago

I ran the big program (test470.frm), which include diags2 to diags6, with v5.0.0-beta.1-32-g0da393c (tform -w24 -F) 10 times. It successfully wrote the whole output to the log file 10 out of 10 times.

jodavies commented 3 months ago

OK, great. I will add some comments and squash the commits for a PR. For a test, I did not see a way to supply extra command-line args (-F) for a particular test in check.rb, but if we add that and have the test run 10 times or so, it is very likely to pick up a problem.

tueda commented 3 months ago

For a test, I did not see a way to supply extra command-line args (-F) for a particular test in check.rb

This has not been implemented yet, but you can see a workaround in check/checkpoint/checkpoint.frm.

To write a test for the checkpoint mechanism, the -R option is mandatory, so FORM is executed from FORM:

#system -e `FORM' -R 1.frm

Here FORM is, for example, /path/to/tform -w4, passed as an environment variable, and 1.frm is the generated temporary FORM file extracted from the (first) test case.

I wrote a test here. I will push it after merging your PR.

ahamaline commented 3 months ago

Confirming that I pulled & installed the new commit and it works properly for me. Thank you!