Closed ahamaline closed 3 months ago
Thank you for your report. To better understand and address the issue, could you provide more specific details? For example,
The top line on my terminal reads Welcome to Ubuntu 22.04.3 LTS (GNU/Linux 5.15.133.1-microsoft-standard-WSL2 x86_64) The Windows version is Windows 11 version 22H2.
My FORM version is 5.0.0-beta.
I'm afraid I won't be able to work on producing a minimal example right now, life is busy...
Reporting that I tried swapping the FORM version to the one in the Git branch "4.3"; version 4.3.1. I got the same behavior as I described above.
So it turns out that I am having similar problems running my job on a native Linux system, an HPC cluster running Rocky Linux 8. I changed the title on the issue accordingly. The log file stops being updated at a "difficult" point in the run, while the computation itself continues. In these cases the computation itself stopped a few hours later, but without a logfile I have no idea why.
I don't think the logfile issue is likely to be due to a bug in my code, since it has happened at several different points in the computation. The only thing that seems potentially non-standard about my job is that I call TFORM from within Python, using subprocess.run, rather than from the terminal.
One clue: the last entry in the log file, in each of these incidents, was a "terms left" statistic reported by one of the workers, indicating that its small buffer was filled. I think that in each case it was the first time this happened in the job. Conversely, StageSort had been called before several times by each of the workers without a problem.
I am interested in trying to produce a minimal example to help you debug this, but the truth is I don't really know how to go about doing that; I would just be trying random things blindly. Perhaps you can provide some guidance about what sort of things to try?
Thank you for the update. I think one thing we need to confirm is whether the problem occurs in the absence of Python.
One clue: the last entry in the log file, in each of these incidents, was a "terms left" statistic reported by one of the workers, indicating that its small buffer was filled. I think that in each case it was the first time this happened in the job. Conversely, StageSort had been called before several times by each of the workers without a problem.
I am interested in trying to produce a minimal example to help you debug this
Is it possible for you to provide
Just to check: is this a log file generated with -l
or by redirecting stdout to a file?
It is generated with the -f flag in the TFORM call.
Right now I am trying it without Python; I will upload the files when I see the outcome.
It turns out that the issue does not occur when I call TFORM directly from the command line, nor when I do so through Python, by calling python3
and then
>>>import subprocess
>>>p=subprocess.run(("tform -w4 -f A_A_A_A.frm").split(),stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=True)
which is the same syntax as the call in my program. It only occurs when I run the program itself. I have no idea what the difference could be.
The program is MatchMakerEFT by Jose Santiago's group at the University of Granada, with some minor changes of my own. It computes the Wilson Coefficients for the EFT in terms of the parameters of a UV model. It uses Feynrules to get the Feynman Rules for two models (UV and EFT), calls QGRAF to get the relevant diagrams, computes their amplitudes in FORM, and does the matching calculation in Mathematica, with all of these being called by Python.
One of my changes was calling TFORM instead of FORM. I don't know why the authors did not do this; perhaps they tried and had the same trouble I am having...
I doubt you want to install the dependencies and run the full program, so I am sending only the files that are involved in the TFORM call. In this example we are computing the one-loop four-photon amplitude in QED, to fourth (lowest) order in the momenta. A_A_A_A_form.zip
The call is on line 150 of run_form.py, which calls the FORM program A_A_A_A.frm. The formula to be computed is given in A_A_A_A.aux, and the calculation itself is in procedures.h. The main part of the amplitude is printed in the output file A_A_A_A.out, but there is a numerical coefficient that is supposed to appear as the last line of the log file A_A_A_A.log. So failure to complete the log file ruins the result, even though the .out file is correct. I am also including the good log file that I got when calling TFORM from Python outside the program, which I renamed to A_A_A_A_good.log.
The only change that is needed in order to run the computation outside of MatchMakerEFT is to change the path for the procedures.h files, on line 3 of A_A_A_A.frm.
I am confused by -F
actually: the manual states that output should only be to the log, and no statistics should be printed. Rather I get an output on screen, without statistics, and output with statistics in the log.
For your files: is A_A_A_A.frm supposed to be present in the directory, or is it generated? So far I did not get the python code to run. I had to copy a few extra files from the libs dir of the matchmaker repo.
Oh, I'm so sorry, the .frm file got left out of the zip by accident. I fixed it now, as an edit to my previous reply.
I indeed do not get any output to the console with the -f flag, but I do get statistics in the log file. This is probably because I have On Statistics;
explicitly, in generaldefs.h.
I'm sorry if I gave the impression that you would be able to run the Python code as is! I only included run_form.py so that you could see the TFORM call in its context, in case there is some clue there. If you want to actually run the Python code you will probably need to install MatchMakerEFT and its dependencies and create the models from my .fr files. Let me know if you want to do that and I'll provide some instructions and the files.
Actually, I didn't know the exact behaviour of -f
and -F
options, niether...
The following code
L F = 1;
P;
.end
prints F
to stdout even if -f
or -F
is used.
On the other hand,
L F = 1;
P +f;
.end
doesn't print F
to stdout when -f
or -F
option is used.
I combined the provided files into one file. This file uses print +f
at the end, so if I put -F
then it must not give any output.
Then the problem is that somehow v5.0.0-beta.1-30-ge99543a
prints, with the -F
option (tform -w24 -F test.frm
), in most cases,
amplAAAA =
- 1/315*esfull(1);
to stdout and then test.log
stopped at
Thread 22 reporting
Time = 4.13 sec Generated terms = 861185
expr1 5017 Terms left = 5488
Bytes used = 345224
To me, somehow 5.0beta with -f
or 4.3 (v4.3.1-16-g86e5def
) with -f
and -F
don't print amplAAAA
to stdout (it's written into test.log
). Weird...
Edit: by the way, thank you for providing your files, @ahamaline!
Oh dear, I see that the .aux file I provided only contains the formula for one of the six diagrams. I apologize. I've fixed it now in another edit to that comment. That explains why the numerical coefficient you get is different from mine in A_A_A_A_good.log. Actually, I'd expect that if you run it without some of the diagrams you should be left at the end with unused terms expr2+...+expr6 ; I'm not sure why that doesn't appear in your output.
I'm glad to see you were able to reproduce the bug of the logfile stopping! So apparently it can happen without Python as well. Perhaps it is simply a matter of chance that I didn't get the problem when running outside of MME? But if it's random, why does it have the regularity of happening at the first "terms left" statistic?
Note by the way that in my case that happened on the second diagram; I guess the result of the first diagram was still taking up space in the small buffer. Also note that the issue does not occur with the default setting for the small buffer size. My form.set (computed by Formset) makes the buffers rather small, in accordance with the available memory on my laptop.
As for outputs to stdout - I did not see any such outputs, neither when running directly, nor through Python.
Running @tueda 's file on my machine, I get the same behaviour on the master branch with -F
. Not with tvorm
though, there I have nothing on stdout, and no thread statistics in the log (only the master statistics).
Did the log file stop for you as well? And if so, did it happen immediately after the first "terms left"?
Yes, the log stops. It is not the "first" terms left
, though. If you run with -l
, you get threadstats printed immediately after entering this module. For me it takes a minute, until the strange threadstats print appears and the log stops.
I will try to run this overnight with valgrind, but it is very slow...
What do you mean by "this module"?
I also do get threadstats earlier in the logfile, but for me they were all "terms in thread" rather than "terms left". If you did get "terms left" before the stop, did you also get "terms active"?
And what was "the strange threadstats print" that you got before the stop? For me the last entry looks like a normal threadstats; again except for the point that it's the first "terms left" to appear.
BTW, I just noticed another strange thing to add to the confusion: in my file A_A_A_A_good.log, from the time I called TFORM from the Python REPL, there are many. many "terms left" statistics and some "terms active", as if the buffers were truly tiny - although I had the same form.set file in the directory for all the runs.
On the other hand, in the run (from MME) where the logfile stopped, "terms left" does not appear at all for the first diagram (which makes sense because all of the "bytes used" statistics are smaller than smallsize in form.set), but then it does appear for the second diagram, with a smaller "bytes used". And that's when the log stops.
By "this module" I mean the code up to the sort in the linearize procedure, when it is called from L848 of tueda's file. This is most of the runtime of the script.
In my log there are no threadstats at all, until the one that is printed when it stops:
Time = 0.09 sec Generated terms = 1
expr6 Terms in output = 1
Bytes used = 36
Thread 1 reporting
Time = 65.25 sec Generated terms = 6925252
expr1 36575 Terms left = 33932
Bytes used = 2001160
After compiling tform with -g
in preparation for valgrind, I actually don't seem to be able to reproduce the issue anymore.
So I think you are saying that the fact that a threadstat appeared at all is anomalous, since there were no others? But I think that threadstats are on by default. Do we know that there were other threadstats that could have appeared earlier? Perhaps this is simply the first threadstats of the run, and it took a minute because the thread was working on it without clearing the small buffer?
You wrote that with the -l flag, you did get many threadstats. In that run, were the "bits used" in those "terms left" threadstats much smaller than this 2Mb? If for some reason there is a difference in behavior regarding when the small buffer gets cleared, that could explain why you got more threadstats in that case, and also why they appeared after a shorter time.
I've fixed it now in another edit to that comment.
I think this means https://github.com/vermaseren/form/issues/470#issuecomment-1989329018, but now opening the link "A_A_A_A_form.zip" gives "Error 404 Not Found".
That's weird; for me it downloads nicely. Do you thing it will help if I upload it again?
Well, doesn't hurt to try. Here it is again: A_A_A_A_form.zip
Sorry, maybe it was some cache problem of the web browser in my side. Thanks for your help.
I did not manage to get valgrind
to complete overnight. It runs too slowly for multithreaded processes.
If you add in WriteString
in tools.c
something like:
printf("AC.LogHandle = %d: %.*s\n", AC.LogHandle, num, str);
then you see all of the threadstats arrive in this function, in -F
mode. But AC.LogHandle
is -1
, so they don't end up in the file. In -F
mode, when the strange lone print of the threadstats appears in the log, for some reason AC.LogHandle
is 2
. After that you continue you see everything arriving in this function, including the final print of the result, but AC.LogHandle
is always -1
, so nothing ever ends up in the log file again.
There are many places in the code where AC.LogHandle
is stored in something like oldLogHandle
and then put back after some other operations... maybe something is missed in some case somewhere, so that AC.LogHandle
remains -1
for the rest of the run?
WriteStats
contains locks to prevent threads from writing at the same time, but the updating of LogHandle is not protected by locks. I wonder if this is just a race condition, that no one noticed because -F
or -ll
are not used much?
In all of my runs I did see threadstats in the log. But in the cases where it eventually stopped, there were only "terms in thread", and not "terms left" or "terms active", while in the runs without the problem I got all the types of threadstats.
Damn. I just realized that in all of the runs I did where everything worked, I used -f
and not -F
as in the call in run_form.py. I tried again, calling TFORM directly and with -F
, and sure enough it stopped, just like you guys are seeing. So anything about Python or MME was a red herring.
In this case the stop was at the first diagram, like in the example you showed https://github.com/vermaseren/form/issues/470#issuecomment-1992156038, although in my other runs it only stopped at the second diagram. But like before, there were "terms in thread" threadstats earlier in the log.
Oh, and I also got the final line output to the terminal, like you guys saw and unlike what happened with -f
.
Looking at the manual, I see that the flags -F
, -L
, and -ll
are meant to suppress the "intermediate statistics". I guess that must mean the "terms left" and "terms active" statistics, whether or not multithreading is being used. And conversely, "terms in thread" statistics, when a thread finishes a module, must not be considered as "intermediate".
I tried another test: I added Off threadstats;
in the code, and ran it with -F
. In this case everything worked properly - there were no threadstats of any kind, the logfile was completed properly, and the final line was not output to the terminal.
So it seems that the output to stdout is another symptom of the same bug that causes the logfile to stop.
I can't reproduce this with https://github.com/jodavies/form/commit/9708e234980ec55207989559dfa2b9faad834361 , can you test it?
Here is a small program that has the problems with tform -w2 -F
:
#:TermsInSmall 5
#define N1 "1"
#define N2 "5"
S x1,...,x9;
L F = (x1+...+x9)^`N1';
.sort
multiply (x1+...+x9)^`N2';
P +f;
.end
which leads to the logfile stopping and prints F
to stdout.
This test looks good on my branch also. I am not sure how to include it in the test suite though, on the master branch for me it does not have the issue 100% of the time.
I have confirmed that v5.0.0-beta.1-32-g0da393c
(tform -w2 -F
) works well with the small test.
These ErrorMessageLock
make sense to me: Helgrind log and drd log for the small test shows that lines around sort.c:848
on the master branch:
are dangerous in the sense of data race:
$ grep EndSort helgrind.log | tr -s ' ' | cut -d ' ' -f 4,5 | sort | uniq
EndSort (sort.c:1005)
EndSort (sort.c:1006)
EndSort (sort.c:1007)
EndSort (sort.c:1076)
EndSort (sort.c:1101)
EndSort (sort.c:1102)
EndSort (sort.c:726)
EndSort (sort.c:848)
EndSort (sort.c:850)
EndSort (sort.c:851)
EndSort (sort.c:852)
EndSort (sort.c:986)
$ grep EndSort drd.log | tr -s ' ' | cut -d ' ' -f 4,5 | sort | uniq
EndSort (sort.c:1005)
EndSort (sort.c:1006)
EndSort (sort.c:1007)
EndSort (sort.c:1088)
EndSort (sort.c:1101)
EndSort (sort.c:1102)
EndSort (sort.c:726)
EndSort (sort.c:848)
EndSort (sort.c:850)
EndSort (sort.c:851)
EndSort (sort.c:852)
EndSort (sort.c:986)
EndSort (sort.c:999)
Example:
==453604== Possible data race during read of size 4 at 0x2F4370 by thread #3
==453604== Locks held: 1, at address 0x532EF98
==453604== at 0x222B72: EndSort (sort.c:848)
==453604== by 0x2541C7: RunThread (threads.c:1472)
==453604== by 0x4855B3A: mythread_wrapper (in /home/linuxbrew/.linuxbrew/Cellar/valgrind/3.22.0/libexec/valgrind/vgpreload_helgrind-amd64-linux.so)
==453604== by 0x4E08C1D: start_thread (in /home/linuxbrew/.linuxbrew/Cellar/glibc/2.35_1/lib/libc.so.6)
==453604== by 0x4E88B4F: clone (in /home/linuxbrew/.linuxbrew/Cellar/glibc/2.35_1/lib/libc.so.6)
==453604==
==453604== This conflicts with a previous write of size 4 by thread #2
==453604== Locks held: 1, at address 0x532ED98
==453604== at 0x222BBC: EndSort (sort.c:852)
==453604== by 0x2541C7: RunThread (threads.c:1472)
==453604== by 0x4855B3A: mythread_wrapper (in /home/linuxbrew/.linuxbrew/Cellar/valgrind/3.22.0/libexec/valgrind/vgpreload_helgrind-amd64-linux.so)
==453604== by 0x4E08C1D: start_thread (in /home/linuxbrew/.linuxbrew/Cellar/glibc/2.35_1/lib/libc.so.6)
==453604== by 0x4E88B4F: clone (in /home/linuxbrew/.linuxbrew/Cellar/glibc/2.35_1/lib/libc.so.6)
==453604== Address 0x2f4370 is 6928 bytes inside data symbol "A"
I ran the big program (test470.frm
), which include diags2 to diags6, with v5.0.0-beta.1-32-g0da393c
(tform -w24 -F) 10 times. It successfully wrote the whole output to the log file 10 out of 10 times.
OK, great. I will add some comments and squash the commits for a PR. For a test, I did not see a way to supply extra command-line args (-F
) for a particular test in check.rb, but if we add that and have the test run 10 times or so, it is very likely to pick up a problem.
For a test, I did not see a way to supply extra command-line args (-F) for a particular test in check.rb
This has not been implemented yet, but you can see a workaround in check/checkpoint/checkpoint.frm
.
To write a test for the checkpoint mechanism, the -R
option is mandatory, so FORM is executed from FORM:
#system -e `FORM' -R 1.frm
Here FORM
is, for example, /path/to/tform -w4
, passed as an environment variable, and 1.frm
is the generated temporary FORM file extracted from the (first) test case.
I wrote a test here. I will push it after merging your PR.
Confirming that I pulled & installed the new commit and it works properly for me. Thank you!
When I run Tform on my Windows laptop using WSL, the log file suddenly stops being updated at some point during the calculation, at a point when the workers are running. This has happened for me on all of the 4 occasions when Threadstats was turned on, but not otherwise. The computation itself continues without a problem.
My guess is that there is a problem with the way WSL handles the the race conditions when several threads are trying to write to the file at the same time.