Open ggambuti opened 1 year ago
Thank you for your report!
The FindrNumber function looks like a subroutine that performs a binary search. It is called from many places in the function TermRenumber, with different buffers given as the second argument. The error message indicates that the information stored in the given buffer is wrong (the routine tried to search a group with -1 members, which sounds like a logic bug). To see which buffer has the problem, we need a stack backtrace.
Would it be possible for you to provide us with a minimal reproducible example?
Edit: though the routine avoids "the common bug of binary searches", somehow WORDDIF is used, as in med = hi - ((WORDDIF(hi,med))/2), which downcasts a 64-bit pointer difference into a 32-bit integer. This may be a problem when the buffer space for the search is enormously large (> 2^32).
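To illustrate the concern in the edit above, here is a minimal C sketch that works on plain 64-bit offsets instead of real pointers. The names WORD32, mid_offset_narrow and mid_offset_wide are hypothetical and do not come from the FORM sources, and it uses the conventional (hi, lo) difference rather than the exact expression quoted. It only shows how narrowing the difference to 32 bits before halving changes the midpoint once the range exceeds 2^32.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a 32-bit WORD type (an assumption for this sketch). */
typedef int32_t WORD32;

/* Midpoint with the difference narrowed to 32 bits first.
   Going through uint32_t keeps only the low 32 bits of the difference. */
static int64_t mid_offset_narrow(int64_t lo, int64_t hi)
{
    assert(hi >= lo);
    WORD32 diff = (WORD32)(uint32_t)(hi - lo);
    return hi - diff / 2;
}

/* Midpoint with the difference kept at full 64-bit width. */
static int64_t mid_offset_wide(int64_t lo, int64_t hi)
{
    assert(hi >= lo);
    return hi - (hi - lo) / 2;
}
```

For a small range such as lo = 0, hi = 17 both functions return 9. For hi = 2^32 + 10 the narrowed version sees a difference of only 10 and returns a "midpoint" of 2^32 + 5, which is almost at the top of the range instead of in the middle.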
Hi,
thanks for the help!
Unfortunately, I cannot reproduce this outside of the programs that I am running at the moment. These require many files and procedures to run, and only hit the bug after several hours of runtime.
I will try to follow your suggestion and reduce the buffer space by splitting up the computation more than I was already doing. Hopefully this way I can avoid the problem.
Question: Are you running Form, TForm or ParForm? There is a slight difference in how they deal with the renumbering.
I am running standard FORM, with no parallelization.
@ggambuti are you loading stored expressions here?
We found a very similar looking problem when loading store files from a particular diagram. Most of the diagrams run through with no problems, and a select few have a problem.
I attach some files which reproduce this issue with form 4.0, 4.1, 4.2, 4.3.
If I put a subset of the loaded expressions into my test expression, things work OK. Only if I sum all four, or 1,3 and 4, do I have the crash.
I have not had the time to look at it in the debugger yet. But here is a workaround:
#do i = 1,4
Load d2l1279sub`i'.res;
Local test`i' = d2l1279sub`i';
.sort:`i';
Delete storage;
Hide test`i';
.sort:`i'-2;
#enddo
.sort
Local test = test1+...+test4;
There is clearly a problem when loading several files simultaneously. This is rather complicated code though and I’ll have to take some time for it.
Attachment from jodavies's comment of 13 Feb 2023: form-crash.tar.gz https://github.com/vermaseren/form/files/10726170/form-crash.tar.gz
Hi Jos,
Thanks for this workaround suggestion, this should suffice for our code in the meantime.
Thanks, Josh.
Hi Jos,
Now we've come across the issue when loading just a single save file, so the workaround no longer works.
If I run the following, with bug.res attached,
on names;
load bug.res;
local ex = d2l494sub1;
.end
FORM gives
TFORM 4.3.0 (Dec 23 2022, v4.3.0-10-g6cc038c) 64-bits 12 workers Run: Wed Oct 25 11:57:48 2023
on names;
load bug.res;
d2l494sub1 loaded
local ex = d2l494sub1;
.end
Symbols
ufoGC11 ufoGC87 ufoGC112 ep(:1000) M1 M2 eM1 M1fromUFO Qtt cA1 cA2 Q12 Q13
Q33 tagtad tagH Q11 Q22
Commuting Functions
accprf Gamnoexp Dh mbox1l2m1m
Expressions
d2l494sub1(stored) ex(local)
FindrNumber: n = 14, list has 17 members
v->lo[17] = 1192
v->lo[16] = 1188
v->lo[15] = 1183
v->lo[14] = 1179
v->lo[13] = 1175
v->lo[12] = 1137
v->lo[11] = 1136
v->lo[10] = 1114
v->lo[9] = 1113
v->lo[8] = 1112
v->lo[7] = 1110
v->lo[6] = 162
v->lo[5] = 157
v->lo[4] = 156
v->lo[3] = 148
v->lo[2] = 131
v->lo[1] = 106
v->lo[0] = 30
Start with 0,9,17
New: 0,4,9, *med = 156
New: 0,2,4, *med = 131
New: 0,1,2, *med = 106
New: 0,0,1, *med = 30
Renumbering problems
Called from TermRenumber
Program terminating in thread 10 at form-test.frm Line 3 -->
Max. space for expressions: 0 bytes
0.00 sec + 0.00 sec: 0.00 sec out of 0.00 sec
This can be worked around by changing the saved expression slightly, for example by multiplying it by some additional symbol. But I hope this example helps with debugging if you get a chance.
Thanks, Josh.
If I enable an interesting line that is currently commented out, first = 0 in GetFromStore(), then FORM doesn't crash and loads something (though I'm not sure what first indicates):
Load bug.res;
L F = d2l494sub1;
P;
.sort
On names;
.end
FORM 5.0.0-beta.1 (Apr 4 2024, v5.0.0-beta.1-47-gd587b03-dirty) Run: Thu Apr 4 14:26:04 2024
Load bug.res;
d2l494sub1 loaded
L F = d2l494sub1;
P +s;
.sort
Time = 0.46 sec Generated terms = 40
F Terms in output = 40
Bytes used = 1243828
F =
+ accprf(M1^2*Q13^2*Q33^3*Q11*Q22 - 5*M1^2*Q13^3*Q33^2*Q11*Q22 + 8*M1^2
...
ufoGC112^3*M2^2*eM1*M1fromUFO^-3*cA2*tagtad*tagH
;
On names;
.end
Symbols
ufoGC11 ufoGC87 ufoGC112 ep(:1000) M1 M2 eM1 M1fromUFO Qtt cA1 cA2 Q12 Q13
Q33 tagtad tagH Q11 Q22
Commuting Functions
accprf Gamnoexp Dh mbox1l2m1m
Expressions
d2l494sub1(stored) F(local)
Time = 0.58 sec Generated terms = 40
F Terms in output = 40
Bytes used = 1243828
Is it perhaps the right expression?
This does seem to fix things; the output is correct. One can also load this file correctly by changing the SizeStoreCache setup parameter. A SizeStoreCache of 32769 (independent of the first=0 line) causes an infinite loop in FindrNumber. A value of 32233 causes a segfault with valgrind warnings in FindrNumber (also independent of the first=0 line).
For the "other" issues in the last comment, I think it should suffice to force SizeStoreCache to be a multiple of sizeof(WORD).
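As a rough sketch of what such rounding could look like (not the actual FORM code): the type form_word_t and the function round_up_to_word are hypothetical names, and a 4-byte WORD is assumed here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for WORD; a 4-byte type is an assumption of this sketch. */
typedef int32_t form_word_t;

/* Round a byte count up to the next multiple of sizeof(form_word_t).
   Values that are already a multiple are returned unchanged. */
static size_t round_up_to_word(size_t nbytes)
{
    const size_t w = sizeof(form_word_t);
    return ((nbytes + w - 1) / w) * w;
}
```

Under that assumption, the two problematic values mentioned above would become 32772 (from 32769) and 32236 (from 32233), while 32768 stays as it is.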
I've not managed to create a simple crashing example yet for the actual issue. The crashing example hits *position == s->toppos twice in a row here
https://github.com/vermaseren/form/blob/d587b0321397a9e3c16bb1c06516c62f9d078f49/sources/store.c#L1653
which exposes the issue. This needs a very specific combination of term sizes and SizeStoreCache.
Another simple workaround is to set NumStoreCaches to 0 (to disable caching).
How do you feel about committing the uncommented line (as well as the word-size rounding for SizeStoreCache)? I can't think of a particularly good way to test this change, beyond letting people test it in real calculations in the beta version for a while.
Hi everyone,
I have been running many instances of a FORM program acting on files of different dimensions. Most of them run without any issue but a handful keep crashing with the following error:
FindrNumber: n = 1, list has -1 members
Start with 0,0,-1
Renumbering problems
Called from TermRenumber
Program terminating at step2.hel.frm Line 371 -->
14259.98 sec out of 14313.00 sec
I assumed that this was a RAM allocation error, so I tried increasing the setup parameters to
#: MaxTermSize 20M
#: WorkSpace 3000M
as well as the available RAM (I am working on a cluster and I have to pre-allocate a fixed amount). With 128 GB of available memory the programs are still crashing.
Also, the programs stop inside a #do i=1,n loop at different values of i, so I would probably rule out a mistake in the code, especially because the program works the vast majority of the time.
Does anyone know what the problem might be or at least what this error is related to?
Thanks in advance for your help! Giulio