ggambuti commented 1 year ago

Hi everyone,

I have been running many instances of a FORM program acting on files of different dimensions. Most of them run without any issue but a handful keep crashing with the following error:

FindrNumber: n = 1, list has -1 members
Start with 0,0,-1
Renumbering problems
Called from TermRenumber
Program terminating at step2.hel.frm Line 371 -->
  14259.98 sec out of 14313.00 sec

I assumed that this was a RAM allocation error, so I tried increasing the setup parameters to

#: MaxTermSize 20M
#: WorkSpace 3000M

as well as the available RAM (I am working on a cluster and have to pre-allocate a fixed amount). Even with 128 GB of available memory, the programs still crash.

Also, the programs stop inside a #do i=1,n loop at different values of i, so I would tentatively rule out a mistake in the code, especially because the program works the vast majority of the time.

Does anyone know what the problem might be or at least what this error is related to?

Thanks in advance for your help! Giulio

tueda commented 1 year ago

Thank you for your report!

The FindrNumber function looks like a subroutine that performs a binary search. It is called from many places in the function TermRenumber, with different buffers given as the second argument. The error message indicates that the information stored in the given buffer is somehow wrong (the routine tried to search a group with -1 members, which sounds like a logic bug). To see which buffer has the problem, we need a stack backtrace.

Would it be possible for you to provide us with a minimal reproducible example?

Edit: although the routine avoids "the common bug of binary searches", it uses WORDDIF, as in med = hi - ((WORDDIF(hi,med))/2), which downcasts a 64-bit pointer difference to a 32-bit integer. This may be a problem when the buffer space for the search is extremely large (> 2^32).
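The narrowing concern above can be illustrated with a minimal sketch. It works on 64-bit integer offsets rather than real pointers, and `word32`, `midpoint_narrow` and `midpoint_wide` are hypothetical names introduced only for this illustration; it assumes that WORD is 32 bits wide and is not the actual FindrNumber code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a 32-bit WORD; an assumption of this sketch. */
typedef int32_t word32;

/* Midpoint computed after squeezing the distance into 32 bits.
 * The cast keeps only the low 32 bits of the distance; for values that
 * do not fit in int32_t the final conversion is implementation-defined. */
static int64_t midpoint_narrow(int64_t lo, int64_t hi)
{
    assert(lo <= hi);
    word32 diff = (word32)(uint32_t)(hi - lo);
    return hi - diff / 2;
}

/* Midpoint computed with the full-width distance. */
static int64_t midpoint_wide(int64_t lo, int64_t hi)
{
    assert(lo <= hi);
    int64_t diff = hi - lo;
    return hi - diff / 2;
}
```

For small ranges the two functions agree. Once the distance exceeds what 32 bits can hold, the narrowed version returns a point that is no longer in the middle of the range, so a binary search built on it may not converge on the right element.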

ggambuti commented 1 year ago

Hi,

thanks for the help!

Unfortunately, I cannot reproduce this outside of the programs that I am running at the moment. These require many files and procedures to run, and only hit the bug after several hours of runtime.

I will try to follow your suggestion and reduce the buffer space by splitting up the computation more than I was already doing. Hopefully this way I can avoid the problem.

vermaseren commented 1 year ago

Question: Are you running Form, TForm or ParForm? There is a slight difference in how they deal with the renumbering.

ggambuti commented 1 year ago

I am running standard FORM, with no parallelization.

jodavies commented 1 year ago

@ggambuti are you loading stored expressions here?

We found a very similar looking problem when loading store files from a particular diagram. Most of the diagrams run through with no problems, and a select few have a problem.

I attach some files which reproduce this issue with form 4.0, 4.1, 4.2, 4.3.

If I put a subset of the loaded expressions into my test expression, things work OK. Only if I sum all four, or 1, 3 and 4, do I get the crash.

form-crash.tar.gz (https://github.com/vermaseren/form/files/10726170/form-crash.tar.gz)

vermaseren commented 1 year ago

I have not had the time to look at it in the debugger yet. But here is a workaround:

#do i = 1,4
  Load d2l1279sub`i'.res;
  Local test`i' = d2l1279sub`i';
  .sort:`i';
  Delete storage;
  Hide test`i';
  .sort:`i'-2;
#enddo
.sort
Local test = test1+...+test4;

There is clearly a problem when loading several files simultaneously. This is rather complicated code though and I’ll have to take some time for it.

jodavies commented 1 year ago

Hi Jos,

Thanks for this workaround suggestion; it should suffice for our code in the meantime.

Thanks, Josh.

jodavies commented 8 months ago

Hi Jos,

Now we've come across the issue when loading just a single save file, so the workaround no longer works.

If I run the following, with bug.res attached,

on names;
load bug.res;
local ex = d2l494sub1;
.end

FORM gives

TFORM 4.3.0 (Dec 23 2022, v4.3.0-10-g6cc038c) 64-bits 12 workers  Run: Wed Oct 25 11:57:48 2023
    on names;
    load bug.res;
 d2l494sub1 loaded
    local ex = d2l494sub1;
    .end

 Symbols
   ufoGC11 ufoGC87 ufoGC112 ep(:1000) M1 M2 eM1 M1fromUFO Qtt cA1 cA2 Q12 Q13
   Q33 tagtad tagH Q11 Q22
 Commuting Functions
   accprf Gamnoexp Dh mbox1l2m1m
 Expressions
   d2l494sub1(stored) ex(local)
FindrNumber: n = 14, list has 17 members
v->lo[17] = 1192
v->lo[16] = 1188
v->lo[15] = 1183
v->lo[14] = 1179
v->lo[13] = 1175
v->lo[12] = 1137
v->lo[11] = 1136
v->lo[10] = 1114
v->lo[9] = 1113
v->lo[8] = 1112
v->lo[7] = 1110
v->lo[6] = 162
v->lo[5] = 157
v->lo[4] = 156
v->lo[3] = 148
v->lo[2] = 131
v->lo[1] = 106
v->lo[0] = 30
Start with 0,9,17
New: 0,4,9, *med = 156
New: 0,2,4, *med = 131
New: 0,1,2, *med = 106
New: 0,0,1, *med = 30
Renumbering problems
Called from TermRenumber
Program terminating in thread 10 at form-test.frm Line 3 -->
Max. space for expressions:                   0 bytes
  0.00 sec + 0.00 sec: 0.00 sec out of 0.00 sec

This can be worked around by changing the saved expression slightly, for example by multiplying it by some additional symbol. But I hope this example helps with debugging if you get a chance.

Thanks, Josh.

bug.zip
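One way to read the trace above is sketched below, under the assumption that n is the number being looked up and that the v->lo[] entries form the sorted list being searched. Each step in the trace moves the upper bound down because *med is larger than 14, and the search ends at the smallest entry, 30, without a match. The code is a plain textbook binary search, not FORM's FindrNumber; `find_number` and `trace_list` are hypothetical names for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* The 18 values v->lo[0..17] printed in the trace, in ascending order. */
static const int32_t trace_list[18] = {
    30, 106, 131, 148, 156, 157, 162, 1110, 1112,
    1113, 1114, 1136, 1137, 1175, 1179, 1183, 1188, 1192
};

/* Return the index of key in the sorted list, or -1 if it is absent.
 * A non-positive count (such as the -1 members reported earlier in this
 * thread) is rejected up front instead of being searched. */
static int find_number(const int32_t *list, int count, int32_t key)
{
    if (list == NULL || count <= 0) return -1;
    int lo = 0, hi = count - 1;
    while (lo <= hi) {
        int med = lo + (hi - lo) / 2;   /* cannot overflow for valid indices */
        if (list[med] == key) return med;
        if (list[med] < key) lo = med + 1;
        else hi = med - 1;
    }
    return -1;
}
```

With this list, looking up 14 fails because every entry is larger, which would be consistent with the "Renumbering problems" message, while a value that is present, such as 156, is found at index 4.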

tueda commented 2 months ago

If I enable an interesting line that is currently commented out, first = 0 in GetFromStore():

https://github.com/vermaseren/form/blob/d587b0321397a9e3c16bb1c06516c62f9d078f49/sources/store.c#L1682-L1688

then FORM doesn't crash and loads something (though I'm not sure what first indicates):

Load bug.res;
L F = d2l494sub1;
P +s;
.sort
On names;
.end

FORM 5.0.0-beta.1 (Apr  4 2024, v5.0.0-beta.1-47-gd587b03-dirty)  Run: Thu Apr  4 14:26:04 2024
    Load bug.res;
 d2l494sub1 loaded
    L F = d2l494sub1;
    P +s;
    .sort

Time =       0.46 sec    Generated terms =         40
               F         Terms in output =         40
                         Bytes used      =    1243828

   F =
       + accprf(M1^2*Q13^2*Q33^3*Q11*Q22 - 5*M1^2*Q13^3*Q33^2*Q11*Q22 + 8*M1^2
      ...
      ufoGC112^3*M2^2*eM1*M1fromUFO^-3*cA2*tagtad*tagH
      ;

    On names;
    .end

 Symbols
   ufoGC11 ufoGC87 ufoGC112 ep(:1000) M1 M2 eM1 M1fromUFO Qtt cA1 cA2 Q12 Q13 
   Q33 tagtad tagH Q11 Q22
 Commuting Functions
   accprf Gamnoexp Dh mbox1l2m1m
 Expressions
   d2l494sub1(stored) F(local)

Time =       0.58 sec    Generated terms =         40
               F         Terms in output =         40
                         Bytes used      =    1243828

Is it perhaps the right expression?

jodavies commented 2 months ago

This does seem to fix things; the output is correct. One can also load this file correctly by changing the SizeStoreCache setup parameter. A SizeStoreCache of 32769 (independent of the first=0 line) causes an infinite loop in FindrNumber. A value of 32233 causes a segfault with valgrind warnings in FindrNumber (also independent of the first=0 line).

#484 looks to be fixed by the uncommenting as well.

jodavies commented 2 months ago

For the "other" issues in the last comment, I think it should suffice to force SizeStoreCache to be a multiple of sizeof(WORD).
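The rounding suggested here might look like the following minimal sketch. It assumes a 4-byte WORD (modelled by the hypothetical `word32` type) and rounds up rather than down, which is a choice made only for this illustration; `round_up_to_word` is not a function in the FORM sources.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for FORM's WORD type; assumed to be 4 bytes here. */
typedef int32_t word32;

/* Round a byte count up to the next multiple of sizeof(word32).
 * Overflow for values close to SIZE_MAX is not handled in this sketch. */
static size_t round_up_to_word(size_t nbytes)
{
    const size_t w = sizeof(word32);
    return (nbytes + w - 1) / w * w;
}
```

Under these assumptions, the two problematic values from the previous comment, 32769 and 32233, would become 32772 and 32236, while a value that is already a multiple of the word size is left unchanged.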

I've not managed to create a simple crashing example yet for the actual issue. The crashing example hits *position == s->toppos twice in a row here https://github.com/vermaseren/form/blob/d587b0321397a9e3c16bb1c06516c62f9d078f49/sources/store.c#L1653 which exposes the issue. This needs a very specific combination of term sizes and SizeStoreCache.

tueda commented 2 months ago

Another simple workaround is to set NumStoreCaches to 0 (to disable caching).

jodavies commented 1 month ago

How do you feel about committing the uncommented line (as well as the word-size rounding for SizeStoreCache)? I can't think of a particularly good way to test this change, beyond letting people test it in real calculations in the beta version for a while.

vermaseren / form

FindrNumber/TermRenumber related crash #420
