vermaseren / form

The FORM project for symbolic manipulation of very big expressions
GNU General Public License v3.0
1.13k stars 135 forks source link

Unexpected code in ReNumber #211

Open jodavies opened 7 years ago

jodavies commented 7 years ago

Hello,

I was writing a small script to see to what extent I could increase the test coverage in sort.c and compress.c. I have something like the following:

#-

#: TermsInSmall 128
#: LargePatches 16
#: FilePatches 4
*#: SubTermsInSmall 64
#: SubLargePatches 8
#: SubFilePatches 2

Off Statistics;

CFunction f,g;
Symbol x,y;

* 128*16=2048 terms cause a sort of the large buffer to disk.
* multiples of 2048*4=8192 terms cause a stage 4 sort
#define NTERMS "40001"

#define ARGNTERMS "2001"

Local test1 = <f(1)>+...+<f(`NTERMS')>;
Local test2 = g(<f(1)>+...+<f(`ARGNTERMS')>);
.sort

* Cancel all terms, but keep distance so that most terms only cancel in the final sort
Identify f(x?) = f(x) - f(`NTERMS'-x+1);
Argument g;
    Identify f(x?) = f(x) - f(`ARGNTERMS'-x+1);
EndArgument;

Print +s;
.end

If one uncomments the #: SubTermsInSmall 64 setting, tform crashes with Unexpected code in ReNumber. form does not crash, but test2 contains many f(...) terms that are not inside g.

Thanks, Josh

jodavies commented 7 years ago
==7336== Conditional jump or move depends on uninitialised value(s)
==7336==    at 0x5095C5: TestSub (proces.c:1622)
==7336==    by 0x501D66: Generator (proces.c:3051)
==7336==    by 0x503B9E: Generator (proces.c:3762)
==7336==    by 0x5040BA: Generator (proces.c:3930)
==7336==    by 0x50ACE9: Processor (proces.c:404)
==7336==    by 0x44ABB5: DoExecute (execute.c:838)
==7336==    by 0x4690CF: ExecModule (module.c:274)
==7336==    by 0x4F836E: PreProcessor (pre.c:962)
==7336==    by 0x544884: main (startup.c:1605)
tueda commented 7 years ago

It must be an old bug. I got many f's with FORM 4.1 (Oct 25 2013).

vermaseren commented 7 years ago

I’ll go after it. Specially with the crash (with 201 instead of 2001).

Jos

On 30 jun. 2017, at 17:12, Takahiro Ueda notifications@github.com wrote:

It must be an old bug. I got many f's with FORM 4.1 (Oct 25 2013).

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/vermaseren/form/issues/211#issuecomment-312294232, or mute the thread https://github.com/notifications/unsubscribe-auth/AFLxEvvcV9C8YRTsSYarHFUENpEZ9tVSks5sJRBagaJpZM4OKqxS.

jodavies commented 7 years ago

Great, that seems to work.

Now, if I add to the first module:

#$test3dol = <f(1)>+...+<f(`NTERMS')>;

and to the second module:

#inside $test3dol
    Identify f(x?) = f(x) - f(`NTERMS'-x+1);
#endinside

form has no problems, but tform -w4 crashes about 60% of the time, with various errors:

Unexpected code in ReNumber
Program terminating in thread 1 at sorting.frm Line 33 --> 
0FillInputGZIP: Read error during uncompressed sort.
0++Reading 4295000 bytes at position    1366592
PutIn: We have RetCode = FFFFFFFFFFFFFFFF while reading 418958 bytes
Program terminating in thread 0 at sorting.frm Line 23 --> 
Program terminating in thread 4 at sorting.frm Line 23 --> 

EDIT: For this, at some point before the crash, valgrind gives


==18016== Syscall param read(buf) points to unaddressable byte(s)
==18016==    at 0x5B8551D: ??? (syscall-template.S:84)
==18016==    by 0x520956: read (unistd.h:44)
==18016==    by 0x520956: Uread (unixfile.c:134)
==18016==    by 0x50F375: ReadFile (tools.c:1158)
==18016==    by 0x42D7A5: FillInputGZIP (compress.c:694)
==18016==    by 0x4ED173: PutIn (sort.c:1260)
==18016==    by 0x4F153F: MergePatches (sort.c:3656)
==18016==    by 0x4F3798: EndSort (sort.c:1066)
==18016==    by 0x430DA8: CatchDollar (dollar.c:117)
==18016==    by 0x4BB230: PreProcessor (pre.c:1050)
==18016==    by 0x4F88A0: main (startup.c:1605)
==18016==  Address 0x671d52c0 is 0 bytes after a block of size 2,560,640 alloc'd
==18016==    at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==18016==    by 0x510762: Malloc1 (tools.c:2236)
==18016==    by 0x5111E0: TermMallocAddMemory (tools.c:2470)
==18016==    by 0x45C57B: Normalize (normal.c:220)
==18016==    by 0x4C24AF: Generator (proces.c:3064)
==18016==    by 0x430D61: CatchDollar (dollar.c:112)
==18016==    by 0x4BB230: PreProcessor (pre.c:1050)
==18016==    by 0x4F88A0: main (startup.c:1605)
tueda commented 7 years ago

Running valgrind tvorm -w2 for

#:FilePatches 4
#:SubTermsInSmall 64
#:SubLargePatches 8
#:SubFilePatches 2
CF f;
#define NTERMS "5000"
#$x = <f(1)>+...+<f(`NTERMS')>;
.end

gives me

StageSort in thread 0
StageSort in thread 0
==11421== Syscall param read(buf) points to unaddressable byte(s)
==11421==    at 0x32E1A0E82D: ??? (in /lib64/libpthread-2.12.so)
==11421==    by 0x51C1B3: Uread (unixfile.c:134)
==11421==    by 0x50B031: ReadFile (tools.c:1158)
==11421==    by 0x42CF56: FillInputGZIP (compress.c:694)
==11421==    by 0x4E95DB: PutIn (sort.c:1260)
==11421==    by 0x4ED90D: MergePatches (sort.c:3656)
==11421==    by 0x4EFB3C: EndSort (sort.c:1066)
==11421==    by 0x4304D2: CatchDollar (dollar.c:117)
==11421==    by 0x4B8301: PreProcessor (pre.c:1050)
==11421==    by 0x4F4A43: main (startup.c:1605)
==11421==  Address 0x37e9c640 is not stack'd, malloc'd or (recently) free'd
==11421==
0FillInputGZIP: Read error during uncompressed sort.
0++Reading 4295000 bytes at position     106280
PutIn: We have RetCode = FFFFFFFFFFFFFFFF while reading 418958 bytes
Program terminating in thread 0 at test.frm Line 7 -->
==11421== Invalid read of size 4
==11421==    at 0x50F5F8: Crash (tools.c:3711)
==11421==    by 0x4F4070: Terminate (startup.c:1707)
==11421==    by 0x4E9618: PutIn (sort.c:1266)
==11421==    by 0x4ED90D: MergePatches (sort.c:3656)
==11421==    by 0x4EFB3C: EndSort (sort.c:1066)
==11421==    by 0x4304D2: CatchDollar (dollar.c:117)
==11421==    by 0x4B8301: PreProcessor (pre.c:1050)
==11421==    by 0x4F4A43: main (startup.c:1605)
==11421==  Address 0x0 is not stack'd, malloc'd or (recently) free'd

Now 100% it occurs (though it is not in a worker).

vermaseren commented 7 years ago

Looks quite useful. I spent a few hours yesterday on this stuff and the main conclusion was that there was a counting error in the number of patches, because somewhere an end of expression seems to not have been seen, unless a patch was written over another patch instead of its intended spot. But with expressions this size it is not easy to find what causes it. In the example, it all played in a single worker. Same with the example that you give. The dollar expands completely inside one worker.

Jos

On 1 jul. 2017, at 21:39, Takahiro Ueda notifications@github.com wrote:

Running valgrind tvorm -w2 for

:FilePatches 4

:SubTermsInSmall 64

:SubLargePatches 8

:SubFilePatches 2

CF f;

define NTERMS "5000"

$x = <f(1)>+...+<f(`NTERMS')>;

.end gives me

StageSort in thread 0 StageSort in thread 0 ==11421== Syscall param read(buf) points to unaddressable byte(s) ==11421== at 0x32E1A0E82D: ??? (in /lib64/libpthread-2.12.so) ==11421== by 0x51C1B3: Uread (unixfile.c:134) ==11421== by 0x50B031: ReadFile (tools.c:1158) ==11421== by 0x42CF56: FillInputGZIP (compress.c:694) ==11421== by 0x4E95DB: PutIn (sort.c:1260) ==11421== by 0x4ED90D: MergePatches (sort.c:3656) ==11421== by 0x4EFB3C: EndSort (sort.c:1066) ==11421== by 0x4304D2: CatchDollar (dollar.c:117) ==11421== by 0x4B8301: PreProcessor (pre.c:1050) ==11421== by 0x4F4A43: main (startup.c:1605) ==11421== Address 0x37e9c640 is not stack'd, malloc'd or (recently) free'd ==11421== 0FillInputGZIP: Read error during uncompressed sort. 0++Reading 4295000 bytes at position 106280 PutIn: We have RetCode = FFFFFFFFFFFFFFFF while reading 418958 bytes Program terminating in thread 0 at test.frm Line 7 --> ==11421== Invalid read of size 4 ==11421== at 0x50F5F8: Crash (tools.c:3711) ==11421== by 0x4F4070: Terminate (startup.c:1707) ==11421== by 0x4E9618: PutIn (sort.c:1266) ==11421== by 0x4ED90D: MergePatches (sort.c:3656) ==11421== by 0x4EFB3C: EndSort (sort.c:1066) ==11421== by 0x4304D2: CatchDollar (dollar.c:117) ==11421== by 0x4B8301: PreProcessor (pre.c:1050) ==11421== by 0x4F4A43: main (startup.c:1605) ==11421== Address 0x0 is not stack'd, malloc'd or (recently) free'd Now 100% it occurs (though it is not in a worker).

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/vermaseren/form/issues/211#issuecomment-312451751, or mute the thread https://github.com/notifications/unsubscribe-auth/AFLxEtCiO7VcmLqXwdn5Sm8HNEc_NLuAks5sJqBXgaJpZM4OKqxS.

tueda commented 7 years ago

The minimum seems to be 129 terms,

TFORM 4.1 (Jun 30 2017, v4.1-20131025-382-gc169281) 64-bits 0 workers  Run: Sat Jul  1 23:46:29 2017
    #:FilePatches 4
    #:SubTermsInSmall 16
    #:SubLargePatches 1
    #:SubFilePatches 4
    S x;
    CF f;
    #define NTERMS "129"
    L F = f(<x^1>+...+<x^`NTERMS'>);
    ModuleOption noparallel;
    .end
StageSort in thread 0
StageSort in thread 0
==12127== Syscall param read(buf) points to unaddressable byte(s)
==12127==    at 0x32E1A0E810: __read_nocancel (in /lib64/libpthread-2.12.so)
==12127==    by 0x51C1B3: Uread (unixfile.c:134)
==12127==    by 0x50B031: ReadFile (tools.c:1158)
==12127==    by 0x42CF56: FillInputGZIP (compress.c:694)
==12127==    by 0x4E95DB: PutIn (sort.c:1260)

but unfortunately it doesn't crash in spite of the memory bug.

tueda commented 7 years ago

Two things:

In the following log, is file->POsize of file->handle=4 OK??

TFORM 4.1 (Jul  1 2017, v4.1-20131025-382-gc169281-dirty) 64-bits 0 workers  Run: Sun Jul  2 00:31:33 2017
    #:MaxTermSize 600
    #:FilePatches 4
    #:SubTermsInSmall 16
    #:SubLargePatches 1
    #:SubFilePatches 4
    S x;
    CF f;
    #define NTERMS "65"
    L F = f(<x^1>+...+<x^`NTERMS'>);
    ModuleOption noparallel;
    .end
StageSort in thread 0
[PutIn: file->handle=3 file->POsize=298800]
[PutIn: file->handle=3 file->POsize=298800]
[PutIn: file->handle=3 file->POsize=298800]
[PutIn: file->handle=3 file->POsize=298800]
[PutIn: file->handle=4 file->POsize=4373800]
[PutIn: file->handle=4 file->POsize=4373800]
==30515== Syscall param read(buf) points to unaddressable byte(s)
==30515==    at 0x32E1A0E810: __read_nocancel (in /lib64/libpthread-2.12.so)
==30515==    by 0x51A799: Uread (unixfile.c:134)
==30515==    by 0x509618: ReadFile (tools.c:1158)
==30515==    by 0x4E831B: PutIn (sort.c:1273)
==30515==    by 0x4EC2A3: MergePatches (sort.c:3658)
==30515==    by 0x4EE13D: EndSort (sort.c:1066)
==30515==    by 0x4C477F: InFunction (proces.c:2057)
==30515==    by 0x4BFA19: Generator (proces.c:3759)
==30515==    by 0x4C00A0: Generator (proces.c:3931)
==30515==    by 0x4C1AA3: Processor (proces.c:404)
==30515==    by 0x43A4E8: DoExecute (execute.c:838)
==30515==    by 0x451866: ExecModule (module.c:274)
==30515==  Address 0x1cb453d8 is 0 bytes after a block of size 5,100,440 alloc'd
==30515==    at 0x4A05A57: malloc (vg_replace_malloc.c:299)
==30515==    by 0x50A950: Malloc1 (tools.c:2236)
==30515==    by 0x4E4EA2: AllocSort (setfile.c:901)
==30515==    by 0x4E814D: NewSort (sort.c:619)
==30515==    by 0x4C46DB: InFunction (proces.c:2033)
==30515==    by 0x4BFA19: Generator (proces.c:3759)
==30515==    by 0x4C00A0: Generator (proces.c:3931)
==30515==    by 0x4C1AA3: Processor (proces.c:404)
==30515==    by 0x43A4E8: DoExecute (execute.c:838)
==30515==    by 0x451866: ExecModule (module.c:274)
==30515==    by 0x4B6D9B: PreProcessor (pre.c:962)
==30515==    by 0x4F302A: main (startup.c:1605)

The corresponding vorm log:

FORM 4.1 (Jul  1 2017, v4.1-20131025-382-gc169281-dirty) 64-bits  Run: Sun Jul  2 00:38:11 2017
    #:MaxTermSize 600
    #:FilePatches 4
    #:SubTermsInSmall 16
    #:SubLargePatches 1
    #:SubFilePatches 4
    S x;
    CF f;
    #define NTERMS "65"
    L F = f(<x^1>+...+<x^`NTERMS'>);
    ModuleOption noparallel;
    .end
StageSort
[PutIn: file->handle=3 file->POsize=298800]
[PutIn: file->handle=3 file->POsize=298800]
[PutIn: file->handle=3 file->POsize=298800]
[PutIn: file->handle=3 file->POsize=298800]
[PutIn: file->handle=4 file->POsize=100000]
[PutIn: file->handle=4 file->POsize=100000]
StageSort
[PutIn: file->handle=3 file->POsize=298800]
[PutIn: file->handle=3 file->POsize=298800]
[PutIn: file->handle=3 file->POsize=298800]
[PutIn: file->handle=3 file->POsize=298800]
[PutIn: file->handle=4 file->POsize=100000]
[PutIn: file->handle=4 file->POsize=100000]
tueda commented 7 years ago

The first example causes memory leaks: https://travis-ci.org/tueda/form/jobs/249641044#L4542. (Should we reopen the issue?)

vermaseren commented 7 years ago

I guess yes. I am still looking at the “GZIP…..” and sometimes it crashes reading, sometimes the reading goes ‘fine’ but it crashes in ReNumber. I have also put a print somewhere else and sometimes it crashes there. But it seems all related. The bug is extremely crazy though. It crashes in the example in the first module, but only if the second module is present and has code in it. ?????????

Jos

On 3 jul. 2017, at 16:42, Takahiro Ueda notifications@github.com wrote:

The first example causes memory leaks: https://travis-ci.org/tueda/form/jobs/249641044#L4542 https://travis-ci.org/tueda/form/jobs/249641044#L4542. (Should we reopen the issue?)

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/vermaseren/form/issues/211#issuecomment-312663605, or mute the thread https://github.com/notifications/unsubscribe-auth/AFLxEn-jvxiiLDtMTeNkh0GK-inVs1eoks5sKP3PgaJpZM4OKqxS.

tueda commented 7 years ago

Here is another spam of valgrind log (vorm) with increasing SortIOSize:

FORM 4.1 (Jul  3 2017, v4.1-20131025-383-gd8ae7a6-dirty) 64-bits  Run: Mon Jul  3 21:05:02 2017
    #:SortIOSize 10M
    #:MaxTermSize 600
    #:SubTermsInSmall 16
    #:SubLargePatches 1
    #:SubFilePatches 2
    S x;
    CF f;
    #define NTERMS "65"
    L F = f(<x^1>+...+<x^`NTERMS'>);
    P;
    .end
StageSort
[PutIn: file->handle=3 file->POsize=298800]
[PutIn: file->handle=3 file->POsize=298800]
[PutIn: file->handle=3 file->POsize=298800]
[PutIn: file->handle=3 file->POsize=298800]
[PutIn: file->handle=4 file->POsize=10000000]
==5619== Syscall param read(buf) points to unaddressable byte(s)
==5619==    at 0x32E12DB670: __read_nocancel (in /lib64/libc-2.12.so)
==5619==    by 0x50E343: Uread (unixfile.c:134)
==5619==    by 0x4FE08C: ReadFile (tools.c:1158)
==5619==    by 0x42B0B6: FillInputGZIP (compress.c:694)
==5619==    by 0x4DDD98: PutIn (sort.c:1260)
==5619==    by 0x4E1AA0: MergePatches (sort.c:3656)
==5619==    by 0x4E391C: EndSort (sort.c:1066)
==5619==    by 0x4BC6EA: InFunction (proces.c:2057)
==5619==    by 0x4B8007: Generator (proces.c:3759)
==5619==    by 0x4B8463: Generator (proces.c:3931)
==5619==    by 0x4B9BE3: Processor (proces.c:404)
==5619==    by 0x43762D: DoExecute (execute.c:838)
==5619==  Address 0x15469408 is 0 bytes after a block of size 5,100,488 alloc'd
==5619==    at 0x4A05A57: malloc (vg_replace_malloc.c:299)
==5619==    by 0x4FF227: Malloc1 (tools.c:2236)
==5619==    by 0x4DB0B4: AllocSort (setfile.c:901)
==5619==    by 0x4DDBF0: NewSort (sort.c:619)
==5619==    by 0x4BC64C: InFunction (proces.c:2033)
==5619==    by 0x4B8007: Generator (proces.c:3759)
==5619==    by 0x4B8463: Generator (proces.c:3931)
==5619==    by 0x4B9BE3: Processor (proces.c:404)
==5619==    by 0x43762D: DoExecute (execute.c:838)
==5619==    by 0x44DAAA: ExecModule (module.c:274)
==5619==    by 0x4AFB2D: PreProcessor (pre.c:962)
==5619==    by 0x4E8747: main (startup.c:1605)
==5619== 
[PutIn: file->handle=4 file->POsize=10000000]

It is very strange that PutIn() is called with file->POsize=10000000, i.e., SortIOSize. Then ReadFile() overflows from the buffer allocated by AllocSort for the function argument (only 5100488 bytes, valgrind says). GDB shows it is allocated with AllocSort (LargeSize=4000000, SmallSize=500000, SmallEsize=800000, TermsInSmall=16, MaxPatches=1, MaxFpatches=4, IOsize=32768). For TFORM, instead of SortIOSize, the value calculated from AM.S0->file.POsize is used (see threads.c:541).

vermaseren commented 7 years ago

What this seems to indicate that somehow the function/dollar POsize is, under certain circumstances, borrowed from the level 0 one. I’ll see what I can find.

Jos

On 3 jul. 2017, at 21:20, Takahiro Ueda notifications@github.com wrote:

Here is another spam of valgrind log (vorm) with increasing SortIOSize:

FORM 4.1 (Jul 3 2017, v4.1-20131025-383-gd8ae7a6-dirty) 64-bits Run: Mon Jul 3 21:05:02 2017

:SortIOSize 10M

#:MaxTermSize 600
#:SubTermsInSmall 16
#:SubLargePatches 1
#:SubFilePatches 2
S x;
CF f;
#define NTERMS "65"
L F = f(<x^1>+...+<x^`NTERMS'>);
P;
.end

StageSort [PutIn: file->handle=3 file->POsize=298800] [PutIn: file->handle=3 file->POsize=298800] [PutIn: file->handle=3 file->POsize=298800] [PutIn: file->handle=3 file->POsize=298800] [PutIn: file->handle=4 file->POsize=10000000] ==5619== Syscall param read(buf) points to unaddressable byte(s) ==5619== at 0x32E12DB670: __read_nocancel (in /lib64/libc-2.12.so) ==5619== by 0x50E343: Uread (unixfile.c:134) ==5619== by 0x4FE08C: ReadFile (tools.c:1158) ==5619== by 0x42B0B6: FillInputGZIP (compress.c:694) ==5619== by 0x4DDD98: PutIn (sort.c:1260) ==5619== by 0x4E1AA0: MergePatches (sort.c:3656) ==5619== by 0x4E391C: EndSort (sort.c:1066) ==5619== by 0x4BC6EA: InFunction (proces.c:2057) ==5619== by 0x4B8007: Generator (proces.c:3759) ==5619== by 0x4B8463: Generator (proces.c:3931) ==5619== by 0x4B9BE3: Processor (proces.c:404) ==5619== by 0x43762D: DoExecute (execute.c:838) ==5619== Address 0x15469408 is 0 bytes after a block of size 5,100,488 alloc'd ==5619== at 0x4A05A57: malloc (vg_replace_malloc.c:299) ==5619== by 0x4FF227: Malloc1 (tools.c:2236) ==5619== by 0x4DB0B4: AllocSort (setfile.c:901) ==5619== by 0x4DDBF0: NewSort (sort.c:619) ==5619== by 0x4BC64C: InFunction (proces.c:2033) ==5619== by 0x4B8007: Generator (proces.c:3759) ==5619== by 0x4B8463: Generator (proces.c:3931) ==5619== by 0x4B9BE3: Processor (proces.c:404) ==5619== by 0x43762D: DoExecute (execute.c:838) ==5619== by 0x44DAAA: ExecModule (module.c:274) ==5619== by 0x4AFB2D: PreProcessor (pre.c:962) ==5619== by 0x4E8747: main (startup.c:1605) ==5619== [PutIn: file->handle=4 file->POsize=10000000] It is very strange that PutIn() is called with file->POsize=10000000, i.e., SortIOSize. Then ReadFile() overflows from the buffer allocated by AllocSort for the function argument (only 5100488 bytes, valgrind says). GDB shows it is allocated with AllocSort (LargeSize=4000000, SmallSize=500000, SmallEsize=800000, TermsInSmall=16, MaxPatches=1, MaxFpatches=4, IOsize=32768). For TFORM, instead of SortIOSize, the value calculated from AM.S0->file.POsize is used (see threads.c:541 https://github.com/vermaseren/form/blob/d8ae7a65e86d13024332389730a5abe790be83e8/sources/threads.c#L541-L547).

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/vermaseren/form/issues/211#issuecomment-312717808, or mute the thread https://github.com/notifications/unsubscribe-auth/AFLxEtTYxpUTKU9_4ZYMnVwWj82FZLSgks5sKT7tgaJpZM4OKqxS.

tueda commented 7 years ago

Actually, the problematic PutIn() is called with AR.FoStage4[0]:

Breakpoint 1, PutIn (file=file@entry=0x797d30 <A+7664>, position=0x7fff4e4ba520, buffer=0x7fff4e4bb8e0,
    take=take@entry=0x7fffffffcf40, npat=npat@entry=0) at sort.c:1242

(gdb) p &AR.FoStage4[0]
$14 = (FILEHANDLE *) 0x797d30 <A+7664>
jodavies commented 7 years ago

If you enable gzipdebug do you see the same pattern of events as in https://github.com/vermaseren/form/issues/95#issuecomment-308990237 ?

vermaseren commented 7 years ago

Yes, it reads from the s4b.0 file which involves staged sorting. Maybe (just maybe) there is just one setting for the staged sort because it can only occur one at a time. But the writing and the other buffers belong to the higher level sorting (functions/$’s). I am investigating those possibilities.

Jos

On 3 jul. 2017, at 21:30, Takahiro Ueda notifications@github.com wrote:

Actually, the problematic PutIn() is called with AR.FoStage4[0]:

Breakpoint 1, PutIn (file=file@entry=0x797d30 <A+7664>, position=0x7fff4e4ba520, buffer=0x7fff4e4bb8e0, take=take@entry=0x7fffffffcf40, npat=npat@entry=0) at sort.c:1242

(gdb) p &AR.FoStage4[0] $14 = (FILEHANDLE *) 0x797d30 <A+7664> — You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/vermaseren/form/issues/211#issuecomment-312719344, or mute the thread https://github.com/notifications/unsubscribe-auth/AFLxEt9bvCCAQuJcLrk5MGQ-KM4fdqKQks5sKUFdgaJpZM4OKqxS.

tueda commented 7 years ago

@jodavies No, I can see only ++Reading 795200 bytes at position 2116 and similar lines.

tueda commented 7 years ago

OK. I didn't enable GZIPDEBUG in sort.c. But except that I can get Valgrind errors, it looks fine...

vermaseren commented 7 years ago

OK, I think I start understanding what happens. At sorting level 0 we have big buffers and we have AR.FoStage4[] with those settings. The settings for functions and dollars are considerably smaller. In the dollar FORM works with those smaller settings, because it was not envisioned that you send functions and dollars into stage4. This would be outrageously inefficient because each object like that would need file accesses. Keep in mind that this was created originally just for functions. Before version 3 the dollars did not exist. Now, the first three stages in the sort for the functions/dollars take place with the smaller settings, but now, when the stage4 enters, it looks like we get a funny mixture of buffer sizes which can cause the reading to think that it needs to read more bytes than available on the file (not really a problem) into a buffer that may be too small for all the stuff requested. This can cause various types of problems. One can be a read error. Alternatively there can be another buffer after this and if it reads a patch from the middle of the file it will read too many bytes and overwrite this other buffer. The rest is mayhem.

This is how far I understand it at the moment. I have to think carefully how to give each higher level a somewhat different set of buffers.

For the moment (good strategy anyway): Stay away from stage4 in dollars and functions.

Jos

On 3 jul. 2017, at 21:30, Takahiro Ueda notifications@github.com wrote:

Actually, the problematic PutIn() is called with AR.FoStage4[0]:

Breakpoint 1, PutIn (file=file@entry=0x797d30 <A+7664>, position=0x7fff4e4ba520, buffer=0x7fff4e4bb8e0, take=take@entry=0x7fffffffcf40, npat=npat@entry=0) at sort.c:1242

(gdb) p &AR.FoStage4[0] $14 = (FILEHANDLE *) 0x797d30 <A+7664> — You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/vermaseren/form/issues/211#issuecomment-312719344, or mute the thread https://github.com/notifications/unsubscribe-auth/AFLxEt9bvCCAQuJcLrk5MGQ-KM4fdqKQks5sKUFdgaJpZM4OKqxS.

tueda commented 5 years ago

With the current master branch a53fd8fed037a4b8978b6e1b07f86451810fb7ab, I locally (Ubuntu on Windows, Xenial, Homebrew gcc 5.5.0_4) have a TFORM problem:

cd checking
./check.rb ../sources/tform Issue211

always leading to

StageSort in thread 1
1FillInputGZIP: Read error during uncompressed sort.
1++Reading 131170000 bytes at position          0
PutIn: We have RetCode = FFFFFFFFFFFFFFFF while reading 7D17ED0 bytes

I didn't get the same error with previous versions.

Confusingly, on another machine (CentOS 6.10, gcc Red Hat 7.3.1-5), this error occurs relatively rarely.

The error did not occur on Travis CI: https://travis-ci.org/vermaseren/form/jobs/436825672, and it did fail on Circle CI once: https://circleci.com/gh/tueda/form/156.

Does anyone also hit the same gzip error with the test Issue211?

tueda commented 5 years ago

Ah, maybe this is related to the fact that TFORM now consumes a crazy amount of memory even if it performs nothing (~ 8GB for tform -w4) due to the increases in fsizes.h. FORM requires big machines.

tueda commented 5 years ago

The change of memory usage for doing nothing:

form: 391M -> 2226M
tform -w4: 2001M -> 7400M

FORM is now really heavy software in a sense of memory allocation. This may explain why the 32-bit test never succeeds (over the 4GB limit).

tueda commented 5 years ago

A valgrind dump:

==17684== Memcheck, a memory error detector
==17684== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==17684== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==17684== Command: ./tvorm -w4 1.frm
==17684== 
==17684== error calling PR_SET_PTRACER, vgdb might block
==17684== Warning: set address range perms: large range [0x395d8040, 0xbe5a8a78) (undefined)
==17684== Warning: set address range perms: large range [0xbe5a9040, 0xe585a8f8) (undefined)
==17684== Warning: set address range perms: large range [0xe585b040, 0x10cb0c8f8) (undefined)
==17684== Warning: set address range perms: large range [0x10cb0d040, 0x133dbe8f8) (undefined)
==17684== Warning: set address range perms: large range [0x133dbf040, 0x15b0708f8) (undefined)
==17684== Warning: set address range perms: large range [0x15b071040, 0x178d47840) (undefined)
==17684== Warning: set address range perms: large range [0x178d48040, 0x196a1e840) (undefined)
==17684== Warning: set address range perms: large range [0x196a1f040, 0x1b46f5840) (undefined)
TFORM 4.2.0 (Oct  4 2018, v4.2.0-69-g766ae40) 64-bits 4 workers  Run: Thu Oct  4 22:10:32 2018
    #: TermsInSmall 128
    #: LargePatches 16
    #: FilePatches 4
    #: SubTermsInSmall 64
    #: SubLargePatches 8
    #: SubFilePatches 2

    Off stats;
    CF f,g;
    S x,y;

    #define NTERMS "40001"
    #define ARGNTERMS "2001"

    L F = g(<f(1)>+...+<f(`ARGNTERMS')>);
    .sort

    id f(x?) = f(x) - f(`NTERMS'-x+1);
    argument g;
      id f(x?) = f(x) - f(`ARGNTERMS'-x+1);
    endargument;
    .end
StageSort in thread 1
==17684== Warning: set address range perms: large range [0x1d01b5040, 0x1ef614b80) (undefined)
==17684== Warning: set address range perms: large range [0x1ef615040, 0x20ea74b80) (undefined)
==17684== Thread 2:
==17684== Syscall param read(buf) points to unaddressable byte(s)
==17684==    at 0x5BB751D: ??? (syscall-template.S:84)
==17684==    by 0x523411: Uread (unixfile.c:134)
==17684==    by 0x5119D2: ReadFile (tools.c:1158)
==17684==    by 0x42DCFA: FillInputGZIP (compress.c:708)
==17684==    by 0x4EFCD7: PutIn (sort.c:1276)
==17684==    by 0x4F4064: MergePatches (sort.c:3734)
==17684==    by 0x4F61FD: EndSort (sort.c:1074)
==17684==    by 0x409838: execarg (argument.c:680)
==17684==    by 0x4C5EA4: Generator (proces.c:3421)
==17684==    by 0x5255D2: RunThread (threads.c:1386)
==17684==    by 0x5BAE6B9: start_thread (pthread_create.c:333)
==17684==    by 0x5ECB41C: clone (clone.S:109)
==17684==  Address 0x37967970 is 0 bytes after a block of size 9,443,632 alloc'd
==17684==    at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==17684==    by 0x512D3F: Malloc1 (tools.c:2250)
==17684==    by 0x4EC81E: AllocSort (setfile.c:904)
==17684==    by 0x4EFB30: NewSort (sort.c:624)
==17684==    by 0x4CBD93: InFunction (proces.c:2066)
==17684==    by 0x4C6E28: Generator (proces.c:3815)
==17684==    by 0x4C74D2: Generator (proces.c:3987)
==17684==    by 0x5255D2: RunThread (threads.c:1386)
==17684==    by 0x5BAE6B9: start_thread (pthread_create.c:333)
==17684==    by 0x5ECB41C: clone (clone.S:109)
==17684== 
1FillInputGZIP: Read error during uncompressed sort.
1++Reading 131170000 bytes at position          0
PutIn: We have RetCode = FFFFFFFFFFFFFFFF while reading 7D17ED0 bytes
Program terminating in thread 1 at 1.frm Line 21 --> 
==17684== Invalid read of size 4
==17684==    at 0x51607A: Crash (tools.c:3767)
==17684==    by 0x4FA9EF: Terminate (startup.c:1721)
==17684==    by 0x4EFD14: PutIn (sort.c:1282)
==17684==    by 0x4F4064: MergePatches (sort.c:3734)
==17684==    by 0x4F61FD: EndSort (sort.c:1074)
==17684==    by 0x409838: execarg (argument.c:680)
==17684==    by 0x4C5EA4: Generator (proces.c:3421)
==17684==    by 0x5255D2: RunThread (threads.c:1386)
==17684==    by 0x5BAE6B9: start_thread (pthread_create.c:333)
==17684==    by 0x5ECB41C: clone (clone.S:109)
==17684==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==17684== 
  0.75 sec + 0.25 sec: 1.00 sec out of 0.96 sec
==17684== 
==17684== HEAP SUMMARY:
==17684==     in use at exit: 8,627,974,160 bytes in 338 blocks
==17684==   total heap usage: 417 allocs, 79 frees, 8,647,379,520 bytes allocated
==17684== 
==17684== LEAK SUMMARY:
==17684==    definitely lost: 0 bytes in 0 blocks
==17684==    indirectly lost: 0 bytes in 0 blocks
==17684==      possibly lost: 288 bytes in 1 blocks
==17684==    still reachable: 8,627,973,872 bytes in 337 blocks
==17684==         suppressed: 0 bytes in 0 blocks
==17684== Rerun with --leak-check=full to see details of leaked memory
==17684== 
==17684== For counts of detected and suppressed errors, rerun with: -v
==17684== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)
tueda commented 5 years ago

Another dump, the same problem and same environment, but different errors:

==17751== Memcheck, a memory error detector
==17751== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==17751== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==17751== Command: ./tvorm -w4 1.frm
==17751== 
==17751== error calling PR_SET_PTRACER, vgdb might block
==17751== Warning: set address range perms: large range [0x395d8040, 0xbe5a8a78) (undefined)
==17751== Warning: set address range perms: large range [0xbe5a9040, 0xe585a8f8) (undefined)
==17751== Warning: set address range perms: large range [0xe585b040, 0x10cb0c8f8) (undefined)
==17751== Warning: set address range perms: large range [0x10cb0d040, 0x133dbe8f8) (undefined)
==17751== Warning: set address range perms: large range [0x133dbf040, 0x15b0708f8) (undefined)
==17751== Warning: set address range perms: large range [0x15b071040, 0x178d47840) (undefined)
==17751== Warning: set address range perms: large range [0x178d48040, 0x196a1e840) (undefined)
==17751== Warning: set address range perms: large range [0x196a1f040, 0x1b46f5840) (undefined)
TFORM 4.2.0 (Oct  4 2018, v4.2.0-69-g766ae40) 64-bits 4 workers  Run: Thu Oct  4 22:13:09 2018
    #: TermsInSmall 128
    #: LargePatches 16
    #: FilePatches 4
    #: SubTermsInSmall 64
    #: SubLargePatches 8
    #: SubFilePatches 2

    Off stats;
    CF f,g;
    S x,y;

    #define NTERMS "40001"
    #define ARGNTERMS "2001"

    L F = g(<f(1)>+...+<f(`ARGNTERMS')>);
    .sort

    id f(x?) = f(x) - f(`NTERMS'-x+1);
    argument g;
      id f(x?) = f(x) - f(`ARGNTERMS'-x+1);
    endargument;
    .end
==17751== Thread 2:
==17751== Conditional jump or move depends on uninitialised value(s)
==17751==    at 0x4F357F: SplitMerge (sort.c:3251)
==17751==    by 0x4F35DD: SplitMerge (sort.c:3257)
==17751==    by 0x4F35DD: SplitMerge (sort.c:3257)
==17751==    by 0x4F35DD: SplitMerge (sort.c:3257)
==17751==    by 0x4F35F5: SplitMerge (sort.c:3258)
==17751==    by 0x4F35DD: SplitMerge (sort.c:3257)
==17751==    by 0x4F5310: StoreTerm (sort.c:4357)
==17751==    by 0x4C5732: Generator (proces.c:3190)
==17751==    by 0x4C74D2: Generator (proces.c:3987)
==17751==    by 0x40973C: execarg (argument.c:658)
==17751==    by 0x4C5EA4: Generator (proces.c:3421)
==17751==    by 0x5255D2: RunThread (threads.c:1386)
==17751== 
==17751== Conditional jump or move depends on uninitialised value(s)
==17751==    at 0x4F3893: SplitMerge (sort.c:3329)
==17751==    by 0x4F35DD: SplitMerge (sort.c:3257)
==17751==    by 0x4F5310: StoreTerm (sort.c:4357)
==17751==    by 0x4C5732: Generator (proces.c:3190)
==17751==    by 0x4C74D2: Generator (proces.c:3987)
==17751==    by 0x40973C: execarg (argument.c:658)
==17751==    by 0x4C5EA4: Generator (proces.c:3421)
==17751==    by 0x5255D2: RunThread (threads.c:1386)
==17751==    by 0x5BAE6B9: start_thread (pthread_create.c:333)
==17751==    by 0x5ECB41C: clone (clone.S:109)
==17751== 
==17751== Conditional jump or move depends on uninitialised value(s)
==17751==    at 0x4F3893: SplitMerge (sort.c:3329)
==17751==    by 0x4F5310: StoreTerm (sort.c:4357)
==17751==    by 0x4C5732: Generator (proces.c:3190)
==17751==    by 0x4C74D2: Generator (proces.c:3987)
==17751==    by 0x40973C: execarg (argument.c:658)
==17751==    by 0x4C5EA4: Generator (proces.c:3421)
==17751==    by 0x5255D2: RunThread (threads.c:1386)
==17751==    by 0x5BAE6B9: start_thread (pthread_create.c:333)
==17751==    by 0x5ECB41C: clone (clone.S:109)
==17751== 
==17751== Conditional jump or move depends on uninitialised value(s)
==17751==    at 0x4F46E1: MergePatches (sort.c:3967)
==17751==    by 0x4F5413: StoreTerm (sort.c:4386)
==17751==    by 0x4C5732: Generator (proces.c:3190)
==17751==    by 0x4C74D2: Generator (proces.c:3987)
==17751==    by 0x40973C: execarg (argument.c:658)
==17751==    by 0x4C5EA4: Generator (proces.c:3421)
==17751==    by 0x5255D2: RunThread (threads.c:1386)
==17751==    by 0x5BAE6B9: start_thread (pthread_create.c:333)
==17751==    by 0x5ECB41C: clone (clone.S:109)
==17751== 
StageSort in thread 1
==17751== Warning: set address range perms: large range [0x1d0c59040, 0x1f00b8b80) (undefined)
==17751== Warning: set address range perms: large range [0x1f00b9040, 0x20f518b80) (undefined)
==17751== Syscall param read(buf) points to unaddressable byte(s)
==17751==    at 0x5BB751D: ??? (syscall-template.S:84)
==17751==    by 0x523411: Uread (unixfile.c:134)
==17751==    by 0x5119D2: ReadFile (tools.c:1158)
==17751==    by 0x42DCFA: FillInputGZIP (compress.c:708)
==17751==    by 0x4EFCD7: PutIn (sort.c:1276)
==17751==    by 0x4F4064: MergePatches (sort.c:3734)
==17751==    by 0x4F61FD: EndSort (sort.c:1074)
==17751==    by 0x409838: execarg (argument.c:680)
==17751==    by 0x4C5EA4: Generator (proces.c:3421)
==17751==    by 0x5255D2: RunThread (threads.c:1386)
==17751==    by 0x5BAE6B9: start_thread (pthread_create.c:333)
==17751==    by 0x5ECB41C: clone (clone.S:109)
==17751==  Address 0x1d0356970 is 0 bytes after a block of size 9,443,632 alloc'd
==17751==    at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==17751==    by 0x512D3F: Malloc1 (tools.c:2250)
==17751==    by 0x4EC81E: AllocSort (setfile.c:904)
==17751==    by 0x4EFB30: NewSort (sort.c:624)
==17751==    by 0x4096A3: execarg (argument.c:641)
==17751==    by 0x4C5EA4: Generator (proces.c:3421)
==17751==    by 0x5255D2: RunThread (threads.c:1386)
==17751==    by 0x5BAE6B9: start_thread (pthread_create.c:333)
==17751==    by 0x5ECB41C: clone (clone.S:109)
==17751== 
  0.20 sec + 1.50 sec: 1.70 sec out of 1.67 sec
==17751== 
==17751== HEAP SUMMARY:
==17751==     in use at exit: 8,649,290,592 bytes in 340 blocks
==17751==   total heap usage: 421 allocs, 81 frees, 8,659,384,544 bytes allocated
==17751== 
==17751== LEAK SUMMARY:
==17751==    definitely lost: 0 bytes in 0 blocks
==17751==    indirectly lost: 0 bytes in 0 blocks
==17751==      possibly lost: 0 bytes in 0 blocks
==17751==    still reachable: 8,649,290,592 bytes in 340 blocks
==17751==         suppressed: 0 bytes in 0 blocks
==17751== Rerun with --leak-check=full to see details of leaked memory
==17751== 
==17751== For counts of detected and suppressed errors, rerun with: -v
==17751== Use --track-origins=yes to see where uninitialised values come from
==17751== ERROR SUMMARY: 49 errors from 5 contexts (suppressed: 0 from 0)
tueda commented 5 years ago

I modified the test suite script such that it can realises Syscall param read(buf) points to unaddressable byte(s) in the above is a memory bug. The test on Travis CI now fails: https://travis-ci.org/vermaseren/form/builds/437423864.

The "Conditional jump or move depends on uninitialised value(s)" of S->PolyWise are easily fixable as Jos suggested (private communication).

The "pointing to unaddressable bytes" looks a real and more serious buffer-overflow bug.

ahamaline commented 4 months ago

Reporting that I got an "Unexpected code in ReNumber" crash. I was running TFORM 5.0.0-beta.1 (Mar 15 2024, v5.0.0-beta.1-42-g2663e14) 127 workers. The crash occurred after a very long sequence of StageSorts within one module; an average of 440 for each of the 127 threads.

Is this likely to be the same bug you guys were working on here, and if so what's a good strategy to avoid it? I guess I anyway need a much bigger LargePatches setting to get a sane runtime. Will this get rid of the crash as well?

jodavies commented 4 months ago

As I understand, it is hard to say: the Unexpected code in ReNumber error was probably due to messed up internal state due to some wrong buffer sizes used somewhere.

Are you able to provide an example of your input and form script in this case? Is it working with a lot of large function arguments or dollar variables? These days form is supposed to terminate with an error, if a function arg or dollar sort reaches stage4 sorting.

As a first tip, I would say that if you have any other parallelization opportunities (over diagrams, for eg) you should run more concurrent form jobs with fewer threads. The performance does not scale so well up to 128 threads (this statement does depend heavily on what exactly the form script does, though) and with fewer threads, each thread gets more memory space and you shouldn't stagesort as often or at all. Increasing your buffer sizes to avoid stagesorts will also help.

ahamaline commented 4 months ago

If I'm not mistaken, the module in question is just one line: id tenredgg(?mu) = dd_(?mu); So yes, function arguments are involved. Unfortunately I don't understand this part of the code so well, so I don't know what those arguments look like in my case.

jodavies commented 4 months ago

It seems like you are working at the "ground level" though, so the stagesorts are caused by a large number of generated terms, and your crash shouldn't be related to sub-buffers.

Are you able to provide the expression just before this module, if it is not too big already, or at least a sample of the terms?

ahamaline commented 4 months ago

I did not print any terms before the crash, but I do have a recovery file. Is that useful to you?

jodavies commented 4 months ago

I don't think so, without all of the scripts which come with it. Did it take a long time to arrive at this module (i.e., can you just add a print statement in the module before, and re-run) ?

ahamaline commented 4 months ago

It did take a couple of hours to reach that point, and there are already ~50M terms before the module.

But looking at the code, I think the arguments of tenredgg are a list of indices, perhaps up to 12 of them, that are taken from powers of a loop momentum. The dd_ then sets up all of the contractions of pairs of those indices, creating dotproducts between external momenta and polarization vectors. Does that tell you what you wanted to know?

jodavies commented 4 months ago

Without seeing the whole terms I can't tell what the d_ are going to end up contracting with etc, so I it seems unlikely that your crash can be reproduced without more information.

To continue investigating, I would add a print before the crashing module and simply re-run the code with the same threads/memory settings etc. Then you can see if you can reproduce the crash yourself, and at the same time we can see what the expression looks like exactly.

vermaseren commented 4 months ago

Advise: Put a bracket statement in tenredgg.

On 8 Apr 2024, at 17:25, jodavies @.***> wrote:

Without seeing the whole terms I can't tell what the d_ are going to end up contracting with etc, so I it seems unlikely that your crash can be reproduced without more information.

To continue investigating, I would add a print before the crashing module and simply re-run the code with the same threads/memory settings etc. Then you can see if you can reproduce the crash yourself, and at the same time we can see what the expression looks like exactly.

— Reply to this email directly, view it on GitHub https://github.com/vermaseren/form/issues/211#issuecomment-2043048788, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABJPCEQMA72C4IYOOY64NMLY4KZFFAVCNFSM4DRKVRJKU5DIOJSWCZC7NNSXTN2JONZXKZKDN5WW2ZLOOQ5TEMBUGMYDIOBXHA4A. You are receiving this because you modified the open/close state.

ahamaline commented 4 months ago

Advise: Put a bracket statement in tenredgg. Can you explain more? Is this to help avoid the crash, or to investigate it?

vermaseren commented 4 months ago

It is for the investigation. I have two methods: 1: B tenredgg; Print; .sort id tenredgg(?mu) = dd(?mu); 2: This is for when the crash occurs before the final sort: Print "%t”; id tenredgg(?mu) = dd(?mu); This last one would give the offending term as the last term printed.

If tenredgg occurs more than once in some terms, the best would be to do just one of them and then sort, and then one again etc.

On 8 Apr 2024, at 17:56, ahamaline @.***> wrote:

Advise: Put a bracket statement in tenredgg. Can you explain more? Is this to help avoid the crash, or to investigate it?

— Reply to this email directly, view it on GitHub https://github.com/vermaseren/form/issues/211#issuecomment-2043118309, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABJPCEUSKGASOJU6P7ZFHJ3Y4K42BAVCNFSM4DRKVRJKU5DIOJSWCZC7NNSXTN2JONZXKZKDN5WW2ZLOOQ5TEMBUGMYTCOBTGA4Q. You are receiving this because you modified the open/close state.

ahamaline commented 4 months ago

OK, I'm running it now with the first of those options.

ahamaline commented 4 months ago

So I got to the printout but not yet to the crash. Here is an example term: outside the brackets we have tenredgg[mu1,mu2], and within the brackets one of the terms is 2/( - 1 + D)*Etens[p2,mu2]* Etens[p2,nu97]*Etens[p1,mu1]*Etens[p1,nu99]*Mom[p2,nu2]*Mom[p1,nu1]^2 *Mom[p1,nu2]*Mom[p1,nu97]*Mom[p1,nu99]*Mom[p1,mudotp1]^2*Mom[p1, mudotp2]^2*prop[k1,0]^4*den[4,2]*(k1.k1)^2*sSS^10*D*kappa^2*mh^(-6) Of course the more problematic terms are those where tenredgg has many indices, but I can't seem to scroll far enough to get to those... The idea should be the same; each index appears exactly twice in either tenredgg, Mom, or Evect. There are always 4 powers of Evect but the powers of Mom vary.

'Mom[p2,nu2]' is really just a way to write p2[nu2], chosen specifically to avoid having the vector contract automatically.

ahamaline commented 4 months ago

Ok, I found an example term with sixteen indices in tenredgg (in the printout before the module): tenredgg[mu1,mu2,mu4,nu1,nu99,mudotp1,mudotp2,mudotp3,mudotp4,mudotp5 ,mudotp6,mudotp7,mudotp8,mudotp9,mudotp10,mudotp11] * ( 16384/( - 1 + D) /( - 1 + D)*Etens[p2,mu2]*Etens[p2,mu4]*Etens[p1,mu1]*Etens[p1,nu99]* Mom[p1,nu1]*Mom[p1,mudotp1]*Mom[p1,mudotp2]*Mom[p1,mudotp3]*Mom[p1, mudotp4]*Mom[p1,mudotp5]*Mom[p1,mudotp6]*Mom[p1,mudotp7]*Mom[p1, mudotp8]*Mom[p1,mudotp9]*Mom[p1,mudotp10]*Mom[p1,mudotp11]*prop[k1,0] ^13*den[92897280,169906176]*(k1.k1)^11*sSS^12*kappa^2*mh^(-8) +...)

The code has been running for 17.5 hours now and has neither crashed nor finished the module.

jodavies commented 4 months ago

If you resolve the Mom into actual scalar products where you have repeated indices you will get a much smaller result, but I don't know what the rest of your code does and it is not really relevant to the fact that you got the crash.

I imagine this is not going to be easily reproducible. Possibly with a larger portion of your input expression and artificially small buffer settings or something like that.

Do you have the log file from the crashed run?

vermaseren commented 4 months ago

Another thing that may influence the running: You have 1/(-1+D). This denominator treatment is notoriously weak and it is better to define your own denominator function as in CF Den; and then use Den(-1+D). That way it is also easier to have statements like repeat id D*Den(-1+D) = 1+Den(-1+D); or more general code as you can find in the second and third lecture in https://www.nikhef.nl/~form/maindir/courses/uam2022/uam2022.html Although not likely, it is not inconceivable that the error is related to this, because I cannot see what else you do with these denominators.

On 9 Apr 2024, at 14:23, jodavies @.***> wrote:

If you resolve the Mom into actual scalar products where you have repeated indices you will get a much smaller result, but I don't know what the rest of your code does and it is not really relevant to the fact that you got the crash.

I imagine this is not going to be easily reproducible. Possibly with a larger portion of your input expression and artificially small buffer settings or something like that.

Do you have the log file from the crashed run?

— Reply to this email directly, view it on GitHub https://github.com/vermaseren/form/issues/211#issuecomment-2045049943, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABJPCEST3MGLWAMXBYUHBNDY4PMTPAVCNFSM4DRKVRJKU5DIOJSWCZC7NNSXTN2JONZXKZKDN5WW2ZLOOQ5TEMBUGUYDIOJZGQZQ. You are receiving this because you modified the open/close state.

ahamaline commented 4 months ago

Yes, I agree that the code is absurdly inefficient for my use case. I did not write it and I don't understand the reasons for the authors' choices, so changing things is scary, but it looks like the savings would be worthwhile. The large number of terms is mainly due to the combinatorial explosion of index configurations, which would be avoided if we used either vectors or dummy indices.

I did not save the log file; it did not have much that seemed informative. Just 56,000 StageSort reports from the various threads, followed by the error.

Thank you for the tip about avoiding denominators. But how could the denominator be related to a crash that occurs in the dd_ replacement?

I will add that the error must not be deterministic, since this run has already continued for many more StageSorts than the one that crashed.