replace_ with nested CFunctions crashes

vermaseren / form

The FORM project for symbolic manipulation of very big expressions

GNU General Public License v3.0

982 stars 118 forks source link

replace_ with nested CFunctions crashes #307

Closed abehring closed 3 years ago

abehring commented 5 years ago

I've encountered a problem with some code that has been working for over 10 years when I tried upgrading to a newer version of FORM. I've boiled down the issue to the following minimal example:

S N,j1;
CF cfun1,cfun2;
V p;

L test = cfun1(cfun2(-p,-3 - j1 + N)*cfun2(p,j1)) * cfun2(p,j1);

multiply replace_(N,3);

Print;
.end

Running with the latest version from github (087a772e613db4c75293a99c2b0c4491a85303ab) compiled with gcc 7.3.0-27ubuntu1~18.04 on amd64 yields

FORM 4.2.1 (Feb  6 2019, v4.2.1-2-g087a772) 64-bits  Run: Mon Feb 11 08:02:59 2019
    S N,j1;
    CF cfun1,cfun2;
    V p;

    L test = cfun1(cfun2(-p,-3 - j1 + N)*cfun2(p,j1)) * cfun2(p,j1);

    multiply replace_(N,3);

    Print;
    .end
!!!This $ variation has not been implemented yet!!!
!!!This $ variation has not been implemented yet!!!
Called from TestSub
Program terminating at rep3.frm Line 9 -->
  0.00 sec out of 0.00 sec

I've used git bisect to find the first commit where this program fails and I found that something happens at commit 29e608ee0e2770ef7a8215214ec901088af5af46. Beginning with it I simply get the message Program terminating at rep3.frm Line 9 --> but no further messages.

The exact message ("This $ variation ...") first appears with commit 2e409bce28bf048697ef2fe9aa38f22b596d2ebd.

Interestingly, a small modification of the above program, i.e., replacing the expression by

L test = cfun1(j1,0, - 3 + N,cfun2(-p, - 3 - j1 + N)*cfun2(p,j1));

Yields the result (note the non-sensical rational prefactor):

18446744078004518933/2803905099190966943747*cfun1(j1,0,0,cfun2(-p, - j1)*cfun2(p,j1));

Other modifications lead to programs that seem to run forever (longer than I cared to wait).

Valgrind's memcheck outputs several messages along the lines of Conditional jump or move depends on uninitialised value(s), but I haven't compiled form with debug symbols yet so I cannot give exact line numbers.

tueda commented 5 years ago

Clearly, something seems to be still completely broken in replace_ with nested functions:

FORM 4.2.1 (Feb  6 2019, v4.2.1-2-g087a772) 64-bits  Run: Mon Feb 11 17:30:31 2019
    CF f;
    S x;
    L F = f(f(x+1));
    P "%t";
    P "%r";
    multiply replace_(x,1);
    P "%t";
    P "%r";
    .end
 + f(f(1 + x))
30  150  26  0  23  0  21  150  17  0  14  0  4  1  1  3  8  1  4  20  1  1  1  3
  1  1  3  1  1  3
 + 2*f(0 + f(2))
20  150  16  0  13  0  2  1  9  150  5  0  -16  2  1  1  3  2  1  3

tueda commented 5 years ago

@vermaseren Is the following result of Normalize() for f(f(2*x)) * replace_(x,98) correct?

{26, 150, 22, 1, 19, 1, 17, 150, 13, 1, 10, 1, 8, 16, 4, 98, 1, 2, 1, 3, 1, 1, 3, 1, 1, 3}

What is the data format of SNUMBER = 16, i.e., 16, 4, 98, 1??

benruijl commented 3 years ago

In commit 29e608e, the modifications in proces.c caused this issue. The goto redosize; is replaced by some of the code that the goto refers to. However the following lines are not part of the new code:

t1[2] = 1;
if ( *t1 == AR.PolyFun && AR.PolyFunType == 2 )
    t1[2] |= CLEANPRF;
AT.NestPoin--;
AN.TeInFun++;
AR.TePos = 0;
AN.ncmod = oldncmod;
return(retvalue);

If I simply add return(retvalue); to the end of the new code, I get the correct result. Is it indeed correct that we should return in this branch?

I think this bug has high priority to be solved. Sadly the control flow is rather non-trivial and I don't have all the informed yet about what this code does and what the fix should be. @vermaseren can you have a look at this too?

It may be related to https://github.com/vermaseren/form/issues/316.

vermaseren commented 3 years ago

Unfortunately that is not a correct fix and breaks a fix of issue 97. It is indeed not easy there, because the code is making length adjustments of terms inside function arguments which in turn needs then length adjustments in the parent term (and possibly etc).

To be continued……

Jos

On 27 Jul 2020, at 15:37, Ben Ruijl notifications@github.com wrote:

In commit 29e608e https://github.com/vermaseren/form/commit/29e608ee0e2770ef7a8215214ec901088af5af46, the modifications in proces.c caused this issue. The goto redosize; is replaced by some of the code that the goto refers to. However the following lines are not part of the new code:
                          t1[2] = 1;
                          if ( *t1 == AR.PolyFun && AR.PolyFunType == 2 )
                              t1[2] |= CLEANPRF;
                          AT.NestPoin--;
                          AN.TeInFun++;
                          AR.TePos = 0;
                          AN.ncmod = oldncmod;
                          return(retvalue);
If I simply add return(retvalue); to the end of the new code, I get the correct result. Is it indeed correct that we should return in this branch?

I think this bug has high priority to be solved. Sadly the control flow is rather non-trivial and I don't have all the informed yet about what this code does and what the fix should be. @vermaseren https://github.com/vermaseren can you have a look at this too?

It may be related to #316 https://github.com/vermaseren/form/issues/316.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/vermaseren/form/issues/307#issuecomment-664401494, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABJPCETX3YOWHP4YFCADE2TR5V7KHANCNFSM4GWOXBPA.

vermaseren commented 3 years ago

Should work properly now. This fix interfered with a couple of other fixes that were more or less incomplete/‘phenomenological’.

On 27 Jul 2020, at 16:28, Jos Vermaseren t68@nikhef.nl wrote:

Unfortunately that is not a correct fix and breaks a fix of issue 97. It is indeed not easy there, because the code is making length adjustments of terms inside function arguments which in turn needs then length adjustments in the parent term (and possibly etc).

To be continued……

Jos
On 27 Jul 2020, at 15:37, Ben Ruijl <notifications@github.com mailto:notifications@github.com> wrote:

In commit 29e608e https://github.com/vermaseren/form/commit/29e608ee0e2770ef7a8215214ec901088af5af46, the modifications in proces.c caused this issue. The goto redosize; is replaced by some of the code that the goto refers to. However the following lines are not part of the new code:
                         t1[2] = 1;
                         if ( *t1 == AR.PolyFun && AR.PolyFunType == 2 )
                             t1[2] |= CLEANPRF;
                         AT.NestPoin--;
                         AN.TeInFun++;
                         AR.TePos = 0;
                         AN.ncmod = oldncmod;
                         return(retvalue);
If I simply add return(retvalue); to the end of the new code, I get the correct result. Is it indeed correct that we should return in this branch?

I think this bug has high priority to be solved. Sadly the control flow is rather non-trivial and I don't have all the informed yet about what this code does and what the fix should be. @vermaseren https://github.com/vermaseren can you have a look at this too?

It may be related to #316 https://github.com/vermaseren/form/issues/316.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/vermaseren/form/issues/307#issuecomment-664401494, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABJPCETX3YOWHP4YFCADE2TR5V7KHANCNFSM4GWOXBPA.

tueda commented 3 years ago

A test for this issue would be something like this (not merged yet). [Travis log].

Then, 2 questions:

Do you still see some problems? Or can we close it?
What was done in b2def9dfa95eb8784aaf4fd55e6fd28a46ede5c2? I mean, I can see the comment, but do you have any example that didn't work and fixed by it? It would be nice if I could put also such an example to the test. (But, it is somehow already covered.)

vermaseren commented 3 years ago

On that last ‘fix’ I did not construct a failing example. It was just that the code was missing out one step which might occur if you really work at it. I figured that we could wait for someone to come up with an example with sufficiently deep nested functions, but then I would have to start all over again locating what was wrong etc. I had not anticipated the example that needed the first fix, and it took years for someone to make it mess up. Hence, I decided not to wait.

On 11 Aug 2020, at 11:29, Takahiro Ueda notifications@github.com wrote:

A test for this issue would be something like this https://github.com/tueda/form/commit/68c69db9d77bbf946692562387c0187277d3385a (not merged yet). [Travis log https://travis-ci.org/github/tueda/form/builds/716839998].

Then, 2 questions:

Do you still see some problems? Or can we close it? What was done in b2def9d https://github.com/vermaseren/form/commit/b2def9dfa95eb8784aaf4fd55e6fd28a46ede5c2? I mean, I can see the comment, but do you have any example that didn't work and fixed by it? It would be nice if I could put also such an example to the test. (But, it is somehow already covered https://coveralls.io/builds/32648389/source?filename=sources/proces.c#L1553.) — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/vermaseren/form/issues/307#issuecomment-671837978, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABJPCEUW2ZH5LQ7JALBMR2DSAEFQ5ANCNFSM4GWOXBPA.

jodavies commented 3 years ago

It is interesting that coveralls shows that new code is used... perhaps one can intentionally cause a crash there and see which example fails?

vermaseren commented 3 years ago

It is not quite that simple. The new code pops the nesting stack nd what it ‘repairs’ is what would happen if the nesting stack would be popped again at a later stage in the same call. If on the other hand the popping would take place in the originating call it is a different story. The original crash was because there was the need for two pops in the same call and the second did not do the first at the same time. By doing them now separately the problem has been solved. There are cases in which the original code caused no problem (apparently most cases) and the crashing case was not quite forseen. But if you can construct an example in which the second fix really is needed, it would be quite good.

On 11 Aug 2020, at 13:23, jodavies notifications@github.com wrote:

It is interesting that coveralls shows that new code is used... perhaps one can intentionally cause a crash there and see which example fails?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/vermaseren/form/issues/307#issuecomment-671887963, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABJPCERGV7WPMRPGZGCBI5LSAES35ANCNFSM4GWOXBPA.

tueda commented 3 years ago

The new code was covered by #105. Well, maybe reasonable... it contains replace_ with functions...

log

``` Check /home/tueda/work/form/sources/vorm FORM 4.2.1 (Aug 11 2020, v4.2.1-26-g68c69db-dirty) 64-bits Run: Tue Aug 11 20:32:25 2020 Loaded suite ./check Started .............................. =============================================================================== Issue105 (fixes.frm:918) FAILED =============================================================================== FORM 4.2.1 (Aug 11 2020, v4.2.1-26-g68c69db-dirty) 64-bits Run: Tue Aug 11 20:32:31 2020 * Crash by replace_(x,0) S x; V p; CF f; L F = f(p.p+x); L G = f(p.p*x); multiply replace_(x,0); P; .end Program terminating at 1.frm Line 10 --> =============================================================================== F ==================================================================================================================================== 926: 927: 928: => 929: def test_Issue105; do_test { 930: assert(succeeded?, 'Failed for succeeded?') 931: assert result("F") =~ expr("f(p.p)") 932: assert result("G") =~ expr("f(0)") /tmp/form_check_20200811-9866-dh3tm4/fixes.rb:929:in `test_Issue105' ./check.rb:272:in `do_test' ./check.rb:272:in `times' ./check.rb:277:in `block in do_test' Failure: test_Issue105(Test_Issue105): timeout (= 10 sec) in 1.frm of Issue105 (fixes.frm:918). is not true. ==================================================================================================================================== .................................................................................................................................... ................................................. Finished in 33.0536936 seconds. ------------------------------------------------------------------------------------------------------------------------------------ 212 tests, 794 assertions, 1 failures, 0 errors, 0 pendings, 0 omissions, 0 notifications 99.5283% passed ------------------------------------------------------------------------------------------------------------------------------------ 6.41 tests/s, 24.02 assertions/s ```

OK, it seems difficult to construct a failing example. So, anyway I can just commit the test case and then I would close this issue for now.