typelead / eta

The Eta Programming Language, a dialect of Haskell on the JVM
https://eta-lang.org
BSD 3-Clause "New" or "Revised" License
2.61k stars 141 forks source link

Optimization produces incorrect results #603

Closed rahulmutt closed 6 years ago

rahulmutt commented 6 years ago

@jarekratajski reported this:

fibbtcoinner 0 sum presum  = sum
fibbtcoinner n sum presum = fibbtcoinner  (n-1) (sum + presum) sum

fibbtco n = fibbtcoinner n 1 0

fibbnaive  0 = 1
fibbnaive  1 = 1
fibbnaive  n = fibbnaive ( n-1) + fibbnaive ( n - 2)

main = do
        putStrLn $ show $ fibbtco <$> arr
        putStrLn $ show $ fibbnaive <$> arr
        where
arr = [0..10]

produces

[1,1,2,4,8,16,32,64,128,256,512]
[1,1,2,3,5,8,13,21,34,55,89]

when it should produce:

[1,1,2,3,5,8,13,21,34,55,89]
[1,1,2,3,5,8,13,21,34,55,89]

Reproduced on v0.7.

rahulmutt commented 6 years ago

More info: https://gist.github.com/jarekratajski/8b28fa61bde683e4bfd6352c9068a6ec

This may be related to #478.

jarekratajski commented 6 years ago

To be more precise: with -O0 problem prevails. Correct behavior only with -O0 AND BangPatterns params. I am just looking at code of eta compiler... do you have any hint where to look at first?

jarekratajski commented 6 years ago

While analyzing decompiled code :

     public final Closure apply6(StgContext var1, Closure var2, Closure var3, Closure var4, Closure var5, 
             Closure var6, Closure var7) {
            Object var9 = null; //i guess n
            Object var10 = null; //i guess sum
            Object var11 = null; //i guess presum
            boolean var8 = false;

            while(true) {
                while(var8) {
                    Main.sat_s7YH var12 = new Main.sat_s7YH(var3);
                    var1.R1 = var2;
                    Closure var13 = Classes.zeze().enter(var1).apply2(var1, (Closure)var9, var12);
                    if (!(var13 instanceof False)) {
                        return ((Closure)var10).evaluate(var1);
                    }

                    Main.sat_s7YM var14 = new Main.sat_s7YM(var4, (Closure)var10, (Closure)var11); //this must be sum + presum
                    Main.sat_s7YL var15 = new Main.sat_s7YL(var3, (Closure)var9); // this must be n-1
                    var9 = var15;
                    var10 = var14; // here is the error IMO 
                    var11 = var14; // 
                  /* this would be correct
                    var11 = var10; 
                    var10 = var14; 

                     */
                    var8 = true;
                }

                var9 = var5;
                var10 = var6;
                var11 = var7;
                var8 = true;
            }
        }

I am only guessing... but this brought me to idea of swapping order of params :

   fibbtcoinner 0 presum sum  = sum
   fibbtcoinner n presum sum = fibbtcoinner  (n-1) sum (sum + presum)

   fibbtco n = fibbtcoinner n 0 1

and now this code gives correct results also in O2. Hope this helps.

rahulmutt commented 6 years ago

Thanks for taking a look! So swapping the order of params fixes it. We still need to investigate why the original problem happened.

The way to debug stuff like this is: 1) Check that the optimized intermediate code (in this case, STG) does what's expected. To check this, you can:

2) Now if everything looks fine at the STG level, the next thing to look at is what you did above - the generated Java code. We have to make sure that the translation is happening correctly - most bugs come out of a mismatch between STG & the generated code, and most of the time the issue is with the generated code.

jarekratajski commented 6 years ago

Disclaimer: I am total newbie to Haskell and GHC and even compilers. Sorry. But looking at STG I think it is proper

Fibb.fibbtcoinner
  :: forall a_a7U9 a1_a7UA.
     (GHC.Classes.Eq a_a7U9, GHC.Num.Num a_a7U9, GHC.Num.Num a1_a7UA) =>
     a_a7U9 -> a1_a7UA -> a1_a7UA -> a1_a7UA
[GblId,
 Arity=6,
 Caf=NoCafRefs,
 Str=<S(C(C(S))L),U(C(C1(U)),A)><L,U(A,C(C1(U)),A,A,A,A,C(U))><L,U(C(C1(U)),A,A,A,A,A,A)><L,U><L,U><L,U>,
 Unf=OtherCon []] =
    \r srt:SRT:[] [$dEq_s8MJ
                   $dNum_s8MK
                   $dNum1_s8ML
                   eta_s8MM
                   eta1_s8MN
                   eta2_s8MO]
        let {
          lvl1_s8MP [Occ=OnceL] :: a_a7UF
          [LclId, Str=] =
              \u srt:SRT:[] []
                  GHC.Num.fromInteger $dNum_s8MK Fibb.fibbnaive3; } in
        let {
          lvl2_s8MQ [Occ=OnceL] :: a_a7UF
          [LclId, Str=] =
              \u srt:SRT:[] [] GHC.Num.fromInteger $dNum_s8MK Fibb.fibbnaive1;
        } in 
          let-no-escape {
            fibbtcoinner1_s8MR [Occ=LoopBreaker]
              :: a_a7UF -> a1_a7UG -> a1_a7UG -> a1_a7UG
            [LclId, Arity=3, Str=<L,U><L,U><L,U>, Unf=OtherCon []] =
                sat-only \r srt:SRT:[] [ds_s8MS sum_s8MT presum_s8MU]
                    case GHC.Classes.== $dEq_s8MJ ds_s8MS lvl2_s8MQ of _ [Occ=Dead] {
                      GHC.Types.False ->
                          let {
                            sat_s8MX [Occ=Once] :: a1_a7UG
                            [LclId, Str=] =
                                \u srt:SRT:[] [] GHC.Num.+ $dNum1_s8ML sum_s8MT presum_s8MU; } in
                          let {
                            sat_s8MW [Occ=Once] :: a_a7UF
                            [LclId, Str=] =
                                \u srt:SRT:[] [] GHC.Num.- $dNum_s8MK ds_s8MS lvl1_s8MP;
                          } in  fibbtcoinner1_s8MR sat_s8MW sat_s8MX sum_s8MT;
                      GHC.Types.True -> sum_s8MT;
                    };
          } in  fibbtcoinner1_s8MR eta_s8MM eta1_s8MN eta2_s8MO;

Because the call (loop?) has in fact 2 distinct variables: } in fibbtcoinner1_s8MR sat_s8MW sat_s8MX sum_s8MT; Seems that Java produced somehow messes sat_s8MX with sum_s8MT. But at least here at STG level they seem to be different.

rahulmutt commented 6 years ago

Here's the output with -O2 which is less code to look at and makes the problem a lot more clear (it's just as you pointed out):

  public static Closure main_fibbtcoinner(StgContext paramStgContext, Closure paramClosure1, Closure paramClosure2, Closure paramClosure3)
  {
    for (;;)
    {
      Type.eqIntegerzh(paramStgContext, paramClosure1, main6());
      int i = paramStgContext.I1;
      if (i == 1) {
        break;
      }
      Closure localClosure1 = Type.plusInteger(paramStgContext, paramClosure2, paramClosure3);
      Closure localClosure2 = Type.minusInteger(paramStgContext, paramClosure1, main5());
      paramClosure1 = localClosure2;
      paramClosure2 = localClosure1;
      paramClosure3 = paramClosure2;
    }
    return paramClosure2.evaluate(paramStgContext);
  }
rahulmutt commented 6 years ago

This problem is extremely subtle which is why we didn't encounter this yet. It is more likely to happen with very small and simple recursive functions which don't transform all their arguments when making a recursive call.

The problem is that last line paramClosure3 = paramClosure2 which corresponds to passing the sum parameter in the recursive call in a different position. The issue is that this needs to be re-ordered. A topological sort must be done to ensure that we don't set the bindings in incorrect order.

It should be done like this:

      paramClosure3 = paramClosure2;
      paramClosure1 = localClosure2;
      paramClosure2 = localClosure1;

or even this:

      paramClosure1 = localClosure2;
      paramClosure3 = paramClosure2;
      paramClosure2 = localClosure1;

You made nice observations for not looking at this stuff before :)

jarekratajski commented 6 years ago

Does it have to do something with?

-- TODO: Verify that this is valid in all cases,
--       otherwise fall back on the strongly connected components
--       algorithm a la GHC
multiAssign :: [CgLoc] -> [Code] -> Code
multiAssign locs codes = fold $ zipWith storeLoc locs codes

in Layout.sh

rahulmutt commented 6 years ago

Yes exactly that :) It turns out that #478 has the same problem so that will also be fixed if we fix this.

The tricky part here is that we can't really inspect inside a Code data type since it just has a monadic action that generates it when supplied with a constant pool and some other parameters. This lack of introspection actually cripples our ability to do bytecode optimization, so we really need to create an API to work with the direct bytecode AST instead.

We'll need to expose a simple analysis in codec-jvm that runs the Instr monad and inspects the resulting ByteString for the *load instructions and returns a [Int] - all the slots that were loaded from in the code. Would you be interested in giving this a try?

rahulmutt commented 6 years ago

For reference, this is GHC's code to handle the problem: https://github.com/ghc/ghc/blob/master/compiler/codeGen/StgCmmUtils.hs#L375-L444

jarekratajski commented 6 years ago

I understand the problem and more or less get the code. I will try - but as for now I am too weak to really solve it. So please do not count on me too much.

rahulmutt commented 6 years ago

Sure no worries. Give it a try and share any attempts here.

rahulmutt commented 6 years ago

@jarekratajski I was taking a look at multiAssign usages and it turns out that both code locations allow you to access CgLoc (Code-generation locations) where they are currently sending in Code. CgLoc's you can at least pattern match on so it would be easier to construct the topological sort.

jarekratajski commented 6 years ago

I did small trial where I fix only one call of multiAssign and it seems to work. There are some dirty places there. The fix also does not solve the problem of potential Cycles. I've created another sample with cycle and it is still broken. I will continue working on this.

rahulmutt commented 6 years ago

Ok great. I gave some feedback on your work so far - great start! Please put your test cases in tests/basic/multiAssign/. We don't run those tests yet, but we will eventually and it'll be great to keep track of them.

rahulmutt commented 6 years ago

@jarekratajski Do you have a timeline for the fix for this? Ideally, I'd like to see a fix by the end of the week since this is a pretty important bug. If your current schedule does not allow for that, please create a PR with the work you've done so far and I'll finish up the remaining bits.

jarekratajski commented 6 years ago

@rahulmutt I will have some time in next 48 hours for that - so hopefully I can do it or at least push something better.

rahulmutt commented 6 years ago

@jarekratajski Great, thanks for the update.

jarekratajski commented 6 years ago

@rahulmutt I need some help. I've tried to port 1 by 1 iines from GHC with Digraph and it does not build packages anymore.

And I do not know how to "debug" it efficiently.

I suspect I've mixed somehow "stores" with "loads" in Layout.hs unscramble or I did not get how CodeGen monad works Line 161 - 162 in Expr.hs ... are not codes generated by emitMultiAssign lost?

If I only knew how to print debugs during build process sensibly maybe I would find out.

rahulmutt commented 6 years ago

@jarekratajski Can you show me a diff of your changes? I can take a look at where you are and see if I can help you forward.

Are you getting compile-time errors or run-time errors? I can help out with debugging methods if you're specific on what kind of info you need.

jarekratajski commented 6 years ago

moment... maybe I did very stupid thing and compiler even says me what code is as ion https://github.com/jarekratajski/eta/commit/3ac051d667644dc83cd1afa9712cb41495af4e80 basically I was preparing PR there

rahulmutt commented 6 years ago

I noticed that you changed the version number from 0.7.0 to 0.0.9 - that may be causing problems. The build relies on the fact that all the version numbers are consistent. What is happening when you run ./cleaninstall.sh?

jarekratajski commented 6 years ago

What I need is to compile eta - without recompiling base packages. I've "screweed" something but I need to test it on a single small file. Otherwise I wil not find out.

I did a stupid pull request to easily track changes.

jarekratajski commented 6 years ago

Version number was changed before - that is now fixed.

jarekratajski commented 6 years ago

cleaninstall.sh results in


uilding library for ghc-prim-0.4.0.0..
[1 of 5] Compiling GHC.Types        ( GHC/Types.hs, dist/build/GHC/Types.jar )
[2 of 5] Compiling GHC.Tuple        ( GHC/Tuple.hs, dist/build/GHC/Tuple.jar )
[3 of 5] Compiling GHC.CString      ( GHC/CString.hs, dist/build/GHC/CString.jar )
eta: panic! (the 'impossible' happened)
  (Eta version 0.7.0b1):
        Prelude.head: empty list

Must be that I somehow do not emit correct assigns now.

jarekratajski commented 6 years ago

Or after last change in Layout.hs I get now. [4 of 5] Compiling GHC.Magic ( GHC/Magic.hs, dist/build/GHC/Magic.jar ) [5 of 5] Compiling GHC.Classes ( GHC/Classes.hs, dist/build/GHC/Classes.jar ) eta: panic! (the 'impossible' happened) (Eta version 0.7.0b1): getCgIdInfo: no module eta_B1

which probably happens here: getGetDepCgLoad (NonVoid (StgVarArg var)) = Right <$> cgLocation <$> getCgIdInfo var

rahulmutt commented 6 years ago

@jarekratajski A nice way to debug codegen stuff is to add -ddump-cg-trace into the project you want to debug. In this case it's ghc-prim - so go to libraries/ghc-prim/ghc-prim.cabal and add -ddump-cg-trace -ddump-to-file to the ghc-options: field. In the trace, you can see which precise part of the codegen choked on your changes and where eta_B1 comes from.

rahulmutt commented 6 years ago

That issue typically happens when you try to load an Id which hasn't been inserted into the bindings yet in the CodeGen monad. Seems like an ordering problem.

jarekratajski commented 6 years ago

Thanks. I've changed it ghc-options: -ddump-cg-trace -ddump-to-file -this-unit-id ghc-prim

but still do not see any traceCG outputs. even in the created file

jarek@ubuntu:~/.etlas/logs/eta-0.7.0.1$ cat ghc-prim-0.4.0.0-Jhi6UgHuZdoBZWUpVo3WKE.log 
etlas: Entering directory '.'
Configuring ghc-prim-0.4.0.0...
Preprocessing library for ghc-prim-0.4.0.0..
Building library for ghc-prim-0.4.0.0..
[1 of 5] Compiling GHC.Types        ( GHC/Types.hs, dist/build/GHC/Types.jar )
[2 of 5] Compiling GHC.Tuple        ( GHC/Tuple.hs, dist/build/GHC/Tuple.jar )
[3 of 5] Compiling GHC.CString      ( GHC/CString.hs, dist/build/GHC/CString.jar )
[4 of 5] Compiling GHC.Magic        ( GHC/Magic.hs, dist/build/GHC/Magic.jar )
[5 of 5] Compiling GHC.Classes      ( GHC/Classes.hs, dist/build/GHC/Classes.jar )
eta: panic! (the 'impossible' happened)
  (Eta version 0.7.0b1):
    getCgIdInfo: no module eta_B1

Please report this as a Eta bug: http://github.com/typelead/eta/issues

etlas: Leaving directory '.'
rahulmutt commented 6 years ago

I changed those exact same options and I was able to find it here:

libraries/ghc-prim/dist/build/GHC/CString.dump-cg-trace (for example)

jarekratajski commented 6 years ago

Thanks. That was it. Sorry, had no idea which file to look at. I see as last lines of Classes.dump-cg-trace

StgOpApp tagToEnum# [sat_sZF7] Bool
generating  $wccall
creating new closure...$wccall
StgOpApp __pkg_ccall_value ghc-prim False|False|2,True,False,False,"java/lang/Comparable","compareTo","(Ljava/lang/Object;)I" [sZFA] [ds_sZF8,
                                                                                                                                      ds1_sZF9,
                                                                                                                                      eta_B1] (# State#
                                                                                                                                                   RealWorld,
                                                                                                                                                 Int# #)
StgApp sat_sZFB [eta_B1]
cgIdApp: JumpToIt

so it must fail on deps <- dependencies args and then probably on this line

getGetDepCgLoad (NonVoid (StgVarArg var)) = Right <$> cgLocation <$> getCgIdInfo var

where getCgIdInfo fails

jarekratajski commented 6 years ago

After addeing some debug

JumpToIt label cgLocs mLne -> do
      traceCg (str "cgIdApp: JumpToIt")
      --  codes <- getNonVoidArgCodes args
      printBindings
      traceCg (ppr args)
      deps <- dependencies args
      traceCg (str "JAREK: deps calculated")
      emitMultiAssign cgLocs deps
      emit $ maybe  mempty
                                              (\(target, targetLoc) ->
                                               storeLoc targetLoc (iconst (locFt targetLoc) $ fromIntegral target))
                                               mLne
                  <>   goto label

I get the file as in attachment: Classes.dump-cg-trace.txt

I do not know why getCgIdInfo var does not work even though I've copied more or less code from the previous getNonVoidArgCodes function. (Very likely it is something obvious and stupid that I do not see. My experience with haskell is mostly trying to solve this bug + doing exercises with quicksort and fibonacci ... not that great).

jarekratajski commented 6 years ago

And I see. I have not fully copied logic from getNonVoidArgCodes .. cause I forgot about nonvoid check.

jarekratajski commented 6 years ago

PR is is ready for review. I've added test cases that work. Cycles also seem to work.

One strange thing is I had to actually reverse results of "topological sort" from Digraph see Line 83 from Layout.hs - but with this reverse all makes sense.

For reference commit #2582101 contains lot of debugs that can be used (show how graph looks like).

@rahulmutt Thank You for all the help! That was really nice way for me to learn Haskell (and do sth else than boring JavaEE/Tomcat stuff :-) I do at work ). I will have some time to fix polish this code if you have further hints. (Sorry for style and naming - probably I am doing Javaskell a little).

rahulmutt commented 6 years ago

Don't worry about styling/naming - I'll do it later in a cleanup commit. You got the core fix working which is enough :) I'll merge it now.

rahulmutt commented 6 years ago

Verified the master's fix is working. Closing. Thanks again @jarekratajski.