Closed rahulmutt closed 6 years ago
More info: https://gist.github.com/jarekratajski/8b28fa61bde683e4bfd6352c9068a6ec
This may be related to #478.
To be more precise: with -O0 problem prevails. Correct behavior only with -O0 AND BangPatterns params. I am just looking at code of eta compiler... do you have any hint where to look at first?
While analyzing decompiled code :
public final Closure apply6(StgContext var1, Closure var2, Closure var3, Closure var4, Closure var5,
Closure var6, Closure var7) {
Object var9 = null; //i guess n
Object var10 = null; //i guess sum
Object var11 = null; //i guess presum
boolean var8 = false;
while(true) {
while(var8) {
Main.sat_s7YH var12 = new Main.sat_s7YH(var3);
var1.R1 = var2;
Closure var13 = Classes.zeze().enter(var1).apply2(var1, (Closure)var9, var12);
if (!(var13 instanceof False)) {
return ((Closure)var10).evaluate(var1);
}
Main.sat_s7YM var14 = new Main.sat_s7YM(var4, (Closure)var10, (Closure)var11); //this must be sum + presum
Main.sat_s7YL var15 = new Main.sat_s7YL(var3, (Closure)var9); // this must be n-1
var9 = var15;
var10 = var14; // here is the error IMO
var11 = var14; //
/* this would be correct
var11 = var10;
var10 = var14;
*/
var8 = true;
}
var9 = var5;
var10 = var6;
var11 = var7;
var8 = true;
}
}
I am only guessing... but this brought me to idea of swapping order of params :
fibbtcoinner 0 presum sum = sum
fibbtcoinner n presum sum = fibbtcoinner (n-1) sum (sum + presum)
fibbtco n = fibbtcoinner n 0 1
and now this code gives correct results also in O2. Hope this helps.
Thanks for taking a look! So swapping the order of params fixes it. We still need to investigate why the original problem happened.
The way to debug stuff like this is: 1) Check that the optimized intermediate code (in this case, STG) does what's expected. To check this, you can:
-ddump-stg -ddump-to-file
flags when compiling directly with eta
Add eta-options: -ddump-stg -ddump-to-file
in the executable
stanza of your Cabal file if compiling with etlas
STG code is fairly simple to understand if you ignore all the unnecessary info, think of it as a very simple functional language.
If you have trouble following it, share in a Gist and I can help you out with the parts that are tricky.
2) Now if everything looks fine at the STG level, the next thing to look at is what you did above - the generated Java code. We have to make sure that the translation is happening correctly - most bugs come out of a mismatch between STG & the generated code, and most of the time the issue is with the generated code.
Disclaimer: I am total newbie to Haskell and GHC and even compilers. Sorry. But looking at STG I think it is proper
Fibb.fibbtcoinner
:: forall a_a7U9 a1_a7UA.
(GHC.Classes.Eq a_a7U9, GHC.Num.Num a_a7U9, GHC.Num.Num a1_a7UA) =>
a_a7U9 -> a1_a7UA -> a1_a7UA -> a1_a7UA
[GblId,
Arity=6,
Caf=NoCafRefs,
Str=<S(C(C(S))L),U(C(C1(U)),A)><L,U(A,C(C1(U)),A,A,A,A,C(U))><L,U(C(C1(U)),A,A,A,A,A,A)><L,U><L,U><L,U>,
Unf=OtherCon []] =
\r srt:SRT:[] [$dEq_s8MJ
$dNum_s8MK
$dNum1_s8ML
eta_s8MM
eta1_s8MN
eta2_s8MO]
let {
lvl1_s8MP [Occ=OnceL] :: a_a7UF
[LclId, Str=] =
\u srt:SRT:[] []
GHC.Num.fromInteger $dNum_s8MK Fibb.fibbnaive3; } in
let {
lvl2_s8MQ [Occ=OnceL] :: a_a7UF
[LclId, Str=] =
\u srt:SRT:[] [] GHC.Num.fromInteger $dNum_s8MK Fibb.fibbnaive1;
} in
let-no-escape {
fibbtcoinner1_s8MR [Occ=LoopBreaker]
:: a_a7UF -> a1_a7UG -> a1_a7UG -> a1_a7UG
[LclId, Arity=3, Str=<L,U><L,U><L,U>, Unf=OtherCon []] =
sat-only \r srt:SRT:[] [ds_s8MS sum_s8MT presum_s8MU]
case GHC.Classes.== $dEq_s8MJ ds_s8MS lvl2_s8MQ of _ [Occ=Dead] {
GHC.Types.False ->
let {
sat_s8MX [Occ=Once] :: a1_a7UG
[LclId, Str=] =
\u srt:SRT:[] [] GHC.Num.+ $dNum1_s8ML sum_s8MT presum_s8MU; } in
let {
sat_s8MW [Occ=Once] :: a_a7UF
[LclId, Str=] =
\u srt:SRT:[] [] GHC.Num.- $dNum_s8MK ds_s8MS lvl1_s8MP;
} in fibbtcoinner1_s8MR sat_s8MW sat_s8MX sum_s8MT;
GHC.Types.True -> sum_s8MT;
};
} in fibbtcoinner1_s8MR eta_s8MM eta1_s8MN eta2_s8MO;
Because the call (loop?) has in fact 2 distinct variables:
} in fibbtcoinner1_s8MR sat_s8MW sat_s8MX sum_s8MT;
Seems that Java produced somehow messes sat_s8MX with sum_s8MT.
But at least here at STG level they seem to be different.
Here's the output with -O2
which is less code to look at and makes the problem a lot more clear (it's just as you pointed out):
public static Closure main_fibbtcoinner(StgContext paramStgContext, Closure paramClosure1, Closure paramClosure2, Closure paramClosure3)
{
for (;;)
{
Type.eqIntegerzh(paramStgContext, paramClosure1, main6());
int i = paramStgContext.I1;
if (i == 1) {
break;
}
Closure localClosure1 = Type.plusInteger(paramStgContext, paramClosure2, paramClosure3);
Closure localClosure2 = Type.minusInteger(paramStgContext, paramClosure1, main5());
paramClosure1 = localClosure2;
paramClosure2 = localClosure1;
paramClosure3 = paramClosure2;
}
return paramClosure2.evaluate(paramStgContext);
}
This problem is extremely subtle which is why we didn't encounter this yet. It is more likely to happen with very small and simple recursive functions which don't transform all their arguments when making a recursive call.
The problem is that last line paramClosure3 = paramClosure2
which corresponds to passing the sum
parameter in the recursive call in a different position. The issue is that this needs to be re-ordered. A topological sort must be done to ensure that we don't set the bindings in incorrect order.
It should be done like this:
paramClosure3 = paramClosure2;
paramClosure1 = localClosure2;
paramClosure2 = localClosure1;
or even this:
paramClosure1 = localClosure2;
paramClosure3 = paramClosure2;
paramClosure2 = localClosure1;
You made nice observations for not looking at this stuff before :)
Does it have to do something with?
-- TODO: Verify that this is valid in all cases,
-- otherwise fall back on the strongly connected components
-- algorithm a la GHC
multiAssign :: [CgLoc] -> [Code] -> Code
multiAssign locs codes = fold $ zipWith storeLoc locs codes
in Layout.sh
Yes exactly that :) It turns out that #478 has the same problem so that will also be fixed if we fix this.
The tricky part here is that we can't really inspect inside a Code
data type since it just has a monadic action that generates it when supplied with a constant pool and some other parameters. This lack of introspection actually cripples our ability to do bytecode optimization, so we really need to create an API to work with the direct bytecode AST instead.
We'll need to expose a simple analysis in codec-jvm
that runs the Instr
monad and inspects the resulting ByteString
for the *load
instructions and returns a [Int]
- all the slots that were loaded from in the code. Would you be interested in giving this a try?
For reference, this is GHC's code to handle the problem: https://github.com/ghc/ghc/blob/master/compiler/codeGen/StgCmmUtils.hs#L375-L444
I understand the problem and more or less get the code. I will try - but as for now I am too weak to really solve it. So please do not count on me too much.
Sure no worries. Give it a try and share any attempts here.
@jarekratajski I was taking a look at multiAssign
usages and it turns out that both code locations allow you to access CgLoc
(Code-generation locations) where they are currently sending in Code
. CgLoc
's you can at least pattern match on so it would be easier to construct the topological sort.
I did small trial where I fix only one call of multiAssign and it seems to work. There are some dirty places there. The fix also does not solve the problem of potential Cycles. I've created another sample with cycle and it is still broken. I will continue working on this.
Ok great. I gave some feedback on your work so far - great start! Please put your test cases in tests/basic/multiAssign/
. We don't run those tests yet, but we will eventually and it'll be great to keep track of them.
@jarekratajski Do you have a timeline for the fix for this? Ideally, I'd like to see a fix by the end of the week since this is a pretty important bug. If your current schedule does not allow for that, please create a PR with the work you've done so far and I'll finish up the remaining bits.
@rahulmutt I will have some time in next 48 hours for that - so hopefully I can do it or at least push something better.
@jarekratajski Great, thanks for the update.
@rahulmutt I need some help. I've tried to port 1 by 1 iines from GHC with Digraph and it does not build packages anymore.
And I do not know how to "debug" it efficiently.
I suspect I've mixed somehow "stores" with "loads" in Layout.hs unscramble or I did not get how CodeGen monad works Line 161 - 162 in Expr.hs ... are not codes generated by emitMultiAssign lost?
If I only knew how to print debugs during build process sensibly maybe I would find out.
@jarekratajski Can you show me a diff of your changes? I can take a look at where you are and see if I can help you forward.
Are you getting compile-time errors or run-time errors? I can help out with debugging methods if you're specific on what kind of info you need.
moment... maybe I did very stupid thing and compiler even says me what code is as ion https://github.com/jarekratajski/eta/commit/3ac051d667644dc83cd1afa9712cb41495af4e80 basically I was preparing PR there
I noticed that you changed the version number from 0.7.0
to 0.0.9
- that may be causing problems. The build relies on the fact that all the version numbers are consistent. What is happening when you run ./cleaninstall.sh
?
What I need is to compile eta - without recompiling base packages. I've "screweed" something but I need to test it on a single small file. Otherwise I wil not find out.
I did a stupid pull request to easily track changes.
Version number was changed before - that is now fixed.
cleaninstall.sh results in
uilding library for ghc-prim-0.4.0.0..
[1 of 5] Compiling GHC.Types ( GHC/Types.hs, dist/build/GHC/Types.jar )
[2 of 5] Compiling GHC.Tuple ( GHC/Tuple.hs, dist/build/GHC/Tuple.jar )
[3 of 5] Compiling GHC.CString ( GHC/CString.hs, dist/build/GHC/CString.jar )
eta: panic! (the 'impossible' happened)
(Eta version 0.7.0b1):
Prelude.head: empty list
Must be that I somehow do not emit correct assigns now.
Or after last change in Layout.hs I get now. [4 of 5] Compiling GHC.Magic ( GHC/Magic.hs, dist/build/GHC/Magic.jar ) [5 of 5] Compiling GHC.Classes ( GHC/Classes.hs, dist/build/GHC/Classes.jar ) eta: panic! (the 'impossible' happened) (Eta version 0.7.0b1): getCgIdInfo: no module eta_B1
which probably happens here: getGetDepCgLoad (NonVoid (StgVarArg var)) = Right <$> cgLocation <$> getCgIdInfo var
@jarekratajski A nice way to debug codegen stuff is to add -ddump-cg-trace
into the project you want to debug. In this case it's ghc-prim
- so go to libraries/ghc-prim/ghc-prim.cabal
and add -ddump-cg-trace -ddump-to-file
to the ghc-options:
field. In the trace, you can see which precise part of the codegen choked on your changes and where eta_B1
comes from.
That issue typically happens when you try to load an Id
which hasn't been inserted into the bindings yet in the CodeGen
monad. Seems like an ordering problem.
Thanks. I've changed it ghc-options: -ddump-cg-trace -ddump-to-file -this-unit-id ghc-prim
but still do not see any traceCG outputs. even in the created file
jarek@ubuntu:~/.etlas/logs/eta-0.7.0.1$ cat ghc-prim-0.4.0.0-Jhi6UgHuZdoBZWUpVo3WKE.log
etlas: Entering directory '.'
Configuring ghc-prim-0.4.0.0...
Preprocessing library for ghc-prim-0.4.0.0..
Building library for ghc-prim-0.4.0.0..
[1 of 5] Compiling GHC.Types ( GHC/Types.hs, dist/build/GHC/Types.jar )
[2 of 5] Compiling GHC.Tuple ( GHC/Tuple.hs, dist/build/GHC/Tuple.jar )
[3 of 5] Compiling GHC.CString ( GHC/CString.hs, dist/build/GHC/CString.jar )
[4 of 5] Compiling GHC.Magic ( GHC/Magic.hs, dist/build/GHC/Magic.jar )
[5 of 5] Compiling GHC.Classes ( GHC/Classes.hs, dist/build/GHC/Classes.jar )
eta: panic! (the 'impossible' happened)
(Eta version 0.7.0b1):
getCgIdInfo: no module eta_B1
Please report this as a Eta bug: http://github.com/typelead/eta/issues
etlas: Leaving directory '.'
I changed those exact same options and I was able to find it here:
libraries/ghc-prim/dist/build/GHC/CString.dump-cg-trace
(for example)
Thanks. That was it. Sorry, had no idea which file to look at. I see as last lines of Classes.dump-cg-trace
StgOpApp tagToEnum# [sat_sZF7] Bool
generating $wccall
creating new closure...$wccall
StgOpApp __pkg_ccall_value ghc-prim False|False|2,True,False,False,"java/lang/Comparable","compareTo","(Ljava/lang/Object;)I" [sZFA] [ds_sZF8,
ds1_sZF9,
eta_B1] (# State#
RealWorld,
Int# #)
StgApp sat_sZFB [eta_B1]
cgIdApp: JumpToIt
so it must fail on
deps <- dependencies args
and then probably on this line
getGetDepCgLoad (NonVoid (StgVarArg var)) = Right <$> cgLocation <$> getCgIdInfo var
where getCgIdInfo fails
After addeing some debug
JumpToIt label cgLocs mLne -> do
traceCg (str "cgIdApp: JumpToIt")
-- codes <- getNonVoidArgCodes args
printBindings
traceCg (ppr args)
deps <- dependencies args
traceCg (str "JAREK: deps calculated")
emitMultiAssign cgLocs deps
emit $ maybe mempty
(\(target, targetLoc) ->
storeLoc targetLoc (iconst (locFt targetLoc) $ fromIntegral target))
mLne
<> goto label
I get the file as in attachment: Classes.dump-cg-trace.txt
I do not know why getCgIdInfo var does not work even though I've copied more or less code from the previous getNonVoidArgCodes function. (Very likely it is something obvious and stupid that I do not see. My experience with haskell is mostly trying to solve this bug + doing exercises with quicksort and fibonacci ... not that great).
And I see. I have not fully copied logic from getNonVoidArgCodes .. cause I forgot about nonvoid check.
PR is is ready for review. I've added test cases that work. Cycles also seem to work.
One strange thing is I had to actually reverse results of "topological sort" from Digraph see Line 83 from Layout.hs - but with this reverse all makes sense.
For reference commit #2582101 contains lot of debugs that can be used (show how graph looks like).
@rahulmutt Thank You for all the help! That was really nice way for me to learn Haskell (and do sth else than boring JavaEE/Tomcat stuff :-) I do at work ). I will have some time to fix polish this code if you have further hints. (Sorry for style and naming - probably I am doing Javaskell a little).
Don't worry about styling/naming - I'll do it later in a cleanup commit. You got the core fix working which is enough :) I'll merge it now.
Verified the master
's fix is working. Closing. Thanks again @jarekratajski.
@jarekratajski reported this:
produces
when it should produce:
Reproduced on v0.7.