typelead / eta

The Eta Programming Language, a dialect of Haskell on the JVM
https://eta-lang.org
BSD 3-Clause "New" or "Revised" License

Investigate slow queens problem solution #200

Closed · puffnfresh closed this issue 7 years ago

puffnfresh commented 7 years ago

From @Ingo60 on Twitter:

eta: 3:15.43 301928k

frege 0:34.14 463820k

ghc 0:03.33 5240k

The code is:

https://gist.github.com/Ingo60/b8bfc365d8114144e4f750234a0ac306

Compiling with -O2 doesn't seem to help much. I think we should have a good look into this.

puffnfresh commented 7 years ago

I did a sampled profile. It spends 79 seconds (35%) of the time in checkForStackFrames. That's a problem.

It also spends 29% of the time in ghc.Classes$neInt.enter. That seems pretty high.

puffnfresh commented 7 years ago

I've chucked up the YourKit snapshot here:

https://brianmckenna.org/files/RunQueens-2017-01-23-shutdown-1.snapshot

rahulmutt commented 7 years ago

Thanks for tracking this!

I wonder if this particular queens implementation just happens to tickle a bottleneck in the implementation: Eta's alignment of the JVM stack with the Eta stack.

The standard Haskell solution to the queens problem from nofib runs in 2.2 seconds, and after it reaches steady state (classes are loaded, the JIT kicks in) it comes down to around 150 ms; see this gist.

Running the exact implementation provided by Ingo with the Java Microbenchmark Harness (JMH, the proper way to do JVM benchmarks), we get this. The fact that it jumps from 280 seconds to 1 ms in just one iteration is something we should take a look at :) But the point here is that the JIT wins: Eta has a horribly slow startup time, which we can improve over time, but nothing to worry about for now.
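
Roughly, the harness looks like the sketch below (the main.Main entry point and the parameters here are assumed for illustration; the gist has the exact setup):

package bench;

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;

public class QueensBench {
    @Benchmark
    @BenchmarkMode({Mode.SingleShotTime, Mode.AverageTime})
    @OutputTimeUnit(TimeUnit.MILLISECONDS)
    @Fork(1)
    public void runQueens() {
        // The first iteration pays class loading and interpretation;
        // later iterations run JIT-compiled code, hence the huge drop.
        main.Main.main(new String[0]);
    }
}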

I've provided all the commands necessary to reproduce the results in the two gists. It would be great if you could verify and let me know whether you get the same results.

puffnfresh commented 7 years ago

Well, I honestly think this is amazing. I had to modify nofib slightly to get it to work on Linux, but I was able to reproduce it.

I couldn't believe it, so I put the JAR onto the classpath of the Scala REPL:

(screenshot of the Scala REPL session)

Ran it once and it took 5 minutes or so. Ran it again and I'm getting around 0.6 seconds to run the example.

@Ingo60 the real problem is that we strictly load all module dependencies on class initialisation. We load lots and lots of code that we don't use, and this is what takes the majority of the time.
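
Conceptually, the generated initialisation behaves like this sketch (invented names, not the actual codegen):

final class BaseModule {
    static void init() { /* would load every class in ghc.Base, etc. */ }
}

final class EnumModule {
    static void init() { /* likewise for ghc.Enum, and so on per import */ }
}

final class MainModule {
    static {
        // Eager: every dependency is forced up front when the main class
        // initialises, whether or not the program ever calls into it.
        BaseModule.init();
        EnumModule.init();
    }
}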

I'm surprised we're beating GHC once we take away initialisation and add in the JIT. Is the Java JIT making Eta even faster than GHC!?

GHC      Eta
2.398s   0.551s
psibi commented 7 years ago

Wow, this is amazing. And that too without even trying to do any optimization on the Eta side.

rahulmutt commented 7 years ago

@puffnfresh Even I had trouble believing it. I feel like the encoding we're doing in Eta is perfectly suited to the JIT. Moreover, I think it's working in these small examples because a lot of the call sites end up monomorphic or bimorphic, which the JIT can optimise really well; we'll have to benchmark a real-world application with huge sum types to see what happens.

We probably need to ask a JVM expert to make sure the benchmarks are set up correctly, because the measurements we're seeing may not be measuring what we think they're measuring.

rahulmutt commented 7 years ago

@psibi Right! The current codegen is rather naive in a lot of cases, so we pay for that in startup time right now. More optimisation should reduce the startup time and increase steady-state performance as well.

rahulmutt commented 7 years ago

Closing this, as we have more or less figured out the problem is the class loading. Any further discussion should happen on the lazy initialisation issue that I just opened.

Ingo60 commented 7 years ago

> I wonder if this particular queens implementation just happens to tickle a bottleneck in the implementation: Eta's alignment of the JVM stack with the Eta stack.

Sorry for that, Rahul! I promise it was my very first attempt at Eta, and I chose this implementation specifically because it is not my own code.

puffnfresh commented 7 years ago

We were very wrong about this. @Ingo60 pointed out that running the problem with n=10 still takes about 10 seconds, so it can't be mostly class loading.

I think what's happening is that a thunk is getting created: the first run stores the result, and successive runs just read the memoized value. I'm going to test this.
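
If that's right, the compiled top-level binding behaves roughly like this plain-Java sketch (invented names; the bitmask solver is just a stand-in for allSolutions):

final class SolutionsThunk {
    private static Long cached; // null until the thunk is first forced

    static long force() {
        if (cached == null) {
            // First run: evaluate and memoize. Note the problem size is
            // baked in, just as the benchmark hard-coded n = 12.
            cached = countQueens(12, 0, 0L, 0L, 0L);
        }
        return cached; // later runs only read the cached value
    }

    // Stand-in solver: counts n-queens placements by bitmask backtracking.
    private static long countQueens(int n, int row, long cols, long d1, long d2) {
        if (row == n) return 1;
        long count = 0;
        for (int c = 0; c < n; c++) {
            long col = 1L << c, u = 1L << (row + c), v = 1L << (row - c + n);
            if ((cols & col) == 0 && (d1 & u) == 0 && (d2 & v) == 0)
                count += countQueens(n, row + 1, cols | col, d1 | u, d2 | v);
        }
        return count;
    }
}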

puffnfresh commented 7 years ago

I modified main to look like:

import Data.Maybe (listToMaybe)
import System.Environment (getArgs)

main :: IO ()
main = do
  arg <- fmap listToMaybe getArgs
  let n = maybe 12 read arg :: Int
  putStr (show n ++ " Queens, ")
  putStr (show $ length $ allSolutions n n)
  putStr " Solutions.\n"

And now each run looks something like:

(screenshot of the timing output)

So we're slow: not five-minutes slow, but nowhere near GHC or Frege.

rahulmutt commented 7 years ago

That looks a lot more reasonable, given that all function calls happen via the stack and not through arguments (an optimisation we should implement soon). Thanks for the analysis! In nofib, all the programs take their parameters from command-line arguments, so they don't suffer from the same problem as this particular benchmark when run multiple times. But as this example shows, we should be very careful when interpreting results, since an already-evaluated thunk can make later executions faster.

jarekratajski commented 6 years ago

I've analyzed the case with simple tools. There seem to be two issues. First, if the program is started with -XX:+PrintCompilation, I see a log like this:

    178  260       3       ghc_prim.ghc.Types::DFalse (39 bytes)
compilation bailout: invalid parsing
    178  260       3       ghc_prim.ghc.Types::DFalse (39 bytes)   COMPILE SKIPPED: invalid parsing (retry at different tier)
    178  258       3       base.ghc.Base$id::apply1 (19 bytes)
    178  259       3       base.ghc.Base::id (6 bytes)
    178  261       3       ghc_prim.ghc.Types::DTrue (39 bytes)
compilation bailout: invalid parsing
    179  261       3       ghc_prim.ghc.Types::DTrue (39 bytes)   COMPILE SKIPPED: invalid parsing (retry at different tier)
    179  262       3       ghc_prim.ghc.Types::DZMZN (39 bytes)
compilation bailout: invalid parsing
    179  262       3       ghc_prim.ghc.Types::DZMZN (39 bytes)   COMPILE SKIPPED: invalid parsing (retry at different tier)
    179  267       3       base.ghc.Enum::$fEnum_Int_$csucc (23 bytes)
    179  275     n 0       java.lang.Object::getClass (native)   
    179  268       3       main.Main::okToAdd1 (45 bytes)
compilation bailout: invalid parsing
    179  268       3       main.Main::okToAdd1 (45 bytes)   COMPILE SKIPPED: invalid parsing (retry at different tier)
    179  246       3       eta.runtime.thunk.Thunk::blackHole (35 bytes)
    179  269       1       java.lang.reflect.Field::getDeclaringClass (5 byt

This means the code cannot be compiled by HotSpot; the reason is an unbalanced synchronized block. It is even consistent with this strange decompilation result:

public static Closure DZMZN() {
    if (DZMZN == null) {
        Class var0 = Types.class;
        synchronized (Types.class) {} //????
        if (DZMZN == null) {
            DZMZN = new ZMZN();
        }
    }
    return DZMZN;
}

I've analyzed the bytecode and it looks OK; the problem seems to be a missing exception table.

// This is the Eta bytecode:
  public static eta.runtime.stg.Closure DZMZN();
    Code:
       0: getstatic     #127                // Field DZMZN:Leta/runtime/stg/Closure;
       3: ifnull        9
       6: goto          35
       9: ldc           #3                  // class ghc_prim/ghc/Types
      11: dup
      12: astore_0
      13: monitorenter
      14: getstatic     #127                // Field DZMZN:Leta/runtime/stg/Closure;
      17: ifnull        23
      20: goto          33
      23: new           #129                // class ghc_prim/ghc/types/datacons/ZMZN
      26: dup
      27: invokespecial #131                // Method ghc_prim/ghc/types/datacons/ZMZN."<init>":()V
      30: putstatic     #127                // Field DZMZN:Leta/runtime/stg/Closure;
      33: aload_0
      34: monitorexit
      35: getstatic     #127                // Field DZMZN:Leta/runtime/stg/Closure;
      38: areturn
// This is the same code as produced by normal javac:
public static aa.Closure DZMZN();
    Code:
       0: getstatic     #2                  // Field DZMZN:Laa/Closure;
       3: ifnonnull     37
       6: ldc           #3                  // class aa/Types
       8: dup
       9: astore_0
      10: monitorenter
      11: getstatic     #2                  // Field DZMZN:Laa/Closure;
      14: ifnonnull     27
      17: new           #4                  // class aa/ZMZN
      20: dup
      21: invokespecial #5                  // Method aa/ZMZN."<init>":()V
      24: putstatic     #2                  // Field DZMZN:Laa/Closure;
      27: aload_0
      28: monitorexit
      29: goto          37
      32: astore_1
      33: aload_0
      34: monitorexit
      35: aload_1
      36: athrow
      37: getstatic     #2                  // Field DZMZN:Laa/Closure;
      40: areturn
    Exception table:
       from    to  target type
          11    29    32   any
          32    35    32   any

I am not sure, but it seems that the lack of exception handling at the bytecode level makes HotSpot believe that monitorexit will not be performed. This part must still be implemented (I've seen a TODO in the compiler).
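
In source terms, the accessor should correspond to ordinary double-checked locking, which javac compiles to exactly the second listing above (stub types here, for illustration only):

class Closure {}

class ZMZN extends Closure {}

class Types {
    private static Closure DZMZN; // fields and methods live in separate namespaces

    public static Closure DZMZN() {
        if (DZMZN == null) {
            synchronized (Types.class) {
                // javac wraps this region in an any-type exception handler,
                // so monitorexit runs even if initialization throws.
                if (DZMZN == null) {
                    DZMZN = new ZMZN();
                }
            }
        }
        return DZMZN;
    }
}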

I found this doing sample-based profiling: the mentioned functions are visible at the top. A fix that removed the synchronized section entirely improved the program by ~20%. It is very likely that, in a single-threaded app, correctly generated sync code would be almost as fast.

The second thing that seems to slow down this implementation is Thunk.clear(): clearing the fields of a thunk. Removing this part as well (which would be unsafe in a normal program) showed a significant speedup, ~15%. It seems the clearing code for thunks could be generated during compilation rather than done via reflection. The cache of Fields from reflection works; only the setField(this, null) in clear() seems to be slow. I am analyzing this clear() part a bit more, because it is quite a strange observation.
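
To make the contrast concrete, here is a sketch of the two clearing strategies (invented names, not the actual Eta runtime):

abstract class Thunk {
    // Reflective clearing, roughly what the observation above describes:
    // even with cached Field objects, Field.set(this, null) stays slow.
    void clearReflectively() throws IllegalAccessException {
        for (java.lang.reflect.Field f : getClass().getDeclaredFields()) {
            if (!f.getType().isPrimitive()
                    && !java.lang.reflect.Modifier.isStatic(f.getModifiers())) {
                f.setAccessible(true);
                f.set(this, null);
            }
        }
    }

    // Generated alternative: each compiled thunk subclass overrides this
    // with plain field stores, no reflection at all.
    abstract void clear();
}

final class QueensThunk extends Thunk {
    Object freeVar1, freeVar2; // captured environment

    @Override
    void clear() {
        freeVar1 = null;
        freeVar2 = null;
    }
}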

rahulmutt commented 6 years ago

Thanks for the analysis @jarekratajski!

Regarding the incomplete synchronized block: it's a known problem that's responsible for https://github.com/typelead/eta/issues/579 as well. The reason we use synchronized for the lazy initialisation of static thunks is to keep the promise of laziness: evaluating an expression exactly once. For static values and constants, it's done to make sure values are shared.

There is another alternative we can look at, though: the holder pattern. For each custom static thunk, we can generate a static field in the respective class and simply reference it during initialisation; this would significantly reduce the size of the generated initialisation bytecode, so it may fall within the InlineSmallCode threshold and get inlined faster (within C1 only).
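
A sketch of that lazy holder idiom (stub types, not the final codegen):

class Closure {}

class ZMZN extends Closure {}

class Types {
    // The JVM runs a class's <clinit> exactly once, under its own internal
    // locking, so the accessor needs no synchronized block at all.
    private static final class DZMZNHolder {
        static final Closure VALUE = new ZMZN(); // initialised on first access
    }

    public static Closure DZMZN() {
        return DZMZNHolder.VALUE; // tiny body, easy for the JIT to inline
    }
}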

Thunk.clear() was done as a temporary measure to plug memory leaks that were being observed. The real solution is a proper implementation of #79 and #339: efficient re-use of local variables.

rahulmutt commented 6 years ago

@jarekratajski I implemented #663, which takes care of the lazy holder pattern mentioned above. In the cases where it's not possible to do it that way, I fixed the synchronized bytecode generation so that it matches the monitors properly via the exception table. The observed perf increase for this benchmark on a single-shot cold run was ~10% on my system.