Exception resume fix - Githubissues

Dan12 commented 5 years ago

It seems like whatever compiles the .ll files ignores the resume block and doesn't insert the proper call the unwind.h library. This makes the explicit call to the correct function.

dz333 commented 5 years ago

I can confirm that this is also failing on my (Linux) build. Before merging I just want to double check and see why the built-in LLVM resume isn't working properly.

Dan12 commented 5 years ago

Yeah, I would also be curious to see why that is because according to the LLVM user guide it should just work. When I disassembled, it seemed like it was just falling through to the next basic block or just returning.

dz333 commented 5 years ago

After spending some time looking at this, it seems that our exception implementation is a little confused and half uses LLVM features properly and half doesn't. I suspect that since we're only partially implementing this API that is the reason the resume doesn't work. (e.g. we aren't properly using multi exception type indexing and have worked around it with weird 'rethrow' semantics.)

If you're going to be working on exception implementation in the future, feel free to rethink how it is implemented and properly implement the LLVM-style Try-Catch-Finally. (https://releases.llvm.org/5.0.0/docs/ExceptionHandling.html)

andrewcmyers commented 5 years ago

From my conversations with Daniel, it sounds like the LLVM exception support is actually quite expensive. It may actually be better to figure out some other approach entirely, though some fighting against LLVM may be entailed. Maybe there's a paper in figuring out how LLVM should have supported exceptions...

dz333 commented 5 years ago

That is also reasonable - either way, we should probably not be in this "half-in" / "half-out" state that we're in right now.

gharrma commented 5 years ago

Sorry to barge in on this, but here are some thoughts off the top of my head in case any of them are useful!

(1) The reason the exception implementation does a 'rethrow' instead of 'resume' in some cases is because: in C++, if an exception is thrown while another is already in flight (i.e., if an exception is thrown from a destructor participating in a stack unwind), then the program will simply abort(!). This is in stark contrast to Java, where finally-blocks---which are the analogue to C++ destructors---are allowed to throw new exceptions that replace the previous one in flight. Unfortunately, LLVM implements the C++ behavior. So, when there is a finally-block, we have to catch the exception outright, run the finally-block, then re-throw. JLangTryExt.java:168 has a comment explaining this.

(2) Aside from (1), is there any other sense in which we deviate from proper LLVM exception handling? I'm not sure what you mean by "multi exception type indexing," but I remember being pretty confident in our conformance to the LLVM exception API.

(3) Please feel free to post the .ll file corresponding to the failing test case here (before the fix). I might still remember enough to spot something.

(4) Just for my own curiosity: why do you consider LLVM exceptions to be expensive? They are zero-cost in the usual sense for the no-exception path. I admit I don't know too much about the costs incurred when an exception is actually thrown, but I would guess it's implemented in the same way that most compilers do it (stack walking with exception side-tables).

Also, hi everyone! :)

Dan12 commented 5 years ago

Here is the relevant code snippet. This would print "Caught runtime exception" and then "Should not Print". I can confirm that the code does enter the catch.next label but I'm not sure where it goes afterward. Also, we compared popping a stack frame with exceptions to simply returning and found it to be ~100 times slower (so a few hundred assembly instructions per exception thrown).

continue7:                                        ; preds = %load.class6, %continue2
  invoke void @Polyglot_ExceptionsNested_catchesRuntimeExn__()
          to label %invoke.cont unwind label %lpad.catch, !dbg !19

lpad.catch:                                       ; preds = %invoke.cont17, %continue12, %load.class11, %continue7
  %lpad.catch.res = landingpad { i8*, i32 }
          catch i8* @Polyglot_java_lang_Exception_class_id, !dbg !15
  %exn = extractvalue { i8*, i32 } %lpad.catch.res, 0, !dbg !15
  %sel = extractvalue { i8*, i32 } %lpad.catch.res, 1, !dbg !15
  %catch.matches = icmp eq i32 %sel, 1, !dbg !15
  br i1 %catch.matches, label %catch.java.lang.Exception, label %catch.next, !dbg !15

try.end:                                          ; preds = %continue22, %invoke.cont19
  ret void, !dbg !14

invoke.cont:                                      ; preds = %continue7
  %class13 = load %class.java_lang_Class*, %class.java_lang_Class** @Polyglot_java_lang_System_class, !dbg !20
  %class.null14 = icmp eq %class.java_lang_Class* %class13, null, !dbg !20
  br i1 %class.null14, label %load.class11, label %continue12, !dbg !20

load.class11:                                     ; preds = %invoke.cont
  %call16 = invoke %class.java_lang_Class* @Polyglot_java_lang_System_load_class()
          to label %invoke.cont15 unwind label %lpad.catch, !dbg !20

continue12:                                       ; preds = %invoke.cont15, %invoke.cont
  %load.out = load %class.java_io_PrintStream*, %class.java_io_PrintStream** @Polyglot_java_lang_System_out, !dbg !20
  invoke void @InternStringLit({ %cdv_ty.java_lang_String*, i8*, %class.jlang_runtime_Array*, i16 }* @_string_lit_83_104_111_117_108_100_32_110_111_116_32_112_114_105_110_116_)
          to label %invoke.cont17 unwind label %lpad.catch, !dbg !21

invoke.cont15:                                    ; preds = %load.class11
  br label %continue12, !dbg !20

invoke.cont17:                                    ; preds = %continue12
  %gep = getelementptr %class.java_io_PrintStream, %class.java_io_PrintStream* %load.out, i32 0, i32 0, !dbg !20
  %load.dv = load %cdv_ty.java_io_PrintStream*, %cdv_ty.java_io_PrintStream** %gep, !dbg !20
  %gep18 = getelementptr %cdv_ty.java_io_PrintStream, %cdv_ty.java_io_PrintStream* %load.dv, i32 0, i32 3, i32 44, !dbg !20
  %load.dv.method = load void (%class.java_io_PrintStream*, %class.java_lang_String*)*, void (%class.java_io_PrintStream*, %class.java_lang_String*)** %gep18, !dbg !20
  invoke void %load.dv.method(%class.java_io_PrintStream* %load.out, %class.java_lang_String* bitcast ({ %cdv_ty.java_lang_String*, i8*, %class.jlang_runtime_Array*, i16 }* @_string_lit_83_104_111_117_108_100_32_110_111_116_32_112_114_105_110_116_ to %class.java_lang_String*))
          to label %invoke.cont19 unwind label %lpad.catch, !dbg !20

invoke.cont19:                                    ; preds = %invoke.cont17
  br label %try.end, !dbg !15

catch.java.lang.Exception:                        ; preds = %lpad.catch
  call void @llvm.dbg.declare(metadata %class.java_lang_Exception** %e, metadata !22, metadata !17), !dbg !24
  %call20 = call i8* @extractJavaExceptionObject(i8* %exn), !dbg !15
  %cast = bitcast i8* %call20 to %class.java_lang_Exception*, !dbg !15
  store %class.java_lang_Exception* %cast, %class.java_lang_Exception** %e, !dbg !15
  %class23 = load %class.java_lang_Class*, %class.java_lang_Class** @Polyglot_java_lang_System_class, !dbg !25
  %class.null24 = icmp eq %class.java_lang_Class* %class23, null, !dbg !25
  br i1 %class.null24, label %load.class21, label %continue22, !dbg !25

catch.next:                                       ; preds = %lpad.catch
  resume { i8*, i32 } %lpad.catch.res

load.class21:                                     ; preds = %catch.java.lang.Exception
  %call25 = call %class.java_lang_Class* @Polyglot_java_lang_System_load_class(), !dbg !25
  br label %continue22, !dbg !25

gharrma commented 5 years ago

Hm, can you post the whole .ll file? IICU it should not even be printing "Caught runtime exception," which means the bug is in the translation for catchesRuntimeExn(), not main(). (Or both, who knows.)

gharrma commented 5 years ago

Was able to reproduce locally, and I too am really confused about what's going on in that test case. So, I'll stop meddling for now :)

dz333 commented 5 years ago

Thanks for your input Matt! You definitely have a lot more familiarity with the code base and the plan for how this was supposed to work so your thoughts are very helpful. Feel free to chime in whenever you want! As far as the 'multi-catch' goes -> I meant that we weren't using the landingpad <type> ... support. All of our exceptions are "Java Exceptions" and then we insert extra code to check the class type. I thought that checking code did something extra and weird (with rethrowing excepitons), but upon re-reading some of the generated *.ll files it looks reasonable to me.

I'm not sure if you considered trying to use the LLVM support for multi-catch and decided against it or not - definitely works the way it is but might be cleaner / more efficient to use it if there isn't some complication that I'm missing.

Dan12 commented 5 years ago

Even if you use multi-catch, the landing pad will still be entered in the 2nd cleanup phase and you need to resume exception propagation (the way it signals cleanup only is to set the select part of the exception struct to 0). When I was investigating why resume didn't work, I looked at the LLVM exception example: https://github.com/llvm-mirror/llvm/blob/master/examples/ExceptionDemo/ExceptionDemo.cpp

In there I am pretty sure they explicitly make the call to _Unwind_Resume instead of just generating a resume instruction but I don't know why.

gharrma commented 5 years ago

Ah, I think there's some confusion about 'multi-catch', because we actually do use landingpad <type> ...; here's an example emitted from the Exceptions.java test case:

lpad.catch:
  %lpad.catch.res = landingpad { i8*, i32 }
          catch i8* @Polyglot_java_lang_Exception_class_id
          catch i8* @Polyglot_java_lang_Error_class_id, !dbg !129

Note, runtime/native/exception.cpp line 306 is where we make the call to instanceof to check whether the exception currently in-flight matches some catch clause declared in a landingpad instruction in a given frame. I.e., the exception class type is checked by the stack walker, not by the generated LLVM! The 'extra code' you saw after the landingpad in the generated LLVM is actually comparing a simple integer index returned by the stack walker indicating which catch clause it found to be matched.

(I wish LLVM would just let you declare a label for each catch clause and have the stack walker jump directly there instead of handing you back an index---but, you will see the same pattern in LLVM generated from C++ code. It's very strange.)

Anyway, our exception implementation should do the right thing: if none of the catch clauses in a try-catch block match an in-flight exception (and if there is no finally-block), then the stack walker will know it (due to our instanceof checks in exception.cpp) and should skip right over that frame entirely. It should not jump to the landingpad corresponding to that try-catch block, and thus it should not even get to the resume there at all.

(Now, the next good question to ask is: "If the stack-walker only stops at try-catch blocks for which a catch clause matches, then why do we ever need to emit a resume instruction? After all, if a catch clause does match, then the exception is no longer in-flight and should not need to be resumed." The answer is: I have no idea. But I recall reading somewhere that the catch-clause-index-checking code after a landing pad should be prepared to handle a situation in which the stack walker jumps there even when no catch clauses have been matched. In fact, you will see the same thing in the LLVM IR for C++: it always has a backup resume branch just in case. My best guess is that this might be related to being robust against optimizations which move your landing pads around, but I'm not sure.)

Of course, the failing test case here completely contradicts what I said above about "skipping right over frames that don't have a matching catch clause," and that's the part I'm quite confused about. When debugging I was seeing some completely inexplicable control flow and had to give to give up eventually :)

By the way, I should mention that I've never really understood the code in runtime/native/exception.cpp---I think we got that code from an LLVM tutorial/demo and then tweaked it slightly to do our instanceof checks. I wonder whether the exception bug might lie there rather than in the generated IR. After all, that's the code that manages the stack walker's search phase, cleanup phase, etc., and determines which landing pads to stop at. It also does things like directly set the instruction pointer, which might explain some of the "inexplicable control flow" I was seeing...

(Also, regarding performance, note that exception.cpp probably plays a significant role in the performance of exception handling. It's possible that the C++ stack walker is way more efficient than ours, for example, or that the instanceof checks are the bottleneck in our stack walker.)

Anyway, good luck! I'm all for re-writing or tweaking the exception implementation if it fixes things.

Dan12 commented 5 years ago

I think we still have to enter the landing pad because we may have finally blocks that need to do some cleanup as the stack is unwound. Also, I think our exception.cpp is based on the C++ stack walker code so I'm not sure if we can look there for performance gains.

The way I understand it is that the itanium exception handling mechanism does 2 passes: a search and a cleanup pass. LLVM even admits that this isn't necessary even for C++ exceptions. The cleanup pass needs to enter all landing pads as it unwinds stack frames and it passes a pointer to the exception that is being propegated and a selector. Typically, the selector corresponds to the catch clause which is supposed to handle the exception, so you should be able to do a switch statement at the landing pad and jump to the correct catch handler. The selector can also be 0, which means that this landing pad should do cleanup only (so execute the finally clause) and then resume the exception with a call to _Unwind_Resume. I suspect that since LLVM want's to be agnostic to which unwinding library you use, it can't actually generate the call to _Unwind_Resume and implements some wierd default translation for resume.

gharrma commented 5 years ago

When there is a finally block, we do emit a catch-all catch i8* null clause for the landingpad to ensure that we always stop at that landing pad. (We use catch i8* null rather than a cleanup clause, because LLVM does not allow cleanup blocks to throw new exceptions---which we must support for Java finally-blocks.)

Perhaps you are right about stopping at every landing pad in the cleanup phase, though. The documentation for the cleanup clause of landingpad says, "This clause means that the landingpad should always be entered." So I had assumed that if there is no cleanup clause (and no catch-all catch i8* null clause), then the stack walker would just skip over it. But I'm realizing that I never checked to verify that! And what you're saying sounds very plausible.

Perhaps the direct call to _Unwind_Resume really is the best solution here.

Dan12 commented 5 years ago

Ok, the rethrow makes sense. Yeah, it looks like the code was written assuming that resume would essentially call _Unwind_Resume but it turns out it doesn't actually do that so it does seem like the right thing to do is just the direct call.

polyglot-compiler / JLang

Exception resume fix #43