terralang / terra

Terra is a low-level system programming language that is embedded in and meta-programmed by the Lua programming language.
terralang.org
Other
2.72k stars 201 forks source link

Compile time blows up with large array #427

Closed elliottslaughter closed 4 years ago

elliottslaughter commented 4 years ago

In the Regent project we recently found a case where compile times were blowing up with the use of a large array (https://github.com/StanfordLegion/legion/issues/724). Minimized and stripped of all Regent-specific logic, the reproducer is at the bottom of this post.

The fun part is, as best I can tell every part of this reproducer is necessary. The code listed below takes about 2.5 minutes to compile. But if you remove the dereference of buffer, it compiles instantly. If you remove the return it compiles instantly. If you don't assert_() it compiles instantly. If you assert_:setinlined(false) it compiles instantly.

The key seems to be doing a computation on the pointer passed in, and branching on it, in the same function as the pointer itself gets dereferenced (and returned). Presumably we're tripping over SROA in LLVM, or something like that.

local c = terralib.includecstring([[
#include <stdio.h>
#include <stdlib.h>
]])

struct anon100 {
  a : int[50000],
}

terra assert_(x : bool)
  if not x then
    c.printf("assertion failed\n")
  end
end

terra unpack_param(buffer : &opaque,
                   buffer_size : uint64) : int32[50000]
  var data_ptr1 : &uint8 = [&uint8](buffer) + [terralib.sizeof(int32[50000])]
  var result : int32[50000] = @[&int32[50000]](buffer)
  assert_(buffer_size == [uint64](data_ptr1 - [&uint8](buffer)))
  return result
end

unpack_param:setinlined(false)
print("about to compile unpack_param")
start_t = os.time()
unpack_param:compile()
stop_t = os.time()
print("finished compile unpack_param in " .. tostring(stop_t - start_t) .. " seconds")
elliottslaughter commented 4 years ago

A little bit of very unscientific debugging. Here are some backtraces from when I run. The main culprits seem to be SelectionDAGISel and StackSlotColoring in LLVM.

seemamirch commented 4 years ago

The LLVM IR generated by terra has a store op i.e. "store [50000 x i32] %4, [50000 x i32]* %0, align 4" -> store 50,000 i32 values. (and a similar load op) that is causing a large # of scalar store/move instructions to be generated and hence large compile times. A dump of the LLVM IR after each codegen pass (llc unpack_param.ll -print-after-all) points to the Expand ISel Pseudo-instructions pass that ends up generating these instructions. On the terra side, it could generate memcpy instructions to avoid the compilation issue - changing the LLVM IR manually resulted in a significant drop in compilation time.

elliottslaughter commented 4 years ago

This is fixed in #430 for LLVM versions >= 6, and partially fixed for < 6. The behavior with < 6 is a little odd, compile times are instantaneous when using saveobj, but still slow (though less bad) with :compile(). I'm inclined to accept this as success and move on. But it does mean that for the best compile performance, >= 6 is recommended.