taichi-dev / taichi

Productive, portable, and performant GPU programming in Python.
https://taichi-lang.org
Apache License 2.0

[IR] Function definition/function call support in Taichi IR #602

Open archibate opened 4 years ago

archibate commented 4 years ago

Concisely describe the proposed feature

I would like to add real function support, so that inlining no longer spams the IR and wastes space, and so that recursion can finally be supported.

Describe the solution you'd like (if any)

  1. Add FuncBodyStmt and its parser; it has is_container_stmt = true and contains a scope/block.
  2. In the OpenGL backend: emit a GLSL function in visit(FuncBodyStmt).
  3. Add FuncCallStmt, then implement calls in the OpenGL backend.
  4. In the LLVM backend: use llvm::Function for support; this may be harder than the OpenGL case.
  5. Add FuncReturnStmt for them and support return.
  6. Add ti.inline/ti.noinline decorators; we may also detect whether a function is better inlined.

Additional comments

I'm giving up on #543; it started from stage 4. This issue is also related to #536.
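A rough, purely illustrative sketch of what such statement nodes could look like (generic C++; these are simplified stand-ins, not the actual Taichi IR classes):

// Illustrative sketch only; not the actual Taichi IR.
#include <memory>
#include <string>
#include <vector>

struct Stmt {
  virtual ~Stmt() = default;
  virtual bool is_container_statement() const { return false; }
};

// A Block owns a list of statements (a scope), like the body held by WhileStmt.
struct Block {
  std::vector<std::unique_ptr<Stmt>> statements;
};

// Step 1: a function definition is a container statement holding its body.
struct FuncBodyStmt : Stmt {
  std::string funcid;           // identifies the function being defined
  std::unique_ptr<Block> body;  // the function's own scope/block
  bool is_container_statement() const override { return true; }
};

// Step 3: a call site refers to the definition by id and carries its arguments.
struct FuncCallStmt : Stmt {
  std::string funcid;        // resolved to a FuncBodyStmt by the backend
  std::vector<Stmt *> args;  // raw pointers to the argument statements
};

The backends would then emit a GLSL function (or an llvm::Function) when visiting FuncBodyStmt, and a plain call when visiting FuncCallStmt.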

archibate commented 4 years ago

@yuanming-hu Should we have distinct FrontendFuncBodyStmt and FuncBodyStmt? I found that FrontendWhileStmt differs from WhileStmt because of Expr cond vs. Stmt mask. Or maybe only FrontendFuncCallStmt is needed, since its args[] are Exprs?

yuanming-hu commented 4 years ago

@yuanming-hu Should we have distinct FrontendFuncBodyStmt and FuncBodyStmt? I found that FrontendWhileStmt differs from WhileStmt because of Expr cond vs. Stmt mask. Or maybe only FrontendFuncCallStmt is needed, since its args[] are Exprs?

I think so. We will need FrontendFuncDefStmt and FuncCallExpression for the frontend, and FunctionDefStmt and FuncCallStmt for the middle-end.

Before starting the implementation, we should thoroughly discuss the implementation plan. Introducing functions to the Taichi IR is a lot of work and needs careful consideration.

archibate commented 4 years ago

Great! I'm adding FuncBodyStmt with a std::unique_ptr<Block> body, like WhileStmt has. Is this OK?

archibate commented 4 years ago

Found a GL build error. Fixed after merging with master.

yuanming-hu commented 4 years ago

Describe the solution you'd like (if any)

  1. Add FuncBodyStmt and its parser; it has is_container_stmt = true and contains a scope/block.
  2. In the OpenGL backend: emit a GLSL function in visit(FuncBodyStmt).
  3. Add FuncCallStmt, then implement calls in the OpenGL backend.
  4. In the LLVM backend: use llvm::Function for support; this may be harder than the OpenGL case.
  5. Add FuncReturnStmt for them and support return.
  6. Add ti.inline/ti.noinline decorators; we may also detect whether a function is better inlined.

Additional comments

I'm giving up on #543; it started from stage 4. This issue is also related to #536.

This basically makes sense to me!

Before we introduce the new IR, as you mentioned (https://github.com/taichi-dev/taichi/pull/612#issuecomment-600385473), we'd better consider where to store the definitions of the functions.

Currently, the Taichi IR only has a kernel body (which is a Block). We'd better change this to separate function definitions from the main program.

Premature thoughts

I suggest we create a Module class, with two members:

  • functions (std::vector<std::unique_ptr<Stmt>>) to hold all the function definitions
  • main (std::unique_ptr<Block>) to hold the original kernel body (without function definitions)

New IR to be introduced:

  • FrontendFuncDefStmt
  • FuncCallExpression
    • func (Stmt *): a raw pointer to FrontendFuncDefStmt
    • arguments (std::vector<Expr>): arguments

(after AST lowering, )

Future steps:

More considerations:

archibate commented 4 years ago

I suggest we create a Module class, with two members

  • functions (std::vector<std::unique_ptr<Stmt>>) to hold all the function definitions
  • main (std::unique_ptr<Block>) to hold the original kernel body (without function definitions)

You may also want main to be a std::vector<std::unique_ptr<Stmt>> to support multiple kernels? Perhaps named kernels?

New IR to be introduced:

  • FrontendFuncDefStmt
  • FuncCallExpression
    • func (Stmt *): a raw pointer to FrontendFuncDefStmt
    • arguments (std::vector<Expr>): arguments

Currently I'm only doing FuncCallStmt, so I don't need to worry about return values yet.
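For reference, a minimal sketch of the Module idea being discussed (hypothetical code, not the actual Taichi implementation; the kernels member reflects the multiple-kernel question above):

#include <memory>
#include <vector>

struct Stmt { virtual ~Stmt() = default; };                       // simplified IR statement base
struct Block { std::vector<std::unique_ptr<Stmt>> statements; };  // a scope / kernel body

// Hypothetical Module: separates function definitions from the kernel bodies.
struct Module {
  std::vector<std::unique_ptr<Stmt>> functions;  // all function definitions
  std::vector<std::unique_ptr<Block>> kernels;   // kernel bodies; a vector so one
                                                 // module can hold multiple kernels
};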

yuanming-hu commented 4 years ago

That makes sense - let me think about it.

(Taking a bath before it's too late. Will be back to this in 30 min)

archibate commented 4 years ago

You may also want CompiledFunction implemented in some way:

Plan A: std::map<FuncId, std::unique_ptr<Stmt>> for functions, where FuncId contains the name and argument types, equipped with operator==() and a hash function.

Plan B: std::map<std::string, std::unique_ptr<FuncDefStmt>> for functions, where the std::string key is the name, and a std::map<FuncArgInfo, std::unique_ptr<CompiledFunction>> lives in FuncDefStmt.

Plan C: std::map<FuncId, std::unique_ptr<CompiledFunction>> for functions, where FuncId contains the name and argument types, equipped with operator==() and a hash function.
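A sketch of what the FuncId key in plans A/C might look like (illustrative only; note that an ordered std::map would actually need operator<, while the operator==/hash pair is what a std::unordered_map requires):

#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical key: function name plus argument type names.
struct FuncId {
  std::string name;
  std::vector<std::string> arg_types;

  bool operator==(const FuncId &other) const {
    return name == other.name && arg_types == other.arg_types;
  }
};

// Hash specialization so FuncId can key a std::unordered_map.
namespace std {
template <>
struct hash<FuncId> {
  size_t operator()(const FuncId &f) const {
    size_t h = hash<string>()(f.name);
    for (const auto &t : f.arg_types)
      h = h * 31 + hash<string>()(t);  // simple combine, good enough for a sketch
    return h;
  }
};
}  // namespace std

Plan C would then be something like std::unordered_map<FuncId, std::unique_ptr<CompiledFunction>>.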

yuanming-hu commented 4 years ago

Back. What does CompiledFunction hold exactly?

yuanming-hu commented 4 years ago

I feel like this needs a lot of changes to be made, and therefore its implementation plan must be considered very carefully. Taichi IR is the core of this project, and I believe introducing function support to it is worth the time. I suggest we postpone implementing this until things are mature.

Maybe it's also worth considering finishing https://github.com/taichi-dev/taichi/pull/583 and https://github.com/taichi-dev/taichi/pull/601 first.

archibate commented 4 years ago

Back. What does CompiledFunction hold exactly?

Potentially the source_code for the GL backend.

yuanming-hu commented 4 years ago

Back. What does CompiledFunction hold exactly?

Potentially the source_code for the GL backend.

I see. I guess it's probably easier to re-emit the source code during codegen instead of saving it? I have to sleep now. Let's try to finalize the two PRs mentioned above first - good night!

k-ye commented 4 years ago

I've been following some work on IRs/lambda calculi that support automatic differentiation as a first-class concept (e.g. Relay IR in TVM). This makes me wonder whether we should consider the IR work at a (potentially much) larger scope.

But I'm not sure if my concerns here are valid. @yuanming-hu to comment on this.

A few references that I've found so far (I haven't understood them completely, nor are they close to exhaustive):

yuanming-hu commented 4 years ago

I almost forgot the need for supporting autodiff here. Thanks for reminding me of this. Autodiffing functions wouldn't be too hard, though. Given a function that takes (a, b, c, d) and returns (x, y, z), we just need to compute d(x, y, z)/d(a, b, c, d) using classic reverse-mode autodiff. Note that in Taichi functions, (a, b, c, d, x, y, z) is simply a tuple of scalar values.
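In other words (standard reverse-mode notation, nothing Taichi-specific), the output adjoints are pulled back through the transposed Jacobian:

\begin{pmatrix} \bar a \\ \bar b \\ \bar c \\ \bar d \end{pmatrix}
  = J^{\mathsf T}
\begin{pmatrix} \bar x \\ \bar y \\ \bar z \end{pmatrix},
\qquad
J = \frac{\partial (x, y, z)}{\partial (a, b, c, d)} \in \mathbb{R}^{3 \times 4}.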

  • Such an IR might help us loosen the Kernel Simplicity Rule? Or is this rule coming from the fact that the gradient computation happens across Taichi kernels (i.e. where the Tape is necessary)?

The kernel simplicity rule exists because we currently do not support autodiffing for-loops with mutable local variables that are carried across iterations. For example,

p = 1
for i in range(10):
  p += p

ret[None] = p

https://github.com/taichi-dev/taichi/issues/581 added support for autodiffing mutable variables within if, but the case with for still needs implementation. The kernel simplicity rule is imposed to make sure users do not allocate mutable local variables outside the for loop, so that no mutable local variables are carried across iterations of the for loops being differentiated.

  • Maybe such an IR will also enable us to compute derivatives of arbitrary order?

This can be a completely new research direction :-)

Thanks for the references. They seem very interesting. I did read part of these when I was designing DiffTaichi. I'll take a deeper look if necessary.

MarisaKirisame commented 4 years ago

Relay dev who designed & implemented the AD algorithm here. Arbitrary-order derivatives are trivial as long as the output of AD can be fed back as the input of AD. One thing Taichi should consider, if it wants higher-order AD, is to support forward-mode AD, as the forward-over-reverse pattern (do reverse mode for the first order, then forward-mode AD for higher orders) is more efficient than doing reverse mode multiple times. The Hessian-vector product (which also requires forward-mode AD) is also useful, as it is algorithmically more efficient than computing the full Hessian. (TBH I am just subscribing myself to this thread, lol)

k-ye commented 4 years ago

Thanks!

Could you also briefly explain the purpose of Let? I'm under the impression that Let converts the IR into ANF, which then enables the transformation operator (backpropagator) to work on closures or arbitrary functions?

I guess Taichi doesn't really need to support closures (in the short term). On the other hand, a closure at the Taichi level is not the same thing as one at the IR level. Would this support be a requirement for the transformation operator to handle function defs or higher-order gradients? That is, maybe the operator would produce closures during the IR transformation?

yuanming-hu commented 4 years ago

Relay dev who designed & implemented the AD algorithm here. Arbitrary-order derivatives are trivial as long as the output of AD can be fed back as the input of AD. (TBH I am just subscribing myself to this thread, lol)

Welcome! :-) I agree it is trivial as long as our make_adjoint pass (reverse-mode AD) can take its own output as input. However, this is not the case right now in the Taichi IR. Compared to a more functional IR system designed for deep learning (Relay being a representative example; correct me if I misunderstand anything), the Taichi IR has more branching and even mutable local variables. We also plan to support auto-diffing while loops. To differentiate general SSA programs with control flow, we have to keep the computational history of the mutable local variables using stacks (following https://arxiv.org/pdf/1810.07951.pdf).

This means stack instructions will be inserted into the IR after autodiff. However, currently make_adjoint does not process the stack instructions (which belong to the output of the previous autodiff pass) correctly.
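Purely as an illustration of the stack idea (generic C++, not actual Taichi codegen), the p += p loop above could be recorded and replayed like this:

// Sketch: keep the history of a mutable local variable on a stack so the
// reverse (adjoint) pass can walk the loop backwards.
#include <cstdio>
#include <vector>

int main() {
  std::vector<double> history;  // the AD stack for p

  // Forward pass: push p before each mutation.
  double p = 1.0;
  for (int i = 0; i < 10; i++) {
    history.push_back(p);
    p += p;  // p_{i+1} = 2 * p_i
  }

  // Reverse pass: pop the saved values and accumulate the adjoint of p.
  double p_adj = 1.0;  // d(ret)/d(p_final)
  while (!history.empty()) {
    history.pop_back();  // the saved value is unused here because p += p is
                         // linear; a nonlinear update would need it for the
                         // local derivative
    p_adj *= 2.0;        // chain rule through p_{i+1} = 2 * p_i
  }
  std::printf("d(ret)/d(p_0) = %f\n", p_adj);  // 1024
  return 0;
}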

One thing Taichi should consider, if it wants higher-order AD, is to support forward-mode AD, as the forward-over-reverse pattern (do reverse mode for the first order, then forward-mode AD for higher orders) is more efficient than doing reverse mode multiple times.

This is interesting - are you talking about evaluating the contraction of the n-th order gradient of L (which is an n-th order tensor) with an (n-1)-th order tensor? If so, I agree that we can use n-1 forward-mode AD passes plus 1 reverse-mode AD pass. I also agree that forward-mode AD is easier & cheaper than reverse-mode AD here.

For n=2 (the Hessian-vector product as you mentioned, which I believe is the most common case), I think we can do either forward + reverse or reverse + forward.
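For reference, the standard reverse-then-forward identity for the Hessian-vector product (a textbook fact, not Taichi-specific):

H v \;=\; \left. \frac{d}{d\varepsilon} \, \nabla_{\theta} L(\theta + \varepsilon v) \right|_{\varepsilon = 0},

i.e. take the reverse-mode gradient \nabla_{\theta} L first, then differentiate it once in the direction v with forward mode.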

The Hessian-vector product (which also requires forward-mode AD) is also useful, as it is algorithmically more efficient than computing the full Hessian.

Right, I think for some problems Taichi is trying to solve, we won't even have the space to store the Hessian matrix with O(#DOF^2) entries. After implementing this, users can always assemble the whole Hessian from the columns provided by our Hessian-vector autodiff.

Thanks for the discussion here. I think one possibility is to implement a forward-mode AD pass so that we can do Hessian-vector products, which will enable the use of second-order optimizers.

MarisaKirisame commented 4 years ago

The biggest difference between Taichi and Relay is that they are not even on the same level. Relay does not care about kernels - it treats all kernels as black boxes and relies on TVM to optimize them. So Relay supports diffing mutable variables and branches just like PyTorch.

However, differentiating TVM kernels is hard. There are some PRs working on it, but right now we just do it by hand.

If you want to support more kernels, I highly recommend reading the relevant chapter in Evaluating Derivatives. It covers way more cases (with more optimizations) than the Zygote paper. The Halide autodiff paper is good too.

For the Hessian-vector product, reverse-then-forward makes more sense, as a function takes multiple inputs and has a single output, which makes it amenable to reverse mode.

For higher orders it is essentially a Hessian-vector product too, but when doing forward-mode AD, instead of the classical dual number, the right-hand side is a Taylor series truncated to n-th order (the classical dual number is just the right-hand side truncated to first order).
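In symbols (standard notation, not Taichi-specific): a classical dual number propagates a first-order truncation, while Taylor-mode forward AD carries the series up to order n:

f(a + a' \varepsilon) = f(a) + f'(a)\, a'\, \varepsilon, \qquad \varepsilon^{2} = 0 \quad \text{(first order)};

x = x_{0} + x_{1} \varepsilon + x_{2} \varepsilon^{2} + \cdots + x_{n} \varepsilon^{n}, \qquad \varepsilon^{\,n+1} = 0 \quad \text{(order } n \text{)}.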

lin-hitonami commented 2 years ago

Currently, the experimental real function in Taichi just delays the inlining to C++, so it's not a real real function. I am now working on making it real. I will start with the LLVM backend, and I will use this issue to track the progress.

Here's what I plan to do:

Future steps:

There are also some problems to be discussed:

  1. How should we determine the outermost for?

    In the current Taichi, we parallelize the outermost for after inlining. That is, if the outermost for is inside a function, we are still able to parallelize it. However, if we implement real functions, the caller and callee do not have information about each other, and it is hard to determine whether we should parallelize a for.

    1. A simple way is: only parallelize the outermost for in a kernel, and never parallelize fors in a function.
    2. A more complicated way is: generate two versions of a function, one that parallelizes the outermost for and one that does not. The codegen maintains a status on whether a for can be parallelized in the current scope, and calls one of the two versions.

    I vote for [i], because it also makes it easier for users to determine whether a for will be parallelized.

  2. Should we use a different decorator (like @ti.real_func) for the real function?

    When implementing real functions, we may add limits on arguments and return values. We may also change the behaviour of fors, as in the simple way above. These limits and changes may break some user code, so I think maybe we should use a different decorator than @ti.func.

TODOs:

k-ye commented 2 years ago

It's great to see this finally being picked up again!

I vote for [i], because it also makes it easier for users to determine whether a for will be parallelized.

+1

Should we use a different decorator (like @ti.real_func) for the real function?

I think it's OK to have an experimental one, but not at the top level. Either ti.wip.real_func or ti._real_func sounds fine to me. However, once this is stable enough to be released, I do think that we should have a single decorator. Inlining or not is just one aspect of controlling how a function is generated; there could very well be other aspects. We might have a "function decorator" for this purpose. For example,

@ti.func_config(inline=True, opt_level=2)
@ti.func
def my_awesome_func():
  ...

cc: @bsavery

bsavery commented 2 years ago

I'd be very happy with [i]: only the kernel level being parallel.