archibate opened this issue 4 years ago
@yuanming-hu Should we have distinct FrontendFuncBodyStmt and FuncBodyStmt? I found FrontendWhileStmt differs from WhileStmt because: `Expr cond` vs `Stmt mask`. Or maybe only FrontendFuncCallStmt is needed? Since `Expr args[]`.
> @yuanming-hu Should we have distinct FrontendFuncBodyStmt and FuncBodyStmt? I found FrontendWhileStmt differs from WhileStmt because: `Expr cond` vs `Stmt mask`. Or maybe only FrontendFuncCallStmt is needed? Since `Expr args[]`.
I think so. We will need `FrontendFuncDefStmt` and `FuncCallExpression` for the frontend, and `FunctionDefStmt` and `FuncCallStmt` for the middle-end.
Before starting the implementation, we should thoroughly discuss implementation plans. Introducing functions to Taichi IR is a lot of work and needs careful consideration.
Great! I'm adding FuncBodyStmt, with `std::unique_ptr<Block> body` like WhileStmt has. Is this OK?
Found a GL build error. Fixed after merging with master.
Describe the solution you'd like (if any)
- Add FuncBodyStmt & its parser, whose `is_container_stmt = true`, and which contains a scope/block (a rough sketch appears below).
- In the OpenGL backend: emit a GLSL function on `visit(FuncBodyStmt)`.
- Add FuncCallStmt, then implement calls in the OpenGL backend.
- In the LLVM backend: use `llvm::Function` for support; this can be harder than the GL one.
- Add FuncReturnStmt for them and support `return`.
- Add a `ti.inline`/`ti.noinline` decorator; we may also detect whether a function is better inlined.

Additional comments
I'm giving up #543; it begins from stage 4. This issue is also related to #536.
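For illustration only, a minimal C++ sketch of the data the FuncBodyStmt and FuncCallStmt bullets describe. `Stmt` and `Block` here are bare stand-ins, not Taichi's real IR base classes, the member names are guesses based on the plan above, and `is_container_stmt` is modeled as a virtual hook (it could equally be a plain bool flag).

```cpp
#include <memory>
#include <string>
#include <vector>

// Stand-ins for Taichi's real IR types, just so the sketch is self-contained.
struct Stmt {
  virtual ~Stmt() = default;
  virtual bool is_container_stmt() const { return false; }
};
struct Block { std::vector<std::unique_ptr<Stmt>> statements; };

// A function body owns its own scope, much like WhileStmt owns its loop body.
struct FuncBodyStmt : Stmt {
  std::string funcid;            // hypothetical identifier for the function
  std::unique_ptr<Block> body;   // the contained scope/block
  bool is_container_stmt() const override { return true; }
};

// A call site references the definition and carries its (lowered) arguments.
struct FuncCallStmt : Stmt {
  Stmt *func = nullptr;          // pointer to the corresponding FuncBodyStmt
  std::vector<Stmt *> args;      // input arguments
};
```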
This basically makes sense to me!
Before we introduce the new IR, as you mentioned (https://github.com/taichi-dev/taichi/pull/612#issuecomment-600385473), we'd better consider where to store the definitions of the functions.
Currently, the Taichi IR only has a kernel body (which is a `Block`). We'd better change this to separate function definitions from the main program.
I suggest we create a `Module` class, with two members (a rough C++ sketch follows the lists below):
- `functions` (`std::vector<std::unique_ptr<Stmt>>`) to hold all the function definitions
- `main` (`std::unique_ptr<Block>`) to hold the original kernel body (without function definitions)

New IR to be introduced:
- `FrontendFuncDefStmt`
- `FrontendFuncRetStmt`
- `FuncCallExpression`
  - `func (Stmt *)`: a raw pointer to the FrontendFuncDefStmt
  - `arguments (std::vector<Expr>)`: arguments

After AST lowering:
- `FuncDefStmt`
- `FuncRetStmt`
- `FuncCallStmt`
  - `func (Stmt *)`: a raw pointer to the FunctionDefStmt
  - `arguments (std::vector<Stmt *>)`: input arguments

Future steps:
- `ti.Vector`/`ti.Matrix`
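Here is the rough `Module` sketch referred to above; only the two member names come from the suggestion itself, and the `Stmt`/`Block` types are the same stand-ins as in the earlier sketch.

```cpp
#include <memory>
#include <vector>

// Reuses the stand-in Stmt and Block definitions from the earlier sketch.
struct Module {
  // All function definitions, e.g. FuncDefStmt nodes after lowering.
  std::vector<std::unique_ptr<Stmt>> functions;
  // The original kernel body, with the function definitions stripped out.
  std::unique_ptr<Block> main;
};
```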
More considerations:
> I suggest we create a `Module` class, with two members:
> - `functions` (`std::vector<std::unique_ptr<Stmt>>`) to hold all the function definitions
> - `main` (`std::unique_ptr<Block>`) to hold the original kernel body (without function definitions)
You may also want `main` to be a `std::vector<std::unique_ptr<Stmt>>` for multiple kernels? Namely `kernels`?
> New IR to be introduced:
> - `FrontendFuncDefStmt`
> - `FuncCallExpression`
>   - `func (Stmt *)`: a raw pointer to FrontendFuncDefStmt
>   - `arguments (std::vector<Expr>)`: arguments
Currently I'm only doing FuncCallStmt, so no return value is worrying me.
That makes sense - let me think about it.
(Taking a bath before it's too late. Will be back to this in 30 min)
You also want `CompiledFunction` implemented in some way:
Plan A: `std::map<FuncId, std::unique_ptr<Stmt>>` for functions, where `FuncId` contains `name` and `argtypes`, with `operator==()` and `__hash__()` equipped.
Plan B: `std::map<std::string, std::unique_ptr<FuncDefStmt>>` for functions, where the `std::string` is the `name`, and a `std::map<FuncArgInfo, std::unique_ptr<CompiledFunction>>` lives in `FuncDefStmt`.
Plan C: `std::map<FuncId, std::unique_ptr<CompiledFunction>>` for functions, where `FuncId` contains `name` and `argtypes`, with `operator==()` and `__hash__()` equipped.
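For instance, plan C might look roughly like this. `DataType`, the hash combiner, and `CompiledFunction` are placeholders, and since `std::map` needs an ordering rather than a hash, the sketch uses `std::unordered_map` instead (with `std::map` you would supply `operator<`).

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

using DataType = int;      // placeholder for the real argument-type descriptor
struct CompiledFunction;   // whatever the backend needs; see the discussion below

struct FuncId {
  std::string name;
  std::vector<DataType> argtypes;
  bool operator==(const FuncId &o) const {
    return name == o.name && argtypes == o.argtypes;
  }
};

// The "__hash__()" from the plan, spelled as a hash functor in C++.
struct FuncIdHash {
  std::size_t operator()(const FuncId &f) const {
    std::size_t h = std::hash<std::string>{}(f.name);
    for (DataType t : f.argtypes)
      h ^= std::hash<DataType>{}(t) + 0x9e3779b9 + (h << 6) + (h >> 2);
    return h;
  }
};

using FunctionTable =
    std::unordered_map<FuncId, std::unique_ptr<CompiledFunction>, FuncIdHash>;
```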
Back. What does `CompiledFunction` hold exactly?
I feel like this needs a lot of changes, and therefore its implementation plan must be considered very carefully. Taichi IR is the core of this project, and I believe introducing function support to it is worth the time. I suggest we postpone implementing this until things are mature.
Maybe it's also worth considering finishing https://github.com/taichi-dev/taichi/pull/583 and https://github.com/taichi-dev/taichi/pull/601 first.
> Back. What does CompiledFunction hold exactly?

Potentially `source_code` for the GL backend.
> Back. What does CompiledFunction hold exactly?
>
> Potentially `source_code` for the GL backend.
I see. I guess probably it's easier to re-emit the source code during code-gen instead of saving it? I have to sleep now. Let's try to finalize the two PRs mentioned above first - Good night!
I've been following some works regarding IRs/lambda calculus supporting automatic differentiation as first-class concepts (e.g. Relay IR in TVM). This makes me wonder if we should consider the IR work at a (potentially much) larger scope.
- Such an IR might help us loosen the Kernel Simplicity Rule? Or is this rule coming from the fact that the gradient computation happens across the Taichi kernels (i.e. where the `Tape` is necessary)?
- Maybe such an IR will also enable us to compute derivatives of arbitrary order?

But I'm not sure if my concerns here are valid. @yuanming-hu to comment on this.
A few references that I've found so far (I haven't understood them completely, nor are they close to being exhaustive):
I almost forgot the need for supporting autodiff here. Thanks for reminding me of this. Autodiffing functions wouldn't be too hard though. Given a function that takes `(a, b, c, d)` and returns `(x, y, z)`, we just need to compute `d(x, y, z)/d(a, b, c, d)` using classic reverse-mode autodiff. Note that in Taichi functions `(a, b, c, d, x, y, z)` is simply a tuple of scalar values.
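For reference, a single reverse sweep seeded with output adjoints $(\bar x, \bar y, \bar z)$ produces the input adjoints, e.g.

$$\bar a = \bar x\,\frac{\partial x}{\partial a} + \bar y\,\frac{\partial y}{\partial a} + \bar z\,\frac{\partial z}{\partial a},$$

and likewise for $\bar b, \bar c, \bar d$; running the sweep with the three unit seeds recovers the full Jacobian $d(x, y, z)/d(a, b, c, d)$.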
> Such an IR might help us loosen the Kernel Simplicity Rule? Or is this rule coming from the fact that the gradient computation happens across the Taichi kernels (i.e. where the `Tape` is necessary)?
The kernel simplicity rule exists because we currently do not support autodiffing for-loops with mutable local variables that are carried through iterations. For example,

```python
p = 1
for i in range(10):
    p += p
ret[None] = p
```
https://github.com/taichi-dev/taichi/issues/581 added support for autodiffing mutable variables within `if`, but the case with `for` needs implementation. The kernel simplicity rule is imposed to make sure users do not allocate mutable local vars outside the for loop, so that there are no mutable local vars carried through the for loops to be differentiated.
> Maybe such an IR will also enable us to compute derivatives of arbitrary order?
This can be a completely new research direction :-)
Thanks for the references. They seem very interesting. I did read part of these when I was designing DiffTaichi. I'll take a deeper look if necessary.
Relay dev who designed & implemented the AD algorithm here. Arbitrary order of derivative is trivial as long as the output of AD can be fed back as the input of AD. One thing Taichi should consider, if it wants higher-order AD, is to support forward-mode AD, as the forward-over-reverse pattern (do reverse for first order, then do forward-mode AD for higher order) is more efficient than doing reverse multiple times. Hessian-vector products (which also require forward-mode AD) are also useful, as they are algorithmically more efficient than computing the Hessian. (TBH I am just subscribing myself to this thread, lol)
Thanks!
Could you also briefly explain the purpose of `Let`? I'm under the impression that `Let` converts the IR to ANF form, which then enables the transformation operator (backpropagator) to work on closures or arbitrary functions?
I guess Taichi doesn't really need to support closures (in the short term). On the other hand, a closure at the Taichi level is not the same thing as one at the IR level. Would this support be a requirement for the transformation operator to support function defs or higher-order gradients? That is, maybe the operator would produce closures during the IR transformation?
> Relay dev who designed & implemented the AD algorithm here. Arbitrary order of derivative is trivial as long as the output of AD can be fed back as the input of AD. (TBH I am just subscribing myself to this thread, lol)
Welcome! :-) I agree it is trivial as long as our `make_adjoint` pass (reverse-mode AD) can take its own output as input. However, this is not the case right now in Taichi IR. Compared to a more functional IR system designed for deep learning (Relay as a representative example; correct me if I misunderstand anything), Taichi IR has more branching and even mutable local variables. We also plan to support auto-diffing `while` loops. To differentiate general SSA programs with control flow, we have to keep the computational history of the mutable local variables using stacks (following https://arxiv.org/pdf/1810.07951.pdf).
This means stack instructions will be inserted into the IR after autodiff. However, currently `make_adjoint` does not process the stack instructions (which belong to the output of the previous autodiff) correctly.
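As a small aside, here is a self-contained illustration (plain C++, not Taichi IR, and not the actual `make_adjoint` output) of why the stack of historical values is needed when a loop-carried variable is mutated:

```cpp
#include <cstdio>
#include <stack>

int main() {
  const int n = 4;
  double p = 1.5;                // mutable local variable carried through the loop
  std::stack<double> history;    // what the inserted stack instructions would hold

  // Primal pass: p <- p * p each iteration, saving the overwritten value first.
  for (int i = 0; i < n; i++) {
    history.push(p);
    p = p * p;
  }

  // Adjoint pass: seed d(out)/dp = 1 at the end, then walk the loop backwards.
  double adj_p = 1.0;
  for (int i = 0; i < n; i++) {
    double old_p = history.top();  // value p had before this iteration
    history.pop();
    adj_p *= 2.0 * old_p;          // d(p*p)/dp = 2p, applied via the chain rule
  }
  std::printf("d(out)/d(in) = %f\n", adj_p);  // equals 16 * 1.5^15 here
  return 0;
}
```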
> One thing Taichi should consider, if it wants higher-order AD, is to support forward-mode AD, as the forward-over-reverse pattern (do reverse for first order, then do forward-mode AD for higher order) is more efficient than doing reverse multiple times.
This is interesting - are you talking about evaluating the contraction of the `n`-th order gradient of L (which is an `n`-th order tensor) with an `(n-1)`-th order tensor? If so, I agree that we can use `n-1` forward-mode AD passes plus one reverse-mode AD pass. I also agree that forward-mode AD is easier & cheaper than reverse-mode AD here.
For `n=2` (Hessian-vector product as you mentioned, which I believe is the most common case), I think we can do either forward + reverse or reverse + forward.
> Hessian-vector products (which also require forward-mode AD) are also useful, as they are algorithmically more efficient than computing the Hessian.
Right, I think for some problems Taichi is trying to solve, we won't even have the space to store the Hessian matrix with `O(#DOF^2)` entries. After implementing this, users can always assemble the whole Hessian from its columns provided by our Hessian-vector autodiff.
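For concreteness, the Hessian-vector product under discussion is just a directional derivative of the gradient, which is what one forward-mode pass applied over a reverse-mode pass computes:

$$Hv = \nabla^2 L(x)\,v = \left.\frac{\partial}{\partial \epsilon}\,\nabla L(x + \epsilon v)\right|_{\epsilon=0};$$

the reverse-over-forward variant instead differentiates the scalar $\nabla L(x) \cdot v$ with respect to $x$.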
Thanks for the discussion here. I think one possibility is to implement a forward-mode AD pass so that we can do Hessian-vector products, which will enable the use of second-order optimizers.
The biggest difference between Taichi and Relay is that they are not even on the same level. Relay does not care about kernels - it treats all kernels as black boxes and relies on TVM to optimize them. So Relay supports diffing mutable variables and branches, just like PyTorch.
However, differentiating TVM kernels is hard. There are some PRs working on it, but right now we just do it by hand.
If you want to support more kernels, I highly recommend reading the relevant chapter in Evaluating Derivatives. It covers way more cases (with more optimizations) than the Zygote paper. The Halide autodiff paper is good too.
For Hessian-vector products, reverse-then-forward makes more sense, as a function takes multiple inputs and has a single output, which makes it amenable to reverse mode.
For higher order it is essentially a Hessian-vector product too, but when doing the forward-mode AD, instead of the classical dual number, the right-hand side is a Taylor series truncated to n-th order (the classical dual number is just the right-hand side truncated to first order).
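In formula form (my paraphrase of the comment above), the forward-mode carrier changes from the dual number $a_0 + a_1\epsilon$ with $\epsilon^2 = 0$ to a truncated Taylor polynomial

$$a_0 + a_1\epsilon + a_2\epsilon^2 + \cdots + a_n\epsilon^n, \qquad \epsilon^{n+1} = 0,$$

so one forward pass propagates all derivatives up to order $n$ along the chosen direction.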
Currently, the experimental real function in Taichi just delays the inlining to the C++ side, so it's not really a "real" function yet. I am now working on making it real. I will start with the LLVM backend, and I will use this issue to track the progress.
Here's what I plan to do:
- Generate a real function for a `ti.func` instead of inlining it.

Future steps:
There are also some problems to be discussed:
How should we determine the outermost `for`?
In current Taichi, we parallelize the outermost `for` after inlining. That is, if the outermost `for` is inside a function, we are able to parallelize it. However, if we implement real functions, the caller and callee do not have information about each other, and it is hard to determine whether we should parallelize a `for`.
Two possible ways:
i. Only parallelize the outermost `for` in a kernel, and never parallelize `for`s in a function.
ii. Compile two versions of each function: one parallelizing the outermost `for`, and one not parallelizing. The codegen maintains a status on whether a `for` can be parallelized in the current scope, and makes calls to one of the two versions.

I vote for [i], because it also makes it easier for users to determine whether a `for` will be parallelized.
Should we use a different decorator (like `@ti.real_func`) for the real function?
When implementing real functions, we may add limits on arguments and return values. We may also change the behaviour of `for`s, like the simple way above. These limits and changes may break some user code, so I think maybe we should use a different decorator than `@ti.func`.
TODOs:
It's great to see this finally being picked up again!
> I vote for [i], because it also makes it easier for users to determine whether a `for` will be parallelized.
+1
> Should we use a different decorator (like @ti.real_func) for the real function?
I think it's OK to have an experimental one, but not at the top level. Either `ti.wip.real_func` or `ti._real_func` sounds fine to me. However, once this is stable enough to be released, I do think that we should have a single decorator. Inlining or not is just one aspect of controlling how a function is generated; there could very well be other aspects. We might have a "function decorator" for this purpose. For example,

```python
@ti.func_config(inline=True, opt_level=2)
@ti.func
def my_awesome_func():
    ...
```
cc: @bsavery
I'd be very happy with [i]: only the kernel level being parallel.
Concisely describe the proposed feature
I would like to add real function support, so that IR spam no longer wastes space and we finally support recursion.