Closed awav closed 4 years ago
Okay, looks like it is my misunderstanding of the `@differentiable` attribute. The tuple returns a function for the gradient, and the first element is simply the forward step.
Feel free to close the issue if that's the case :)
> Okay, looks like it is my misunderstanding of the `@differentiable` attribute. The tuple returns a function for the gradient and the first element is simply the forward step.
This is accurate!
// Given a function type `(T0, ...) -> U`
// (where `T0`, ..., `U` conform to the `Differentiable` protocol †),
//
// The VJP function typing rules are:
//
// (T0, ...) -> (U, (U.TangentVector) -> (T0.TangentVector, ...))
// ^ ^ ^~~~~~~~~~~~~~~ ^~~~~~~~~~~~~~~~~~~~~
// original args result derivative wrt result derivative wrt args
//
// The derivative function returns a tuple of:
// - The original result of type `U`.
// - A "backpropagator" pullback function that takes the derivative with
//   respect to the result and returns derivatives with respect to the arguments.
//
// †: only "wrt" parameters need to conform to `Differentiable`.
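As a minimal sketch of this typing rule (the function `vjpSquare` below is a hypothetical example for illustration, not part of swift-apis):

```swift
// A hand-written VJP for `square: (Double) -> Double`, following the
// typing rule above: (Double) -> (Double, (Double) -> Double).
func vjpSquare(_ x: Double) -> (value: Double, pullback: (Double) -> Double) {
    let value = x * x
    // The pullback captures `x` and maps a cotangent of the result to a
    // cotangent of the argument: v ↦ 2 * x * v.
    return (value, { v in 2 * x * v })
}

let (y, pb) = vjpSquare(3.0)
// y == 9.0, pb(1.0) == 6.0
```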
Derivative functions are defined as VJP functions taking original arguments rather than just the "returned pullback" functions. This allows the pullback function to capture and use intermediate values computed in the original function.
That's exactly what happens for `cholesky`: `_vjpCholesky` computes `let decomposition = cholesky(x)`, and the pullback closure captures and uses `decomposition` rather than recomputing it.
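The same capture pattern can be sketched on a scalar function (the actual `_vjpCholesky` lives in swift-apis; `vjpSqrt` here is a hypothetical analogue for illustration). Since the gradient of `sqrt(x)` is `1 / (2 * sqrt(x))`, reusing the forward value avoids recomputing the square root in the backward pass:

```swift
// Hypothetical VJP following the same pattern as `_vjpCholesky`:
// compute the forward result once and let the pullback capture it.
func vjpSqrt(_ x: Double) -> (value: Double, pullback: (Double) -> Double) {
    let value = x.squareRoot()                // forward step, computed once
    return (value, { v in v / (2 * value) })  // pullback captures `value`
}

let (v, pb) = vjpSqrt(4.0)
// v == 2.0, pb(1.0) == 0.25
```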
Please read the "JVP and VJP functions" section from bit.ly/swift-autodiff-internals for more details!
gosh, I love that...
If you're interested in more cool differentiation stuff, check out our custom differentiation tutorial:
Other cool features are on the roadmap, e.g. registering derivatives for protocol requirements like `AdditiveArithmetic.+` or `ElementaryFunctions.exp`, which would then apply to all conforming types of those protocols!
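For a concrete flavor of custom derivative registration, here is a sketch using the `@derivative(of:)` attribute (this requires a toolchain with `import _Differentiation` support; `cube` and `vjpCube` are hypothetical names for illustration):

```swift
import _Differentiation

// Registering a custom derivative for an existing function.
func cube(_ x: Double) -> Double { x * x * x }

@derivative(of: cube)
func vjpCube(_ x: Double) -> (value: Double, pullback: (Double) -> Double) {
    // d/dx x³ = 3x²; the pullback captures `x` from the forward pass.
    (cube(x), { v in 3 * x * x * v })
}

// gradient(of: cube)(2.0) == 12.0
```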
Hello everyone,
I have a concern about this line: https://github.com/tensorflow/swift-apis/blob/master/Sources/TensorFlow/Operators/Math.swift#L2712. In TensorFlow for Python, gradients have access to the input and output of the operation, so there is no need to recompute A = LLᵀ. The gradient signature normally looks like `gradient(op, grad)`, where `op` has attributes like `input` and `output`. But here, the vector-Jacobian product has to recompute the decomposition one more time. Can this be avoided somehow?
PS: Actually, there are numerous examples where the forward-step value is needed for the gradient, and reusing it rather than recomputing it matters for computationally efficient gradients in general.
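A classic instance of this point is `exp`, whose derivative equals its own forward value; a pullback that captures the forward result does no extra work (`vjpExp` below is a hypothetical sketch, not library code):

```swift
import Foundation

// exp is the canonical case: d/dx exp(x) = exp(x), so the pullback
// simply reuses the captured forward value instead of recomputing it.
func vjpExp(_ x: Double) -> (value: Double, pullback: (Double) -> Double) {
    let value = exp(x)                 // forward step, computed once
    return (value, { v in v * value }) // pullback captures `value`
}

let (v, pb) = vjpExp(0.0)
// v == 1.0, pb(1.0) == 1.0
```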