invenia / Nabla.jl

An operator-overloading, tape-based, reverse-mode AD

Future Development #81

Open willtebbutt opened 6 years ago

willtebbutt commented 6 years ago

In the medium-to-long term, it makes little sense to continue to support a tape-based AD package when there is excellent work going on elsewhere in the Julia community; see Capstan, which is built on Cassette, is intended to be the go-to package for AD in Julia, and is something like an extension of the functionality offered separately in ForwardDiff and ReverseDiff.

Nabla makes sense from our perspective in a world where ReverseDiff.jl is the other serious option, primarily because ReverseDiff.jl doesn't support the linear algebra optimisations that we require for our GP work and it's a pain to add them. However, the interface for defining custom sensitivities in Capstan is planned to be quite straightforward. This change in the Julia AD landscape can allow us to get the best of both worlds: a fully-featured tape-based AD package (the nuts-and-bolts of which we don't have to maintain) to which we can contribute the linear algebra optimisations that we require.

To this end I propose the following:

  1. Maintain Nabla.jl only until Capstan.jl is up and running and we have transitioned over to 1.0. At that point we will obviously keep Nabla around to support legacy code that we don't decide to port from 0.6 to 1.0, but the best course of action for any differentiable code will be to port it to 1.0.
  2. Separately package the finite differencing functionality that @wesselb developed. This is useful for differentiating code written in another language that native Julia AD can't handle, for unit / integration testing of AD packages, or for checking that the derivatives an AD package computes for your code are correct when you suspect something is going wrong in a particular case. There already exists FiniteDiff, which I believe was developed to isolate some of the functionality from Calculus.jl, but it has less functionality than we have in Wessel's work; in particular, Wessel added support for higher-order finite differencing, which can yield much higher accuracy, and a more sophisticated way of determining the step size parameter. FiniteDiff.jl also doesn't support the computation of directional derivatives for vector-valued functions, which is in some sense a more natural thing to do with finite differences (see the sketch after this list). Furthermore, we've added a bunch of functionality that is particularly useful if you're using finite differencing to unit test AD packages. Conversely, FiniteDiff does support the computation of gradients and Hessians, whereas we only provide that functionality implicitly via the ability to compute directional derivatives. However, extending Wessel's work to do the same is really just a matter of adding some convenience functions that wrap our existing functionality, and would be very quick to do.
  3. Separately package our linear algebra optimisations (see pretty much everything in here). The idea would be to create the linear algebra counterpart to DiffRules.jl, which only supports functions of scalars.
  4. Import the packages from 2 and 3 into Nabla, and remove existing code.
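To make the directional-derivative functionality in point 2 concrete, here is a minimal sketch of my own (a fixed-step illustration, not the actual API of the new package) of a higher-order central-difference estimate of d/dt f(x + t v) at t = 0 for a vector-valued f:

# 4th-order central difference of the directional derivative of f at x in direction v.
function directional_derivative(f, x, v; h=1e-3)
    coeffs  = (1/12, -2/3, 2/3, -1/12)   # standard 4th-order central coefficients
    offsets = (-2, -1, 1, 2)
    return sum(c * f(x .+ (o * h) .* v) for (c, o) in zip(coeffs, offsets)) / h
end

f(x) = [sum(abs2, x), prod(x)]                                 # vector-valued example
directional_derivative(f, [1.0, 2.0, 3.0], [0.0, 1.0, 0.0])    # ≈ [4.0, 3.0]

The real functionality also determines the step size in a more principled way than the fixed h used here.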

The code for points 2 and 3 is, by design, already very well separated from the rest of the code in Nabla, so creating the two new packages is a quick job if someone sets up the repos (they will obviously need to be public immediately). The small extension of @wesselb 's finite differencing should only take an afternoon. There is an unbounded amount of work that could go into point 3 as there is a lot of scope for linear algebra optimisation in reverse-mode AD. The intention is to communicate with other people in the community to ensure that the library is useful for their purposes; as soon as it's up and running in a basic form I'll whack it on the Julia slack and get a conversation going about it. It's important that this happens sooner rather than later.

@omus as discussed previously, I will separate out the code associated with points 2 and 3 into separate submodules within Nabla once I've merged #78 and #79, and then we can sort out repos for the new packages.

jrevels commented 6 years ago

The idea would be to create the linear algebra counterpart to DiffRules.jl, which only supports functions of scalars.

I'd love it if we could get DiffRules to support both - it shouldn't be that hard, I actually had a half-working implementation of it a long time ago, but the branch has since gone the way of the dinosaur...

AFAICT the only thing that blocks DiffRules from being used for mixed-scalar/tensor derivatives right now is simply that the rules don't have dispatch metadata attached to clarify which arguments are scalars and which are tensors. It shouldn't be too bad to support, for example,

@define_diffrule M.f(x::Tensor, y::Scalar) = ...
@define_diffrule M.f(x::Tensor, y::Tensor) = ...
⋮
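For reference, and as I understand the current DiffRules macro, existing (scalar-only) rules are written as one expression per argument, along the lines of (M.g, M.h and the dg_dx-style helpers are placeholders):

@define_diffrule M.g(x)    = :(dg_dx($x))
@define_diffrule M.h(x, y) = :(dh_dx($x, $y)), :(dh_dy($x, $y))

so the proposal above essentially amounts to letting those argument lists carry scalar/tensor type information for dispatch.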

Separately package the finite differencing functionality that @wesselb developed

Yes please! It would be so nice to have a package to replace Calculus for finite differencing.

willtebbutt commented 6 years ago

Apologies for taking ages to properly respond to this.

We've now separated out the finite differencing functionality, and the linear algebra ops. FDM is already registered on METADATA, and DiffLinearAlgebra will be shortly.

I'm definitely open to getting the linear algebra operations from DiffLinearAlgebra into DiffRules, but there are going to be a few things to think about in terms of the API; there are obviously more performance considerations to make when linear algebra is involved than there are in the scalar case. For example, this issue highlights that there are cases where it is possible + desirable to avoid recomputing certain intermediate quantities if the sensitivities w.r.t. multiple outputs are requested. Similarly, there are certain mutating implementations of sensitivities that (I think) are only safe to use in particular situations. I don't have workable solutions for either of these things at the minute, and I'm sure there are more, so I don't want to accidentally tie us in to a sub-optimal solution that could easily be avoided.
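To illustrate how an intermediate quantity can be shared between sensitivities (an example of my own, not necessarily the case discussed in the linked issue, and not the DiffLinearAlgebra API): for Y = A \ B, the adjoint solve needed for B̄ can be reused when forming Ā,

using LinearAlgebra

# Reverse-mode sensitivities for Y = A \ B (sketch only).
function backslash_sensitivities(Ȳ, A, B, Y)
    B̄ = A' \ Ȳ      # shared intermediate: the adjoint solve
    Ā = -B̄ * Y'     # reuses B̄ rather than recomputing A' \ Ȳ
    return Ā, B̄
end

A mutating variant that accumulated into preexisting Ā and B̄ buffers would avoid the allocations, but is only safe when those buffers aren't aliased elsewhere, which is the sort of situation-dependence I mean above.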

@jrevels what are your thoughts regarding the best way forward here? It would be great if we could get your thoughts on DiffLinearAlgebra at some point to determine what needs to happen before it would be useful for Capstan.

jrevels commented 6 years ago

I'm definitely open to getting the linear algebra operations from DiffLinearAlgebra into DiffRules

Hmm...after looking at DiffLinearAlgebra (great stuff, BTW!), it seems to be solving a very different set of problems than DiffRules. I probably wouldn't categorize it as "DiffRules for linear algebra."

IIUC, DiffLinearAlgebra is a library of hand-optimized, adjoint-based derivative kernels intended to be eagerly executed by downstream reverse-mode AD tools (for example, as part of tape evaluation).
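Concretely, such an eagerly executed kernel for C = A * B might look something like this (a sketch of the style, not the actual DiffLinearAlgebra code):

using LinearAlgebra

# Hand-written reverse-mode kernels for C = A * B, run directly during tape evaluation.
mul_sensitivity_A(C̄, A, B) = C̄ * B'   # sensitivity w.r.t. A
mul_sensitivity_B(C̄, A, B) = A' * C̄   # sensitivity w.r.t. B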

In contrast, DiffRules provides a library of symbolic descriptions of derivative kernels. DiffRules doesn't provide any eager kernel implementations, and is in fact totally agnostic to how the derivatives are computed and consumed (e.g. the choice of forward mode, reverse mode, etc.). The idea is that downstream tools can use DiffRules to "compile" their own eager kernel implementations, in whatever manner best suits the tool.

To facilitate this approach, DiffRules' rules are written out quite naively. For example, the rule for exp(x) is just exp(x). There's no hint that there's a redundant computation lurking there, but that's okay - downstream tools can (and should) simply perform a technique like CSE on any generated code (and indeed, ForwardDiff does this). In general, this approach allows us to treat optimizing AD kernels as a compilation problem, which is nice, since all kernels stand to benefit from new compilation-level optimizations.
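For example (as I understand the DiffRules query API), a downstream tool asks for a rule as an expression and splices it into its generated code:

using DiffRules

DiffRules.diffrule(:Base, :exp, :x)    # returns :(exp(x)) — the redundant primal computation is left for CSE to remove
DiffRules.diffrule(:Base, :^, :x, :y)  # returns one expression per argument, e.g. (:(y * x ^ (y - 1)), :(x ^ y * log(x)))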

My vision for Capstan is that it will tackle issues like intermediate state caching/recomputation, argument selection, and dependency analysis as compilation problems when generating individual kernel implementations. TBH, it's a pretty idealistic plan, but we'll see how far I get 😄 In that vein, DiffLinearAlgebra will certainly be useful to benchmark against.

willtebbutt commented 6 years ago

I agree with your first three paragraphs: the "DiffRules for linear algebra" statement is a bit misleading.

The reason that DiffLinearAlgebra currently doesn't provide a collection of symbolic expressions is historical; it's a minimal port from Nabla and I have no issue whatsoever transitioning over to symbolic expressions (where possible) if that would make this library more useful (doing so would probably actually make a number of implementation details cleaner and make testing easier).

That being said, whereas the forward- and reverse-mode sensitivities of scalar operations can always be expressed in the forms ∂f/∂x ẋ and f̄ ∂f/∂x respectively, meaning that it's reasonable to provide an expression for ∂f/∂x absent the context in which it will be used (hence DiffRules is a sensible thing to do), you would really need to be able to reference ẋ and f̄ when defining linear algebra expressions. This means that we would need a slightly more general interface for linear algebra than is currently provided by DiffRules, and it would certainly require different forward- and reverse-mode expressions.
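To make that concrete with matrix multiplication (just an illustrative example): for Y = A B the natural rules are Ẏ = Ȧ B + A Ḃ in forward mode and Ā = Ȳ Bᵀ, B̄ = Aᵀ Ȳ in reverse mode, so a linear algebra rule needs to be able to name Ȧ, Ḃ and Ȳ directly, rather than just supplying a single ∂f/∂x expression to be contracted against them.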

Regarding the "where possible" statement above, there are certainly operations (such as the Cholesky factorisation) for which symbolic expressions might be a bit unwieldy (see here). Now that I think about this a bit more, I'm not sure whether these larger implementations are going to be problematic or not.

Anyway, my point is that, given symbolic expressions for the linear algebra operations, I agree that it's reasonable to hope that compiler optimisations can eliminate redundant code when compiling custom kernel implementations, and that this is a significantly better idea than hand-coding lots of optimisations. (The issue I linked in my previous comment is a good example of this; I would definitely prefer to be able to just forget about it.) However, I would contend that you simply can't handle linear algebra properly without a number of hand-coded symbolic expressions for the forward- and reverse-mode sensitivities, because the underlying operations aren't written in Julia. If at some point in the future we have a native Julia implementation of (for example) LAPACK, then it would be a really good idea to try to produce an AD tool that can generate reasonably well-optimised kernels for each operation. To the best of my knowledge, we shouldn't expect this to happen any time soon (and almost certainly never for BLAS), so a symbolic version of the current implementation of DiffLinearAlgebra will be necessary for Capstan to be able to differentiate arbitrary Julia code even reasonably efficiently.

Anyway, if you're interested, perhaps we could move this type of discussion over to either the DiffRules or DiffLinearAlgebra repo and figure out what kind of changes are necessary to make the code here useful for other AD packages (i.e. Capstan)?