Closed benruijl closed 5 years ago
Looks interesting! So, is this an implementation of multidual numbers? I think I read that these can be used for multiple derivatives.
Note that for computing matrices of partial derivatives, one can create a matrix of dual numbers and pass that to the partials function of this package, as I've done here.
Moving forward, I think that if we can confirm this implements multidual numbers, we ought to update the library such that two types are defined: Multidual<T, N> and type Dual<T> = Multidual<T, U2>. This should enable easy use of "simple" dual numbers, as today, while also extending to multidual numbers.
Yes, this should be a multidual number (although admittedly I haven't really read much about it). I am using an earlier version of this code extensively for my work (I see you are a physicist too!) to compute jacobians and it works. You can also check out the example in the Readme file.
In my branch I've already implemented the type definition for Dual<T> (at the bottom of lib.rs) 😊
Ah indeed, I hadn't scrolled down to the bottom of lib.rs. Here's a link to the diff for comparison.
(I try to be a physicist, and take your comment as compliment, but I'm more of an engineer ;-) )
Oh wow, very nice. I'll be honest and admit this isn't my specialty, so I can't say much about the math itself.
However, I will try to keep an eye on this and monitor the API design. So far it looks pretty good.
The maths of my implementation actually isn't that hard: each component is independent, so you just apply the same transformation to all N dual components.
However, the difference lies in the input: you associate a variable x with a 1 in the first dual component and a variable y with a 1 in the second. This gives you the partial derivatives in one computation (see the example code). This setup resulted in a factor-2 speedup in my work code.
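To illustrate the seeding idea described above, here is a toy two-variable sketch (a hypothetical minimal type written just for this comment, not this crate's actual API): each variable gets a 1 in its own dual slot, and both partials fall out of a single evaluation.

```rust
// Toy hyperdual with 2 dual components, one per independent variable.
// Hypothetical illustration only -- not the dual_num API.
#[derive(Clone, Copy, Debug)]
struct Hyperdual2 {
    re: f64,
    eps: [f64; 2],
}

impl Hyperdual2 {
    /// Seed variable number `slot` with a 1 in its own dual component.
    fn var(value: f64, slot: usize) -> Self {
        let mut eps = [0.0; 2];
        eps[slot] = 1.0; // d(var)/d(var) = 1
        Hyperdual2 { re: value, eps }
    }
    fn mul(self, rhs: Self) -> Self {
        Hyperdual2 {
            re: self.re * rhs.re,
            // Product rule, applied to each dual slot independently.
            eps: [
                self.re * rhs.eps[0] + self.eps[0] * rhs.re,
                self.re * rhs.eps[1] + self.eps[1] * rhs.re,
            ],
        }
    }
    fn add(self, rhs: Self) -> Self {
        Hyperdual2 {
            re: self.re + rhs.re,
            eps: [self.eps[0] + rhs.eps[0], self.eps[1] + rhs.eps[1]],
        }
    }
}

fn main() {
    // f(x, y) = x*y + x, evaluated at (3, 2): both partials in one pass.
    let x = Hyperdual2::var(3.0, 0);
    let y = Hyperdual2::var(2.0, 1);
    let f = x.mul(y).add(x);
    println!("f = {}", f.re); // 9
    println!("df/dx = {}", f.eps[0]); // y + 1 = 3
    println!("df/dy = {}", f.eps[1]); // x = 3
}
```

With plain duals you would need two evaluations of f to get both partials; here one pass suffices, which is where the speedup comes from.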
@benruijl after a quick skim through the code, I think you've implemented hyperdual numbers and not multidual numbers.
Hyperdual numbers are such that you define a space of size N+1 where there are N dual numbers, \epsilon_0 through \epsilon_n. \epsilon_i * \epsilon_j != 0 iff i != j, and \epsilon_i ^ 2 = 0. (I have attached a more formal definition -- this is an excerpt from a paper I'm trying to get published.)
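To spell out what those rules imply for a product (a sketch using the definitions above, with N = 2): the \epsilon_i^2 terms vanish while the cross term survives.

```latex
(a_0 + a_1\epsilon_1 + a_2\epsilon_2)(b_0 + b_1\epsilon_1 + b_2\epsilon_2)
  = a_0 b_0
  + (a_0 b_1 + a_1 b_0)\,\epsilon_1
  + (a_0 b_2 + a_2 b_0)\,\epsilon_2
  + (a_1 b_2 + a_2 b_1)\,\epsilon_1\epsilon_2
```

since \epsilon_1^2 = \epsilon_2^2 = 0 but \epsilon_1\epsilon_2 \ne 0; the surviving cross term is what carries second-order information.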
Multidual numbers are slightly different. They are based on multicomplex numbers. You define a space of size N+1 where there are N multidual numbers, also \epsilon_0 through \epsilon_n. However, \epsilon_i^2 = 0 iff i = 2. In fact, the rule is: \epsilon_i ^ i = 0 and \epsilon_i ^ k != 0 for k < i.
Nevertheless, if we confirm that this allows for an implementation of hyperdual numbers in the library, we should definitely include it in the code. Great work!
Yes, you are right, sorry, I got the terminology wrong. I implemented a form of hyperduals where e_i*e_j = 0 for any i, j though. e_i*e_j corresponds to a second-order derivative, and I just keep the first order.
All good! This means I've already got a few test cases we can add in since I use hyperduals in my code ;-)
I currently have a big backlog between different projects (working on a proof of concept for better frames management in nyx, and then I want more obvious support for TAI and UTC times in hifitime). Once those are done, maybe I can try to tackle this work. This means I might only start working on this in three to four weeks.
@novacrazy , @benruijl I'll start working on this tomorrow, unless one of you has already started doing so.
I've done most of the trivial changes, so I'm probably about 40% there. Still need to change the most important parts of the code. Oh, I also renamed the struct from DualN to Hyperdual and bumped the version from 0.2.5 to 0.3.0.
I'll be adding tests of the gradient matrix too to check that this works well. I suspect that this will lead to speed improvements in libraries (library?) which use this crate for hyperdual calculations.
As a coworker pointed out, hyperduals should allow us to extract the second derivative (or "multidimensional duals", to use language I've only just now seen in the Julia ForwardDiff package paper, https://arxiv.org/pdf/1607.07892.pdf). As correctly pointed out by @benruijl above, e_i*e_j = 0 causes us to lose the knowledge of this second derivative. Isn't this something we would want? I could use that information in my research for optimization problems.
If so, any idea what e_i*e_j should be equal to? (I'll look into other papers to see how that was implemented.)
You could keep those terms in, but note that these are only the cross terms. You won't get the e_i*e_i terms, since we set those to 0 (for now).
If you want to keep the cross terms, could you make a special struct for those, since it involves extra computations?
By a special struct, you mean that there would be one struct for Dual<N, T> and another for Hyperdual<N, T>, where the Hyperdual would allow extraction of the second-order derivatives? Or do you mean something else entirely?
Yes that's what I mean. One could even make a Hyperdual<N1, N2, T> to control the order, but that's a little bit more work to generalize (it needs Taylor expansions with arbitrary depth).
Ha, yeah, that would be pretty darn awesome actually! It would violate the mathematical rule that e_i ^ 2 = 0, though.
I've been busy on work projects, but I should be able to get back to this soon.
Are any of you aware of a possibility for reverse-mode differentiation with hyperdual numbers? In particular, I have a common case in which a naive hyperdual-number approach would require O(N^2) memory rather than the O(N) required for reverse-mode differentiation. I'd love to have a simple AD approach for my code, but square-rooting the size of the system I can handle is a non-starter.
The physical system I'm looking at involves an array representing a scalar field (the density as a function of position), and I use FFTs to perform convolutions efficiently. Computing the gradient by hand isn't too painful but is very error-prone, so currently I've got a code generator (written in Haskell) that generates the derivatives. I'd love to have a simple library like dual_num, if I could see a way to get the gradient efficiently.
@droundy Here is my possibly wrong understanding of forward-mode accumulation versus reverse mode. In forward mode, if a function is defined R^n -> R^m, then n evaluations of the function are needed. Is that correct to your understanding?
I'm not sure, but I wonder whether the rewrite for hyperdual support (#17, and especially #19) enables reverse accumulation. I'm really not sure, though. My rationale for this is that, as shown in #19 with examples in the code base, the computation of a matrix of partials only needs to iterate over the size of the output matrix (link), whereas before we needed a double iteration over the greater dimension.
The Wikipedia article also mentions that forward and backward accumulation are two "extreme" ways of traversing the chain-rule graph. So I wonder whether the Rust compiler performs optimizations that allow the compiled code to be neither purely forward nor purely reverse accumulation.
I'll also note that, although I have yet to run benchmarks, this new implementation "feels" faster.
I'd be happy to help you code up a toy problem similar to the one you are facing so we can determine whether this is forward or reverse accumulation, and whether the complexity penalty is prohibitive.
Yes @ChristopherRabotin, your understanding sounds correct.
The most serious issue is not time but space, and there is no way the compiler is going to change the space complexity of a computation (we hope?).
A toy example would be to compute something like this (stupid and slightly pseudo-code) function:
fn f(x: &Array) -> Scalar {
let temp: Array = x*1.0;
temp.sum()
}
where we want the gradient with respect to all the elements of the array. Obviously this is an inefficiently written function, but the issue is that forward differencing (or, you could say, using hyperdual numbers) requires us to store a gradient for each element of the array, thus turning an algorithm that could use O(N) storage into one that uses O(N^2). In this toy case, you can trivially see that the gradient is all 1s.
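To make the storage blow-up concrete, here is a toy forward-mode version of that sum (a hypothetical DualVec type invented for this comment, not the crate's API): each of the N input elements must carry N derivative slots, so the seeded input alone costs O(N^2) memory.

```rust
// Each element carries a derivative slot for *every* input element.
// Hypothetical illustration only -- not the dual_num API.
#[derive(Clone, Debug)]
struct DualVec {
    re: f64,
    eps: Vec<f64>, // N slots: derivative w.r.t. each input element
}

/// Seed the inputs: element i gets a 1 in slot i. This allocates
/// an N-vector per element, i.e. O(N^2) storage in total.
fn seed_inputs(x: &[f64]) -> Vec<DualVec> {
    let n = x.len();
    x.iter()
        .enumerate()
        .map(|(i, &v)| {
            let mut eps = vec![0.0; n];
            eps[i] = 1.0;
            DualVec { re: v, eps }
        })
        .collect()
}

/// Forward-mode sum: derivatives just add up slot by slot.
fn sum(xs: &[DualVec]) -> DualVec {
    let n = xs[0].eps.len();
    let mut out = DualVec { re: 0.0, eps: vec![0.0; n] };
    for x in xs {
        out.re += x.re;
        for (o, e) in out.eps.iter_mut().zip(&x.eps) {
            *o += *e;
        }
    }
    out
}

fn main() {
    let inputs = seed_inputs(&[1.0, 2.0, 3.0, 4.0]);
    let s = sum(&inputs);
    // The gradient of a plain sum is all 1s, as noted above.
    println!("sum = {}, grad = {:?}", s.re, s.eps);
}
```

Reverse-mode AD would get the same all-1s gradient with a single backward pass and O(N) storage, which is exactly the gap being discussed.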
A more complex example (still pseudocode) would be a nonlinear function of a simple convolution:
fn f(n_x: &Array) -> Scalar {
let n_k = n_x.fft();
let wn_k = n_k*wk; // wk is some constant array, element wise multiplication assumed
let wn_x = wn_k.ifft();
(wn_x * wn_x.log()).sum()
}
With reverse-mode AD you can compute the gradient of this function with O(N) storage and O(N log N) time. This relies on the fact that the FFT is a linear operator, so the gradient of a function of an FFT can be computed with the help of an IFFT in O(N) space.
As you say, one could go with an in-between approach where you use dual numbers for computing gradients of scalars, but at the boundaries where vectors are involved you take the other approach?
On the other hand, it may be that my kind of problem simply isn't amenable to AD. My current approach is reverse-mode but involves code generation (which is a major nuisance), and I don't know how to make anything as user friendly as dual numbers. I was hoping (by posting here) that one of you would have an idea.
Responding specifically to the second example:
I think I understand your problem. And if I do, wouldn't a solution be for us to implement a .fft() function on a Dual number which would, in fact, compute an ifft for it? We currently have to code up how to perform each derivative, cf. the sqrt function. Do you think that could help? Then, any call to fft on any item of an array of Dual would compute the ifft on each dual part and the fft on the real part.
Actually, after writing up my last comment, I started thinking that what I need is probably outside the bounds of what a general-purpose library should handle. The problem is even easier to trigger than with an FFT: just computing a plain sum with no intermediates causes the O(N^2) pain. It happens as soon as you want to compute N derivatives in a dual-number system.
The real reason I can do my computations efficiently is that an FFT is the only way I allow information to leak from one element of the array to another (except for summation). All other operations are elementwise, which allows me to store only one scalar derivative value per element of an array: the derivative of that element with respect to the corresponding element of the original input.
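A minimal sketch of that elementwise bookkeeping (hypothetical types written for this comment, not the dual_num API): because each output element depends only on the corresponding input element, a single (value, derivative) pair per element suffices, keeping the total storage at O(N).

```rust
// One scalar derivative per element: valid only while every operation
// in the chain is elementwise. Hypothetical illustration.
#[derive(Clone, Copy, Debug, PartialEq)]
struct ElemDual {
    re: f64,
    eps: f64, // d(this element) / d(corresponding input element)
}

impl ElemDual {
    fn var(v: f64) -> Self {
        ElemDual { re: v, eps: 1.0 } // seed: dx_i/dx_i = 1
    }
    fn scale(self, c: f64) -> Self {
        ElemDual { re: self.re * c, eps: self.eps * c }
    }
    fn ln(self) -> Self {
        // d ln(u) = du / u
        ElemDual { re: self.re.ln(), eps: self.eps / self.re }
    }
}

fn main() {
    // f_i(x) = ln(2 * x_i), elementwise; df_i/dx_i = 1/x_i.
    let xs = [1.0, 2.0, 4.0].map(ElemDual::var);
    let ys = xs.map(|x| x.scale(2.0).ln());
    let grad: Vec<f64> = ys.iter().map(|y| y.eps).collect();
    println!("{:?}", grad); // [1.0, 0.5, 0.25]
}
```

The moment an operation mixes elements (an FFT, a matrix multiply), this single-slot representation breaks down, which is why the FFT needs the special-cased adjoint treatment discussed above.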
Closed via #17 .
I have written an N-dimensional dual number class (a + b ep1 + c ep2 +...), extending dual_num. This is useful, since all partial derivatives can be computed at once this way. I'll do some more testing, some clean-up, and eventually I'll make a pull request.
You can have a look and try it out already: https://github.com/benruijl/dual_num