probcomp / GenDirectionalStats.jl

Distributions on spaces of rotations and other spatial domains.
MIT License

Gradients of log densities #3

Open marcoct opened 4 years ago

marcoct commented 4 years ago

@bzinberg @agarret7 This is not super urgent, but it will pop up soon, and I wanted to have a place to start a discussion.

If it turns out that gradients with respect to values on these manifolds can be represented by vectors of real numbers (e.g. in the embedding space), and that gradient steps on these manifolds can be approximated well by gradient steps in the embedding space followed by a projection back onto the manifold, then I think there are good reasons to use real vectors in the embedding space to represent both elements of these manifolds and their gradients.

For example, we could use length-9 vectors of floats for 3D rotations (3x3 rotation matrices are a subset of this space), length-3 vectors of floats for directions, and length-2 vectors of floats for plane rotations. Model code that used these would then need to cast the values (e.g. via Rotation{3,Float64}(value) where value is a length-9 Vector{Float64} or something similar).
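To make the casting idea concrete, here is a minimal sketch in plain Julia (the helper names are hypothetical, not part of this package's current interface) of how length-9 / length-3 / length-2 real vectors could be interpreted in model code, and of a gradient step in the embedding space followed by a projection back onto SO(3):

```julia
using LinearAlgebra

# Hypothetical casting helpers (illustrative names, not this package's API):

# length-9 vector of floats -> 3x3 matrix, assumed to already lie on SO(3);
# model code could further wrap this in e.g. Rotations.RotMatrix if desired
as_rotation_matrix(v::AbstractVector{<:Real}) = reshape(v, 3, 3)

# length-3 vector of floats -> unit direction on the sphere S^2
as_direction(v::AbstractVector{<:Real}) = normalize(v)

# length-2 vector of floats -> point on the unit circle (a plane rotation)
as_plane_rotation(v::AbstractVector{<:Real}) = normalize(v)

# embedding-space gradient step followed by projection back onto SO(3);
# the projection uses the SVD (nearest rotation in Frobenius norm)
function project_to_SO3(M::AbstractMatrix)
    F = svd(M)
    R = F.U * F.Vt
    # flip one axis if we landed on a reflection (det == -1)
    det(R) < 0 ? F.U * Diagonal([1.0, 1.0, -1.0]) * F.Vt : R
end

embedded_gradient_step(R, grad, step) = project_to_SO3(R .+ step .* grad)
```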

If elements of these spaces are instead treated as special data types (as is currently the case), then many general-purpose Gen inference procedures that use gradients would need to be generalized to work with these data types, for example, by making sure that all generic Gen inference code that uses gradients is independent of the internal data representation (it currently assumes vectors of reals). That would increase code complexity and might also limit performance in a way that committing to vectors of reals would not. Also, automatic differentiation would need to handle these special data types (which is pretty involved, and can be brittle). AD libraries are always going to be best at supporting arrays of floats.

But again, this hinges on the question of whether you can approximate the gradient on a manifold by a gradient in the embedding space. That sounds plausible to me, but not 100% obvious, and also something that might not be that hard to find out. But I have yet to do this.

marcoct commented 4 years ago

Note that moving to use vectors of reals instead of custom data types would be a change to the interface for this package. I have a hunch this is what we should do, but not sure, and would like to discuss.

bzinberg commented 4 years ago

But again, this hinges on the question of whether you can approximate the gradient on a manifold by a gradient in the embedding space. That sounds plausible to me, but not 100% obvious, and also something that might not be that hard to find out. But I have yet to do this.

It's not so much about getting the right gradient vector, but about making sense of what it means to "follow" the gradient vector along the manifold itself, rather than following it along the embedding of a tangent plane into the ambient space.

You have previously said that you do not want to enter into a general discussion of how to do gradient descent on manifolds, but TBH, I'm skeptical we can have a meaningful discussion without moving to essentially that level of generality.

If you fix a chart that contains the current state, then the function SO(3) -> R being optimized can be viewed in local coordinates as a function R^3 -> R. You can do one or several steps of gradient descent inside this chart, as long as the steps are small enough that the new state is still contained inside the chart. Then, you can switch charts, to some other chart that contains the new state.

"Will this process of repeatedly switching charts allow you to traverse the whole manifold?"

Yes. (I think this is true for every choice of atlas on SO(3), since SO(3) is a compact connected manifold, but in any case, we can choose our own atlas and ensure that it has this property.)

"How can our program know when to switch charts?"

Good question. I am interested in the general version of this question, but we can also talk about the specific version once we choose a specific atlas for SO(3).
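To make the chart picture concrete, here is a minimal sketch of descent on SO(3) that takes steps in local coordinates and switches charts when the new state drifts too far from the chart's center. The atlas of charts ω ↦ exp(hat(ω)) * R0 and the switching heuristic are just illustrative choices, not anything fixed by the discussion above, and the finite-difference gradient is a stand-in for AD:

```julia
using LinearAlgebra

# hat: R^3 -> 3x3 skew-symmetric matrices (the Lie algebra so(3))
hat(ω) = [0.0 -ω[3] ω[2]; ω[3] 0.0 -ω[1]; -ω[2] ω[1] 0.0]

# chart centered at a base rotation R0: ω in R^3 ↦ exp(hat(ω)) * R0
chart(R0) = ω -> exp(hat(ω)) * R0

# crude central-difference gradient of f at ω (placeholder for AD)
function fd_grad(f, ω; h=1e-6)
    [(f(ω .+ h .* e) - f(ω .- h .* e)) / (2h) for e in eachcol(Matrix{Float64}(I, 3, 3))]
end

# gradient descent that switches (re-centers) the chart whenever the local
# coordinate wanders too far from the chart's center
function chart_descent(f, R0; step=1e-2, iters=100, switch_radius=0.5)
    R, ω = R0, zeros(3)
    for _ in 1:iters
        g = fd_grad(x -> f(chart(R)(x)), ω)
        ω -= step .* g
        if norm(ω) > switch_radius   # new state is near the chart boundary: switch charts
            R, ω = chart(R)(ω), zeros(3)
        end
    end
    chart(R)(ω)
end
```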

bzinberg commented 4 years ago

BTW, a more complicated but more theoretically elegant approach to defining how to "follow" a tangent vector would be to use the exponential map on SO(3). This method is coordinate-free, whereas the above method depends on the particular choice of chart. But definitely can and should start with the simpler approach.

marcoct commented 4 years ago

You have previously said that you do not want to enter into a general discussion of how to do gradient descent on manifolds, but TBH, I'm skeptical we can have a meaningful discussion without moving to essentially that level of generality.

@bzinberg I think I said that it's higher priority to implement a modeling and inference library that supports the manifolds we will use in practice (e.g. those in this package) as opposed to a maximally general library. I don't have a problem with discussing gradient descent on manifolds. But generally speaking, I do like to ground discussions in concrete examples.

Anyways, this looks like a pretty good reference with a lot of detail for the types of manifolds that this package cares about, and there is an associated library: https://arxiv.org/pdf/1812.01537.pdf

Also, GTSAM supports these manifolds, and it would probably be wise to understand how they handle these things first. It might save a lot of reinvention of the wheel: https://gtsam.org/notes/GTSAM-Concepts.html It seems like GTSAM at least made the choice to introduce manifold concepts into their core interfaces. That's informative, but it's not obvious to me that Gen should make the same choice. I want to understand the underlying operations and their implementation first. I am particularly concerned about how to implement things in a way that smoothly interacts with AD libraries.

If you fix a chart that contains the current state, then the function SO(3) -> R being optimized can be viewed in local coordinates as a function R^3 -> R

Yes. After quickly reading through the PDF above, I think a more specific version of my speculation above is whether it is possible to use 3x3 skew-symmetric matrices (which have three degrees of freedom) to represent gradients of scalar functions with respect to rotation matrices, and whether you can implement gradient descent simply by adding a small multiple of such a matrix (i.e. multiplied by a small step size) to the rotation matrix to get an updated rotation matrix. Vaguely speaking, it seems like maybe this is correct only when the rotation matrix being added to is the identity rotation, and that you need to explicitly involve the exponential map. Interesting; lots to learn!
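A quick numerical check of that intuition (plain Julia, nothing package-specific): adding a small skew-symmetric matrix even to the identity gives only an approximate rotation, while the matrix exponential of the same element lands exactly on SO(3).

```julia
using LinearAlgebra

X = [0.0 -0.1 0.0;
     0.1  0.0 0.0;
     0.0  0.0 0.0]          # a small element of so(3) (skew-symmetric)

A = I + X                    # additive update at the identity
norm(A' * A - I)             # ≈ 0.014: orthogonality violated at order ‖X‖²

R = exp(X)                   # matrix exponential of the same element
norm(R' * R - I)             # ≈ 1e-16: lands on SO(3) up to rounding
```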

bzinberg commented 4 years ago

Cool, I'll take a look at those references when I get the chance. It's interesting that you have a strong default to start from the robotics side and I have a strong default to start from the math side 🙂

simply by adding a small multiple of such a matrix (i.e. multiplied by a small step size) to the rotation matrix to get an updated rotation matrix. Vaguely speaking, it seems like maybe this is correct only when the rotation matrix being added to is the identity rotation, and that you need to explicitly involve the exponential map. Interesting; lots to learn!

I think the right way to "increment" an element g of the Lie group by an element X of the Lie algebra is to multiply, i.e. exp(X) * g. The idea of the exponential map is to integrate along left-invariant vector fields, where "left-invariant" means "multiplication on the left," and this is all based on the idea that things should look the same when you move from a neighborhood U to a translated neighborhood {h * g : g in U}, where h is some fixed element of the Lie group and * denotes group multiplication (i.e. matrix multiplication in this case).
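Concretely, under the usual identification of so(3) with R^3 via the hat map, that update could look like this minimal sketch (plain Julia; the helper names are hypothetical):

```julia
using LinearAlgebra

# hat: R^3 -> so(3); hat(ω) * v == ω × v
hat(ω) = [0.0 -ω[3] ω[2]; ω[3] 0.0 -ω[1]; -ω[2] ω[1] 0.0]

# "increment" a rotation R (an element of the Lie group) by a tangent
# direction ω scaled by a step size, via left multiplication by exp(step * hat(ω))
lie_step(R::AbstractMatrix, ω::AbstractVector, step::Real) = exp(step .* hat(ω)) * R

# example: starting from the identity, take a step about the z-axis
R0 = Matrix{Float64}(I, 3, 3)
R1 = lie_step(R0, [0.0, 0.0, 1.0], π / 4)   # 45° rotation about z
```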

bzinberg commented 4 years ago

May also be of interest: https://github.com/JuliaGeometry/Rotations.jl/issues/30. Floating a proof-of-concept for this is still on my list...

bzinberg commented 4 years ago

whether it is possible to use 3x3 skew-symmetric matrices (which have three degrees of freedom) to represent gradients of scalar functions with respect to rotation matrices

That's possible to do, as long as you're within a specific chart -- that's the "simpler" approach I was describing above. The exponential map approach is fancier and doesn't depend on a specific choice of chart.

marcoct commented 3 years ago

This is very relevant: https://probprog.cc/assets/posters/thu/78.pdf