mjhajharia / transforms


mathematically defining transforms #36

Open bob-carpenter opened 2 years ago

bob-carpenter commented 2 years ago

The paper's getting a bit out of hand with everyone writing in their own style. I'm trying to work with Meenal now to strip out most of what's there and then build up a draft with coherent notation. As such, the first thing we need to do is define transforms. To me, the goal of the whole business is that when we sample on the unconstrained scale and transform, we get a uniform distribution on the constrained scale. So informally, we want something like this.

Suppose $\mathcal{Y}$ is a domain of unconstrained parameters and $\mathcal{X}$ is a domain of constrained parameters. For now, $\mathcal{Y} = \mathbb{R}^N$ for some $N$ and $\mathcal{X} \subseteq \mathbb{R}^M$ for some $M$. We further suppose a (perhaps improper) density function $p_Y: \mathcal{Y} \rightarrow (0, \infty)$ with support over all elements of $\mathcal{Y}$, and a surjective (onto) map $g:\mathcal{Y} \rightarrow \mathcal{X}$. Informally, we want to ensure that if $Y \sim p_Y(\cdot)$, then $g(Y) \sim \textrm{uniform}(\mathcal{X})$.

The reason this definition doesn't work is that $p_Y$ may be improper, so it doesn't really make sense to talk about $Y \sim p_Y(\cdot)$. So instead, I think we want this:

Definition (Constraining Transform): Suppose $\mathcal{Y}$ is a domain of unconstrained parameters and $\mathcal{X}$ is a domain of constrained parameters. Further suppose a (perhaps improper) density function $p_Y: \mathcal{Y} \rightarrow (0, \infty)$ with support over all elements of $\mathcal{Y}$ and a surjective map $g:\mathcal{Y} \rightarrow \mathcal{X}$. The pair $(p_Y, g)$ is a constraining transform from $\mathcal{Y}$ to $\mathcal{X}$ if for any proper density $\pi_X(x)$ defined over $\mathcal{X}$, the density defined by $\pi_Y(y) = p_Y(y) \cdot \pi_X(g(y))$ is proper and if $Y \sim \pi_Y$, then $X = g(Y) \sim \pi_X$.
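As a concrete sanity check of the definition, here is a sketch (the specific choices $g(y) = e^y$, $p_Y(y) = e^y$, and $\pi_X = \textrm{Exponential}(1)$ are illustrative, not from the paper) that verifies numerically that the tilted density is proper and pushes forward to $\pi_X$:

```python
import numpy as np

# Sanity check of the definition with g(y) = exp(y) mapping R -> (0, inf).
g = np.exp
p_Y = np.exp                          # improper base density on R (the Jacobian of g)

def pi_X(x):
    return np.exp(-x)                 # proper target density on (0, inf): Exponential(1)

def pi_Y(y):
    # tilted density from the definition: pi_Y(y) = p_Y(y) * pi_X(g(y))
    return p_Y(y) * pi_X(g(y))

# 1) pi_Y is proper: it integrates to ~1 over the real line (Riemann sum).
ys = np.linspace(-30.0, 30.0, 600001)
total = np.sum(pi_Y(ys)) * (ys[1] - ys[0])
print(round(total, 6))                # 1.0

# 2) The pushforward of pi_Y under g is pi_X: the density of X = g(Y) is
#    pi_Y(g^{-1}(x)) * |d/dx g^{-1}(x)| = pi_Y(log x) / x.
xs = np.array([0.1, 0.5, 1.0, 2.0, 5.0])
print(np.allclose(pi_Y(np.log(xs)) / xs, pi_X(xs)))   # True
```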

We can then go on and define the special case of one-one changes of variables.

Lemma (Change of Variables): Suppose $\mathcal{Y} = \mathbb{R}^N$, $\mathcal{X} \subseteq \mathbb{R}^N$, and $f:\mathcal{X} \rightarrow \mathcal{Y}$ is a bijection with differentiable inverse. If we let $p_Y(y) = \left| \det \frac{\partial}{\partial y} f^{-1}(y) \right|$, then the pair $(p_Y, f^{-1})$ is a constraining transform. Proof: For any proper density $\pi_X$ on $\mathcal{X}$, the tilted density is $\pi_Y(y) = p_Y(y) \cdot \pi_X(f^{-1}(y)) = \pi_X(f^{-1}(y)) \left| \det \frac{\partial}{\partial y} f^{-1}(y) \right|,$ which by the change of variables formula is exactly the density of $Y = f(X)$ when $X \sim \pi_X$; hence $\pi_Y$ is proper and $f^{-1}(Y) \sim \pi_X$.
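A quick finite-difference sketch of the lemma's Jacobian term for the scalar positivity transform (our choice of example, not one fixed by the paper):

```python
import numpy as np

# Check the lemma's Jacobian for f(x) = log(x) on (0, inf), so
# f^{-1}(y) = exp(y) and |det d/dy f^{-1}(y)| = exp(y), i.e. the
# log-Jacobian is just y.
f_inv = np.exp

def log_det_jac(y):
    return y  # log |d/dy exp(y)| = y

ys = np.array([-2.0, 0.0, 1.5])
h = 1e-6
fd = (f_inv(ys + h) - f_inv(ys - h)) / (2.0 * h)   # central difference
print(np.allclose(np.log(np.abs(fd)), log_det_jac(ys), atol=1e-8))  # True
```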

We can then move on to the example of simplexes, where we are defining a transform from $\mathbb{R}^N$ to $\Delta^N \subseteq \mathbb{R}^{N+1}$ by setting element $N + 1$ to $1$ minus the sum of the first $N$ elements. Another example is the Cholesky factor of covariance matrix transform, which goes from $\binom{M}{2} + M$ dimensions to an $M \times M$ matrix by filling the lower-triangular Cholesky factor and multiplying it by its own transpose.
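For the simplex case, one concrete construction is the stick-breaking map (this sketch follows the version described in the Stan reference manual; the shift term is from memory and worth double-checking):

```python
import numpy as np

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

def simplex_transform(y):
    """Map y in R^N to a point on the simplex Delta^N in R^{N+1} via
    stick-breaking; the last coordinate is 1 minus the sum of the others."""
    N = len(y)
    x = np.empty(N + 1)
    remaining = 1.0
    for k in range(N):
        # the -log(N - k) shift makes y = 0 map to the uniform simplex point
        z = expit(y[k] - np.log(N - k))
        x[k] = remaining * z
        remaining -= x[k]
    x[N] = remaining
    return x

x = simplex_transform(np.zeros(3))
print(np.allclose(x, 0.25))        # True: y = 0 gives the uniform point
print(np.isclose(x.sum(), 1.0))    # True: lands on the simplex
```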

sethaxen commented 2 years ago

In general I like this formulation, but I had some thoughts (sorry this is long):

bob-carpenter commented 2 years ago

> There may be multiple competing notions of uniformity
>
> I suggest we should be explicit that we assume our densities are defined with respect to a uniform measure for some notion of uniformity, so the desired constraining transform is one that takes some measure on the unconstrained domain and turns it into the desired uniform measure on the constrained domain.

I would love to make this explicit in a way that I can understand. My measure theory is weak past the basic definitions of sigma algebras and Lebesgue integration. The uniformity we want is for all $x \in \mathcal{X}$,

$$p_X(x) = \textrm{const}$$

I don't know how to say that when we're dealing with unconstrained spaces that lead to improper "densities".

"support over all elements of Y" : I think technically it is okay if there are elements not supported, so long as the set of such elements has measure 0.

Thanks. We should be as mathematically precise as possible without turning this into an impenetrable measure theory paper.

In practice, we run into numerical problems even in simple positive-constraining transforms like exp(), which quickly overflows to +infinity and underflows to zero.
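A minimal illustration of that failure mode, and of staying on the log scale to avoid the round trip:

```python
import numpy as np

# Double-precision exp() overflows just above 709.78 and underflows to
# zero near -745, so extreme unconstrained values saturate.
with np.errstate(over='ignore'):
    print(np.isinf(np.exp(710.0)))      # True: overflow to +inf
print(np.exp(-746.0) == 0.0)            # True: underflow to zero

# Working on the log scale sidesteps the problem, e.g. computing
# log(e^a + e^b) stably instead of exponentiating first:
print(np.logaddexp(0.0, 710.0))         # 710.0, no overflow
```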

> Rightly so, a reader may be confused how such a map that apparently has a larger-dimensional output than its input could be bijective.

Isn't it only bijective with the constrained space? The Jacobian is from the Cholesky factor with $\binom{N}{2} + N$ elements to the lower-triangular portion of the covariance matrix, which has the same number of elements. The above-diagonal elements are just copied. So we'll go through and say that as we do it. Same with the simplex---the transform is just to the first $N - 1$ dimensions of the output with the last element being defined as one minus the sum of the other elements.
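A sketch of that point in code (the exp-on-the-diagonal parameterization is one common choice for positivity, not necessarily the paper's): the parameters fill only the lower triangle, and the above-diagonal elements of the result are determined by symmetry, so the Jacobian that matters is square.

```python
import numpy as np

def chol_cov_transform(y, M):
    """Map an unconstrained vector y of length binom(M, 2) + M = M*(M+1)/2
    to an M x M covariance matrix. Only the lower triangle carries
    parameters; the upper triangle of Sigma = L @ L.T is copied by
    symmetry rather than parameterized."""
    assert len(y) == M * (M + 1) // 2
    L = np.zeros((M, M))
    L[np.tril_indices(M)] = y                    # fill lower triangle row by row
    L[np.diag_indices(M)] = np.exp(np.diag(L))   # positive diagonal
    return L @ L.T

Sigma = chol_cov_transform(np.array([0.3, -0.5, 0.1, 0.7, 0.2, -0.4]), 3)
print(np.allclose(Sigma, Sigma.T))                   # True: symmetric
print(bool(np.all(np.linalg.eigvalsh(Sigma) > 0)))   # True: positive definite
```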

> But if we clarify that the notion of uniformity is over a super-diagonal, while the parameterization is the dense symmetric matrix, then it's clear both which map they need to compute the transform and which map they need the Jacobian of.

What does super-diagonal mean?

> This is personal preference, but I think it's cleaner notationally if we work primarily with forward transforms

I was trying to lay out the general approach that way. Do you think we should just skip the usual direction of transform and inverse when talking about changes of variables? We can do that, but we probably want a footnote or something to connect the dots, because every stats presentation I've ever seen does it the way I wrote it in the Stan reference manual.

mjhajharia commented 2 years ago

> Separate from the notion of uniformity is the notion of parameterization, e.g. a transform g:Y→X outputs a certain parameterization of x. For PD matrices, for example, densities are defined with respect to the Lebesgue measure on a triangle of the matrix, so the map from the unconstrained space to the triangle is the one we need the log-det-jac of. But we expect PD matrices to be parameterized not by just a triangle but rather by a dense symmetric matrix. So the transform we write down, and that we might implement in code, often isn't the one we need the Jacobian of.

Similar thing with ALR too, right? We don't actually take the Nth dimension in the Jacobian. I'm not sure how one would go about communicating that idea in a general way.
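A sketch of the ALR point (the pinned-last-logit convention and helper names here are our own): the Jacobian we take is of the map onto the first $N$ simplex coordinates only, which is square; the $(N+1)$th coordinate is pinned by the sum-to-one constraint. For this map the log-det works out to $\sum_k \log x_k$ over all $N+1$ coordinates, which the finite-difference check below confirms.

```python
import numpy as np

def alr_inverse(y):
    """Inverse additive-log-ratio: map y in R^N to the simplex in R^{N+1},
    with the last logit pinned to 0."""
    e = np.exp(np.append(y, 0.0))
    return e / e.sum()

def jacobian_first_N(y, h=1e-6):
    """Finite-difference Jacobian of only the first N simplex coordinates
    with respect to y: the square matrix whose log-det we actually need."""
    N = len(y)
    J = np.zeros((N, N))
    for j in range(N):
        d = np.zeros(N)
        d[j] = h
        J[:, j] = (alr_inverse(y + d)[:N] - alr_inverse(y - d)[:N]) / (2.0 * h)
    return J

y = np.array([0.2, -0.3, 0.5])
J = jacobian_first_N(y)
sign, logdet = np.linalg.slogdet(J)
print(J.shape)       # (3, 3): square, so the determinant is well-defined
print(sign > 0)      # True
# closed form: log det = sum over all N+1 coordinates of log x_k
print(np.allclose(logdet, np.sum(np.log(alr_inverse(y))), atol=1e-6))  # True
```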