harrispopgen / mushi

[mu]tation [s]pectrum [h]istory [i]nference
https://harrispopgen.github.io/mushi/
MIT License

data-dependent Gram-Schmidt orthogonalization #51

Closed wsdewitt closed 4 years ago

wsdewitt commented 4 years ago

We are currently using a standard Gram-Schmidt basis for the simplex, which is independent of the data. We might instead use a basis defined from the data, like this: https://arxiv.org/abs/1106.1451.
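For reference, a minimal sketch of what such a data-independent basis looks like (a generic construction, not necessarily the one in mushi): orthonormalize the zero-sum directions of R^k with Gram-Schmidt (here via QR).

```python
import numpy as np

def gram_schmidt_simplex_basis(k):
    """k x (k-1) matrix whose columns are orthonormal and sum to zero (an ilr basis)."""
    # the differences e_i - e_k (i = 1, ..., k-1) span the zero-sum hyperplane
    V = np.eye(k)[:, :-1] - np.eye(k)[:, [-1]]
    # QR factorization performs Gram-Schmidt orthonormalization of the columns
    Psi, _ = np.linalg.qr(V)
    return Psi

Psi = gram_schmidt_simplex_basis(4)
print(np.allclose(Psi.T @ Psi, np.eye(3)), np.allclose(Psi.sum(axis=0), 0))  # True True
```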

We have seen artifacts in rank penalization (shrinkage toward compositional balance) that I suspect arise from using the data-naive basis, especially for the nuclear norm penalty. Perhaps we'd be able to use the nuclear norm penalty instead of the rank penalty if we had a smarter basis, and then all our penalties would be convex.

See also: https://stats.stackexchange.com/questions/259208/how-to-perform-isometric-log-ratio-transformation

kamdh commented 4 years ago

This "shrinkage towards compositional balance" is the same as shrinkage towards the origin in the transformed space, right? I wouldn't think that's from the rank penalty but rather the L2/ridge penalty. Rank shouldn't perform shrinkage.

kamdh commented 4 years ago

I'm still fairly confused about the whole ilr/inverse ilr transform. From what you said at one point, the inverse transform is just the softmax function. Have we tried just using the softmax function f(Z) to model the relative mutation rates, without any basis transformations? It seems like that would also be okay, since it ensures that they sum to 1.

wsdewitt commented 4 years ago

I was seeing it with the nuclear norm penalty, which does shrink toward zero (soft thresholding). In principle the Frobenius penalty would do it too, but I've never applied a strong enough one.

The idea would be to make the origin of the Euclidean space correspond to a point estimate based on the data, rather than the uniform composition.

Softmax does normalize, but it doesn't constrain. Optimization would be problematic because the loss would be invariant under z --> z + const. The ilr solves this problem.
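To illustrate that invariance, a toy check (illustrative only, not mushi code):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

z = np.array([0.3, -1.2, 2.0])
print(np.allclose(softmax(z), softmax(z + 5.0)))  # True: z and z + const give identical rates
```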

kamdh commented 4 years ago

Yeah, but the regularization wouldn't be invariant under that transformation, so I'm not sure it would matter. It just seems simpler to me to apply a basic softmax rather than the more complicated ilr business.

What do you mean by "constrain"?

It seems like we've already solved this problem via the rank penalty.

wsdewitt commented 4 years ago

Constrain was the wrong word; I just meant there's a ridge. Yeah, I think the ridge penalty could cope with this if we formulated the cost with μ = μ0 * softmax(baseline + z), where exp(baseline_j) = S_j / S is our initial constant estimate of the relative mutation rate for each type. Then the ridge penalty (and soft singular-value thresholding) will shrink towards this estimate, rather than the naive uniform composition.
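A sketch of that formulation (hypothetical names and a 1-D z for illustration; in the actual cost z would be the matrix over epochs):

```python
import numpy as np

def mutation_rates(z, mu0, S):
    """Sketch of mu = mu0 * softmax(baseline + z), with exp(baseline_j) = S_j / S."""
    baseline = np.log(S / S.sum())   # constant estimate of the relative rates, in log space
    w = np.exp(baseline + z)
    return mu0 * w / w.sum()         # softmax: relative rates sum to 1

S = np.array([10.0, 30.0, 60.0])     # toy per-type counts
print(mutation_rates(np.zeros(3), mu0=1.0, S=S))  # z = 0 recovers S_j / S: [0.1 0.3 0.6]
```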

However, with ilr:

  1. We have m fewer parameters (column dimension smaller by one) before we even begin optimization/regularization (where m is the number of epochs). Are you thinking this doesn't matter because of rank penalty?
  2. It's nice that we can remove this ridge just by our formulation of the likelihood, without a penalty. Do you think with softmax the tuning of the ridge parameter becomes more crucial?
  3. It's nice to have invertibility μ <==> z (isn't it always?)

wsdewitt commented 4 years ago

(equivalently we can use softmax(z) and then put ||z - baseline|| in the Tikhonov penalty)

kamdh commented 4 years ago

Got it, that makes more sense. So the idea is to have the optimization work with a variable that defines a perturbation from the naive estimate S_j / S. I like your formulation μ = μ0 * softmax(baseline + z).

I agree that the ilr is principled, and could offer benefits. But the fact that it uses the medians makes it kind of complicated IMO.

  1. Fewer parameters is good. It probably doesn't matter too much, but we won't know until we try. (The extra parameters can be seen as parametrizing the translations that leave softmax invariant.)
  2. I don't think it will be super crucial. The optimization is trying to find a matrix W = softmax(Z), or W = ilr_inv(Z) = softmax(AZ). One thing I could see the ilr helping with is somehow making the conditioning better, since the matrix A might help there.
  3. Invertibility is nice, although it isn't necessary if what we interpret in the end are the mu variables.

I am not super confident in any of these details, I'd say. If it isn't hard to try, I'd think we could just try the formulation you came up with (keeping the regularization terms exactly as-is) and see what comes out.

I wasn't thinking of this as a big problem. Is this issue something you've found while trying to get the parameters tuned for various real datasets?

kamdh commented 4 years ago

P.S. I resurrected a nice easy test case that I'll put back into the main branch.

kamdh commented 4 years ago

P.P.S. The equation W = softmax(baseline + Z) isn't invertible; however, it would be easy to find a Z that solves it by setting the first column of Z equal to zero or something.
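A sketch of that gauge-fixing (hypothetical helper, not mushi's API; assumes W is strictly positive with rows summing to 1):

```python
import numpy as np

def invert_with_first_column_zero(W, baseline):
    """One Z with Z[:, 0] = 0 such that row-wise softmax(baseline + Z) equals W."""
    Z = np.log(W) - baseline   # a particular solution, determined only up to a per-row constant
    return Z - Z[:, [0]]       # row shifts leave the softmax unchanged, so fix the first column to zero

W = np.array([[0.2, 0.3, 0.5],
              [0.1, 0.6, 0.3]])
baseline = np.log(np.array([0.25, 0.25, 0.5]))
Z = invert_with_first_column_zero(W, baseline)
E = np.exp(baseline + Z)
print(np.allclose(E / E.sum(axis=1, keepdims=True), W))  # True
```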

wsdewitt commented 4 years ago

When you say "uses the medians" do you mean the geometric means for the clr? Note these don't ever appear in the loss. They only show up when you transform to the Aitchison space (e.g. in some of our plotting functions). In the loss we're always doing the inverse transform out of Aitchison space:

μ = μ0 ilr_inv(z) = μ0 softmax(Ψ z),

where Ψ is the constant matrix of orthonormal basis vectors for the simplex. So this doesn't look very different, except for the appearance of Ψ.
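A minimal sketch of that inverse map (illustrative, not mushi's implementation); note that the origin z = 0 maps to the uniform composition, which is the compositional-balance shrinkage target discussed above:

```python
import numpy as np

def ilr_inv(z, Psi):
    """softmax(Psi @ z): map z in R^(k-1) back onto the k-simplex."""
    x = np.exp(Psi @ z)
    return x / x.sum()

# any orthonormal zero-sum basis works; here a QR-based one for k = 3
Psi, _ = np.linalg.qr(np.eye(3)[:, :-1] - np.eye(3)[:, [-1]])
print(ilr_inv(np.zeros(2), Psi))  # the origin maps to the uniform composition [1/3 1/3 1/3]
```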

kamdh commented 4 years ago

Yeah, somehow the geometric means enter into the ilr function, right? I.e. to go from a mutation space into a z space.
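For concreteness, here's a sketch of how the geometric mean enters the forward ilr, i.e. the map from mutation space into z space (illustrative only; per the reply below, this direction isn't used in the cost):

```python
import numpy as np

def ilr(x, Psi):
    """Forward ilr: clr (subtract the log geometric mean), then project onto the basis Psi."""
    clr = np.log(x) - np.log(x).mean()   # this is where the geometric mean enters
    return Psi.T @ clr

Psi, _ = np.linalg.qr(np.eye(3)[:, :-1] - np.eye(3)[:, [-1]])
x = np.array([0.2, 0.3, 0.5])
z = ilr(x, Psi)
print(np.allclose(np.exp(Psi @ z) / np.exp(Psi @ z).sum(), x))  # True: the inverse recovers x
```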

kamdh commented 4 years ago

My impression is that these more complicated ilr transforms (linked when you opened the issue) are useful for statistical analysis of compositional data, but I'm not certain they will really buy us anything in this inference problem. So my instinct is to go for the least complicated method of ensuring that the rows of the MuSH sum to 1.

kamdh commented 4 years ago

If you want to ensure that the shrinkage/regularization pushes you towards S_j / S, then I think the baseline formulation is good.

wsdewitt commented 4 years ago

> Yeah, somehow the geometric means enter into the ilr function, right? I.e. to go from a mutation space into a z space.

Yeah, I was just saying we're only ever using the inverse in the cost, which is softmax(Ψ z).

kamdh commented 4 years ago

I created a branch that simply does softmax without the change of basis, and it works for the simulation_simple notebook. I have not extensively tested it, but it seems good. I have not added the baseline, however.

Re: invertibility, I just initialize with the log: https://math.stackexchange.com/questions/2786600/invert-the-softmax-function
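i.e. something like this toy sketch (assumes W is strictly positive with rows summing to 1):

```python
import numpy as np

def softmax_rows(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))  # stabilized row-wise softmax
    return E / E.sum(axis=1, keepdims=True)

W = np.array([[0.2, 0.3, 0.5],
              [0.1, 0.6, 0.3]])
Z0 = np.log(W)                            # initialize with the log
print(np.allclose(softmax_rows(Z0), W))   # True: softmax(log W) = W when rows sum to 1
```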