harrispopgen / mushi

[mu]tation [s]pectrum [h]istory [i]nference
https://harrispopgen.github.io/mushi/
MIT License

fused lasso #13

Closed: wsdewitt closed this issue 5 years ago

wsdewitt commented 5 years ago

Implement an L1 penalty on the \mu derivative (fused lasso), and perhaps keep some L2 to induce strong convexity. The idea is to infer piecewise constant histories, allowing for discontinuous jumps. This penalty term is not differentiable, so we'll need more general subgradient optimization methods. Jean Feng recommends ADMM; @kharris recommends proxTV.

We also have our Tikhonov matrix \Gamma, which weights according to proximity to the coalescent horizon rather than being just a difference operator, so we might need to hack this a bit.
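
For concreteness, here's a minimal sketch of the penalty term I have in mind (names like `mu`, `lam_tv`, `lam_ridge` are placeholders, not mushi's API, and the \Gamma weighting is left out):

```python
import numpy as np

def fused_lasso_penalty(mu, lam_tv, lam_ridge):
    """L1 penalty on first differences of mu (fused lasso / total variation),
    plus a small L2 ridge term to keep the objective strongly convex."""
    diffs = np.diff(mu)  # mu[i+1] - mu[i]
    return lam_tv * np.abs(diffs).sum() + 0.5 * lam_ridge * np.sum(mu ** 2)
```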

kamdh commented 5 years ago

Just commenting that proxTV is the correct package. ADMM is easy to implement and may work for us, but it is less scalable to high-dimensional optimization problems and usually slower to converge. I'm not sure what will be best for us in the end.

kamdh commented 5 years ago

Re: terminology, the penalty I thought you meant by "fused lasso" is "group lasso". I like calling it total variation, though.

wsdewitt commented 5 years ago

After a quick look at the proxTV docs I'm seeing only least squares loss. Do you know if it can accommodate our Poisson random field loss?

kamdh commented 5 years ago

Assume your cost is C(x) = f(x) + g(x), where f and g are convex, f is differentiable, and g is non-differentiable but has a proximal operator we can compute.

The Poisson loss is incorporated into the differentiable part of your cost function, f. So, presuming you can compute its gradient, you won't need to worry about that for the proximal gradient algorithm.

The prox operation solves a subproblem that combines a least-squares term with the non-smooth term g (TV for us): prox_g(y) = argmin_x (1/2) ||x - y||_2^2 + g(x).
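
A minimal sketch of the resulting update (the function names are placeholders, and the proxTV call in the usage comment is my reading of its Python bindings, so check the docs for the actual signature):

```python
import numpy as np

def proximal_gradient(grad_f, prox_g, x0, step, n_iter=500):
    """Proximal gradient: x <- prox_{step * g}(x - step * grad_f(x)).

    grad_f(x): gradient of the smooth part f (the Poisson loss here)
    prox_g(y, t): proximal operator of g scaled by t (e.g. a 1D TV prox)
    """
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_iter):
        x = prox_g(x - step * grad_f(x), step)
    return x

# Hypothetical usage with proxTV's 1D total-variation prox:
# import prox_tv as ptv
# z_hat = proximal_gradient(grad_poisson_loss,
#                           lambda y, t: ptv.tv1_1d(y, lam * t),
#                           x0=np.ones(n), step=1e-2)
```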

wsdewitt commented 5 years ago

Ah, I see: these routines evaluate the prox operator, and the suggestion is to use them to make proximal gradient updates. OK, I can try this.

I read in "Statistical Learning with Sparsity" section 4.5.1.1 that the difference matrix might lead to problems for proximal gradient descent. Section 4.5.1.4 describes a cool dynamic programming approach, but I don't think it works for us: the log(C M z) in our log-likelihood couples the components, so we can't pull out the terms in a given z_i.
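
To spell out the coupling (this is my reading of the notation, with x the observed spectrum and C M z the expected one): up to constants the Poisson log-likelihood is

\ell(z) = \sum_i [ x_i \log((C M z)_i) - (C M z)_i ],

and since (C M z)_i = \sum_j (C M)_{ij} z_j, the log term mixes every component of z, so the objective never separates into a sum of functions of individual z_i, which is what the dynamic programming recursion needs.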