petrelharp / context

Context-dependent mutation rate inference machinery.

How to avoid parameter sprawl? #2

Open matsen opened 10 years ago

matsen commented 10 years ago
  1. An explicit penalty (e.g. L^1) as part of the likelihood (see the sketch below)
  2. Shrinkage, Bayesian style (e.g. this post)
  3. Adding parameters one at a time and doing sequential model selection

We won't be able to do number 2 fully, but perhaps it's a framework to start from?
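As a concrete illustration of option 1, here is a minimal sketch of an L^1-penalized fit, assuming the model exposes a log-likelihood over a vector of rate parameters. The names `penalized_negloglik` and `toy_loglik` and the toy numbers are hypothetical stand-ins, not this package's API:

```python
import numpy as np
from scipy.optimize import minimize

def penalized_negloglik(rates, loglik, lam):
    """Negative log-likelihood plus an L^1 penalty on the rate parameters;
    lam controls how strongly extra context parameters are shrunk toward zero."""
    return -loglik(rates) + lam * np.sum(np.abs(rates))

# Toy stand-in for the real model's log-likelihood.
toy_loglik = lambda r: -np.sum((r - np.array([0.10, 0.02, 0.0])) ** 2)

fit = minimize(penalized_negloglik, x0=np.full(3, 0.01),
               args=(toy_loglik, 0.05), method="L-BFGS-B",
               bounds=[(0, None)] * 3)  # keep rates nonnegative, per the discussion
print(fit.x)  # weakly supported parameters end up at (or very near) zero
```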

I'd like to understand if you are thinking of all mutations as coming from these context-dependent processes, or if they are sitting on top of some base mutation process, like in your CpG example.

petrelharp commented 10 years ago

So, (1) and (3) seem straightforward.

For (2): I'm not really motivated by the PCA notion of that post, but the general framework seems fairly well worked out; here are some references. The horseshoe prior seems well-justified: parameters are Normal, with half-Cauchy variances (an independent variance for each parameter). A drawback is that we'd then have to MCMC over the variances as well, since there isn't a nice expression for the posterior. (I think this is what they do? I can't quite tell; maybe there is more info in this paper.)
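For reference, the hierarchy is simple to write down. A minimal sketch (my illustration, not code from those references) of drawing from such a prior, with each parameter Normal and an independent half-Cauchy scale:

```python
import numpy as np

rng = np.random.default_rng(1)

def horseshoe_style_draw(n, tau=1.0):
    """beta_i ~ Normal(0, tau * lambda_i) with lambda_i ~ half-Cauchy(0, 1):
    most draws land near zero, but the heavy tails leave a few large ones."""
    lam = np.abs(rng.standard_cauchy(n))  # independent half-Cauchy local scales
    return rng.normal(0.0, tau * lam)

draws = horseshoe_style_draw(10_000)
print(np.mean(np.abs(draws) < 0.1), np.abs(draws).max())
```

The lambda_i are latent, which is exactly the "also MCMC over the variances" cost: a sampler has to move them along with the parameters themselves.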

Also, that approach allows the parameters to be positive or negative. So far I've been constraining the rates to be positive, although we could allow higher-order mutation rates to be negative as long as the sum of the relevant terms is nonnegative (i.e. G->T + GC->TC should be nonnegative). I still think nonnegative rates are the right way to go, but I could imagine a situation where allowing negative corrections would end up with a more parsimonious model.
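To make that constraint concrete, here is a tiny check using hypothetical parameter values (the dictionary layout is my invention, not the package's representation):

```python
# Hypothetical fitted values: a base rate plus one (negative) context adjustment.
params = {
    ("G", "T"): 0.10,     # G -> T base rate
    ("GC", "TC"): -0.04,  # adjustment when the G is followed by a C
}

# The constraint: each fully-specified rate, i.e. the base rate plus all
# applicable adjustments, must be nonnegative even if adjustments are not.
total = params[("G", "T")] + params[("GC", "TC")]
assert total >= 0, "G->T + GC->TC must be nonnegative"
print(total)  # ~0.06: a negative correction, but a valid total rate
```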

The last question: I'm imagining mutation rates built up from rules of increasing complexity, explaining as much as possible with single-base rates, as in the CpG example. That's why things are set up to allow patterns of arbitrary length, so the rate for CG -> CT adds to the rate for G -> T. This way we have a nice, interpretable representation of the spectrum, putting as much emphasis as possible on single-base rates.
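A sketch of that composition rule, assuming (my assumption, not necessarily the actual data structures) that rules are (pattern, replacement, rate) triples and that a fully-specified rate is the sum over every sub-rule producing the same change:

```python
# Hypothetical rule set: a single-base rule plus a longer-context correction.
rules = [
    ("G",  "T",  1e-8),  # baseline G -> T
    ("CG", "CT", 9e-8),  # CpG effect; total CG -> CT rate = 1e-8 + 9e-8
]

def rate(pattern, replacement):
    """Total rate for pattern -> replacement: sum of every rule that, applied
    at some offset within `pattern`, yields `replacement`."""
    total = 0.0
    for src, dst, r in rules:
        for i in range(len(pattern) - len(src) + 1):
            if (pattern[i:i + len(src)] == src
                    and pattern[:i] + dst + pattern[i + len(src):] == replacement):
                total += r
    return total

print(rate("CG", "CT"))  # ~1e-7: the G -> T rule contributes underneath the CpG rule
```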

matsen commented 10 years ago

Thank you for thinking about this, Peter. In following the references, I learned a bunch, though I'm still quite new to this. Clearly there are lots of working parts already, so going with the simplest thing (which is probably 1, and your original idea!) seems smart. As Lartillot says, that is a type of soft shrinkage. I wonder, though, if potential reviewers would be happier with (3).

Re last question: sounds great.

petrelharp commented 10 years ago

Do you have any intuition about the nonnegativity question? My default model is like "G->T at some rate, but then being next to a C increases that"; but I could just as easily have said that the context decreases the rate, in which case the nonnegative constraints would end up with extra parameters for GA->TA, GG -> TG, and GT -> TT. Maybe there's nothing wrong with allowing negative rate adjustments, and constraining appropriate sums to be nonnegative?

matsen commented 10 years ago

I don't think I would have thought of negative rates! Seems fun.

Could we allow that formally, but say that what's really going on behind the scenes is that we have mutation rates that are all positive, but just happen to be the difference between two positive reals?
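That always works: any real adjustment r splits as r = up - down with up, down >= 0 (take up = max(r, 0), down = max(-r, 0)). A one-line sketch, names mine:

```python
def split_rate(r):
    """Express a possibly-negative adjustment as the difference of two
    nonnegative rates: r = up - down."""
    return max(r, 0.0), max(-r, 0.0)

up, down = split_rate(-0.04)  # a negative GC -> TC adjustment...
print(up, down, up - down)    # 0.0 0.04 -0.04: two nonnegative rates underneath
```

So the observable rate can dip below the base rate while the underlying processes all run at nonnegative rates.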


petrelharp commented 10 years ago

Right.


matsen commented 10 years ago

Adding parameters one at a time and doing sequential model selection it is!