popsim-consortium / stdpopsim

A library of standard population genetic models
GNU General Public License v3.0
123 stars 87 forks source link

Simulating model parameters from priors #299

Open gtsambos opened 4 years ago

gtsambos commented 4 years ago

Inspired by the discussion in #269, I wanted to open up some discussion about the possibility of modifying the catalog models to give users the option of simulating the parameter values from prior distributions, rather than fixing the parameters from the point estimates in the cited manuscripts.

Here's what I wrote about it over there:

Atm, it seems like stdpopsim is focused on fixing the parameters using point estimates from papers (a v worthy and ambitious goal in itself!) -- but if this broader alternative was to ever be part of stdpopsim, it might be useful to keep things as they are. In this situation, each simulation would be generated from a particular demographic object that is instantiated from a broader model class.

Btw, I don't know whether doing this would be "too broad" for stdpopsim, but I think that this would be a useful thing to at least think about given the stated aims of the consortium re: doing powerful standardised benchmarking. It would allow users to perform "sensitivity analyses" more easily - ie., tests of how robust various methods are to slight uncertainties and misspecifications in their model.

This is obviously going to be a bit of work (probably too much for it to realistically be part of the first release), but I think it would be very useful. In principle, I'd be happy to lead efforts to get this into stdpopsim as it dovetails quite well with some of my other PhD work at the moment, but I should probably talk more with my supervisors before I commit to this 😅

All discussion about this is welcome here, but some particular things I want to hear about are:

  1. What would you need for this feature to be useful for you?

  2. For those of you who are model developers, and who have scrutinised papers for details of these models -- did you typically see any other information that might have been been useful for the construction of priors? (For instance, standard errors of the parameter estimates???) How informative or not should we aim to make our own priors?

  3. How might we be able to shoehorn this into the existing code base with minimal changes to the existing structure?

jeromekelleher commented 4 years ago

This is a great idea, and would be a useful feature for many people I guess. It's definitely post 0.1.0 though :smile:

gtsambos commented 4 years ago

It's definitely post 0.1.0 though 😄

make another issue tag for 'dreams and hopes' 🔮

What would you need for this feature to be useful for you?

I'm guessing that people who use this would typically want to run multiple simulations from the same demographic model. So, I guess a baby first step could be: adding a --replicates option to the CLI? I don't this would be too hard, as there's already an option for this in msprime.

jeromekelleher commented 4 years ago

I'm guessing that people who use this would typically want to run multiple simulations from the same demographic model. So, I guess a baby first step could be: adding a --replicates option to the CLI? I don't this would be too hard, as there's already an option for this in msprime.

That's awkward with the way we handle output though. In particular, we couldn't write the results to stdout. We would make --output-file mandatory if replicates is > 1 though. We'd have to have some way of numbering the output files too.

grahamgower commented 4 years ago
  1. Often 95% confidence intervals are reported for inferred model parameters. Its not clear what shape such distributions should take, in general.
  2. Oops. I already posted this nastiness on the other thread: https://github.com/popgensims/stdpopsim/issues/269#issuecomment-563177292
andrewkern commented 4 years ago

just here to say i love the ENHANCEMENT label

gtsambos commented 4 years ago

That's awkward with the way we handle output though. In particular, we couldn't write the results to stdout. We would make --output-file mandatory if replicates is > 1 though. We'd have to have some way of numbering the output files too.

I think if you were doing this you would probably be writing the output to a file anyway, and I think you could just make it so that if the user specifies an output parameter foo.trees, the replicates are called foo.trees.1, foo.trees.2 etc

I didn't consider the fact that the msprime replicates function simulates replicates from the same parameter values every time, which is not what we want here. So actually, maybe what we want here is not quite --replicates but something else.

agladstein commented 4 years ago

Following this issue! I'd be up for working with someone on this (in the future). I've done a lot of thinking about this for ABC stuff and it's the main point of my SimPrily project https://agladstein.github.io/SimPrily/

One of the main features of SimPrily is that I include a high throughput workflow for the Open Science Grid for many long running simulations - this way beginners can run millions of hr long simulations (e.g. large locus sizes with relatively low recombination rate to mutation rate ratio). We could easily do the same thing for stdpopsim. But also, we can include smaller scale distribution for multi-core machines and clusters. I think this is important because if you are using priors, you are probably going to be running at least 1000's of simulations and they are not always fast. Based on simulation parameters we could print out an initial warning like, "hey, you just asked to simulation N simulations of R length, and you are only using 1 CPU, that'll probably take a long time".