related-sciences / gwas-analysis

GWAS data analysis experiments
Apache License 2.0

PyData prototype simulation methods #31

Open eric-czech opened 4 years ago

eric-czech commented 4 years ago

We should start thinking about how to simulate data as part of a public API. PLINK and Hail both support this, and it is worth addressing now because it will be an important part of improving unit testing. I was chatting with @ravwojdyla and we're both at about the same place in our testing: our test cases are fairly simple right now, and we would both benefit from synthetic data representing a single dimension of genetic structure, likely with some tunable level of complexity. Essentially we need a better version of Hypothesis and while we're at it, why not make it part of the API?

Some examples:

It may be that most users don't care about simulators that aren't representative of comprehensive genetic structure (e.g. hapgen), but I think being explicit about our simulations would improve understanding of the methods, and that this is something we should coordinate on regardless, rather than making private versions for test cases on a per-method basis. This would also make it much easier to demonstrate what a method does without always having to appeal to real datasets.
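As a rough illustration of what a shared, explicit simulator for tests could look like, here is a minimal sketch. The function name `simulate_association_data`, its parameters, and the (variants x samples) layout are all hypothetical choices for this example, not an existing API. It draws independent genotypes from per-variant allele frequencies and layers an additive phenotype on a few causal variants, with noise scaled to a target heritability `h2`.

```python
import numpy as np


def simulate_association_data(n_variants=100, n_samples=500, n_causal=5, h2=0.5, seed=0):
    """Hypothetical helper: unstructured genotypes plus an additive phenotype."""
    rng = np.random.default_rng(seed)

    # Per-variant allele frequencies (assumed range, chosen for illustration).
    freq = rng.uniform(0.05, 0.5, size=n_variants)

    # Alternate allele counts in {0, 1, 2}, shape (n_variants, n_samples).
    genotypes = rng.binomial(2, freq[:, None], size=(n_variants, n_samples))

    # Pick causal variants and give them normally distributed effects.
    causal = rng.choice(n_variants, size=n_causal, replace=False)
    beta = np.zeros(n_variants)
    beta[causal] = rng.normal(size=n_causal)

    # Genetic component per sample, then noise scaled so that
    # var(genetic) / var(phenotype) is approximately h2.
    genetic = beta @ genotypes
    noise_sd = np.sqrt(genetic.var() * (1 - h2) / h2)
    phenotype = genetic + rng.normal(scale=noise_sd, size=n_samples)

    return genotypes, phenotype, causal
```

Because the causal variants are returned, a test for an association method can check that those variants rank near the top, which is hard to assert on real data.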

eric-czech commented 4 years ago

Along these lines, G2P (from https://pubmed.ncbi.nlm.nih.gov/30848784/) is probably the most comprehensive tool I've seen so far for generating both genotypes and phenotypes with a good bit of configurability.

PhenotypeSimulator (Mayer & Birney 2018) is another decent one for layering in relationships between given genotypes and generated phenotypes. What it supports is a good outline for what a simulator for association testing should do:

The examples are pretty good too, e.g.:

A few other phenotype-only tools:

eric-czech commented 4 years ago

Another one: bnpsd combines the Balding-Nichols model with the Pritchard-Stephens-Donnelly (PSD) model to simulate admixture across populations, which produces more realistic kinship matrices.
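For reference, the BN-PSD idea can be sketched in NumPy roughly as follows. This is a simplified illustration under my own assumptions (uniform ancestral frequencies, Dirichlet admixture proportions, a hypothetical `simulate_bn_psd` function), not the bnpsd package's implementation. Balding-Nichols draws subpopulation allele frequencies around an ancestral frequency with spread controlled by F_ST, and PSD mixes those frequencies per individual using admixture proportions.

```python
import numpy as np


def simulate_bn_psd(n_variants, n_samples, fst, alpha=1.0, seed=0):
    """Hypothetical sketch of a BN-PSD genotype simulation.

    fst: per-subpopulation F_ST values in (0, 1), length K.
    alpha: Dirichlet concentration for admixture proportions (an assumption).
    """
    rng = np.random.default_rng(seed)
    fst = np.asarray(fst, dtype=float)
    k = fst.shape[0]

    # Ancestral allele frequencies (range assumed for illustration).
    p_anc = rng.uniform(0.01, 0.5, size=n_variants)

    # Balding-Nichols: p_sub[i, u] ~ Beta(p_i * s_u, (1 - p_i) * s_u),
    # where s_u = (1 - F_u) / F_u.
    scale = (1.0 - fst) / fst
    a = p_anc[:, None] * scale[None, :]
    b = (1.0 - p_anc)[:, None] * scale[None, :]
    p_sub = rng.beta(a, b)  # shape (n_variants, K)

    # PSD: each individual has admixture proportions over the K subpopulations.
    q = rng.dirichlet(np.full(k, alpha), size=n_samples)  # shape (n_samples, K)

    # Individual-specific allele frequencies are admixture-weighted averages.
    p_ind = np.clip(p_sub @ q.T, 0.0, 1.0)  # shape (n_variants, n_samples)

    # Genotypes are binomial draws of two alleles at those frequencies.
    genotypes = rng.binomial(2, p_ind)
    return genotypes, q, p_sub
```

For example, `simulate_bn_psd(1000, 200, fst=[0.1, 0.2, 0.3])` would give a 1000 x 200 genotype matrix along with the admixture proportions that generated it, so the true population structure is known to the test.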

eric-czech commented 4 years ago

Note: If we do include simulation functions, they should definitely support synthetic missingness. That was a big hang-up I had in using simulated data to better understand how Hail works.
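A minimal sketch of what synthetic missingness could look like, assuming calls go missing completely at random at a fixed rate and that `-1` is the missing sentinel (both are assumptions for this example, as is the `add_missingness` name):

```python
import numpy as np


def add_missingness(genotypes, rate, missing_value=-1, seed=0):
    """Hypothetical helper: mask genotype calls completely at random."""
    rng = np.random.default_rng(seed)

    # Copy so the caller's complete data is left untouched.
    out = np.array(genotypes, dtype=np.int8, copy=True)

    # Each call is independently set to missing with probability `rate`.
    mask = rng.random(out.shape) < rate
    out[mask] = missing_value
    return out, mask
```

Returning the mask alongside the masked matrix lets a test compare a method's output on incomplete data against the known complete data. Missingness that depends on sample or variant (e.g. per-sample call rates) would need a different mask, which this sketch does not cover.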

eric-czech commented 4 years ago

see also: https://discourse.smadstatgen.org/t/common-patterns-in-human-population-simulation/48

eric-czech commented 4 years ago

Add this implementation of BN/PSD in dask at some point: https://github.com/dask/dask/issues/6227#issuecomment-633084419