stschiff / msmc

Implementation of the multiple sequential markovian coalescent
GNU General Public License v3.0

Synthetic Haplotypes #21

Closed: LiverpoolHarry closed this issue 6 years ago

LiverpoolHarry commented 8 years ago

I am working on African populations that have substantial admixture. Running MSMC on these genome sequences tends to give larger than expected times to MRCA. I assume this is because we are not just looking at migration events but also at admixture between previously isolated populations. I have about fifty genome sequences from each of seven populations, so I have plenty of spare data. Could I use PCAdmix to identify the parts of the genome in each sample that come from a given founder population, cobble together a set of virtual genomes that are all derived from a putatively single population, and then run MSMC on those samples? Or will this introduce horrible artifacts?

Any thoughts would be most welcome.

stschiff commented 8 years ago

This might work, but the regions you exclude must be properly masked as missing data; you cannot concatenate segments that are not adjacent in the genome. You also need to make sure the haplotypes are aligned properly: MSMC cannot be run on four haplotypes taken from different locations, they all have to cover the same positions. Finally, if a region is missing in only one of the haplotypes, it has to be treated as missing in all of them, because MSMC cannot deal with missing data in only some haplotypes.
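A minimal sketch of the masking logic described above, under stated assumptions: each haplotype's target-ancestry tracts are already available as half-open `(start, end)` intervals on one chromosome (how these are derived from PCAdmix output is not shown), and the helper names `intersect_two`, `shared_tracts`, and `to_bed` are hypothetical, not part of MSMC or msmc-tools. The idea is to keep only positions where every haplotype carries the target ancestry and to treat everything else as missing, with no concatenation of distant segments.

```python
from typing import List, Tuple

# Half-open [start, end) coordinates on a single chromosome (BED-style).
Interval = Tuple[int, int]


def intersect_two(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Intersect two sorted, non-overlapping interval lists."""
    out: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def shared_tracts(per_haplotype: List[List[Interval]]) -> List[Interval]:
    """Regions where ALL haplotypes carry the target ancestry.

    A position missing in any one haplotype is dropped for all of them.
    """
    if not per_haplotype:
        return []
    shared = sorted(per_haplotype[0])
    for tracts in per_haplotype[1:]:
        shared = intersect_two(shared, sorted(tracts))
    return shared


def to_bed(chrom: str, intervals: List[Interval]) -> str:
    """Format intervals as BED lines, usable as a positive mask."""
    return "".join(f"{chrom}\t{s}\t{e}\n" for s, e in intervals)


# Made-up tracts for four haplotypes, purely for illustration.
tracts = [
    [(0, 1000), (2000, 5000)],
    [(500, 3000)],
    [(0, 2500), (4000, 6000)],
    [(100, 4500)],
]
shared = shared_tracts(tracts)
bed_text = to_bed("chr1", shared)
```

The resulting BED text marks the only regions that should be treated as callable; everything outside it is masked in all haplotypes at once. Such a file could presumably be passed as an additional mask when generating the MSMC input, but how it is combined with per-sample callability masks depends on your pipeline.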

Hope that helped, Stephan

P.S. Please join the msmc-popgen google group. Thanks.

(Sent by email in reply to LiverpoolHarry's comment of 12 Apr 2016, 16:28.)