tskit-dev / msprime

Simulate genealogical trees and genomic sequence data using population genetic models
GNU General Public License v3.0
173 stars 85 forks source link

Inconsistencies between algorithm.c and algorithm.py #2217

Open YuanboSong opened 11 months ago

YuanboSong commented 11 months ago

I've been looking at the algorithm.c and algorithm.py files in msprime and noticed some differences. The algorithm.c file seems to use diploid as a default setting, which affects how the simulations are run and the results we get. On the other hand, the algorithm.py file doesn't seem to have a setting for ploidy, which might lead to different results.

One clear difference I saw is in the single sweep model simulations. The Python version gives different diversity plots compared to the C version. I'm also not sure how forward simulations are run in the algorithm.py file, which might be part of the reason for the differences.

Thank you for looking into this!

GertjanBisschop commented 11 months ago

There are some differences between the sweep implementation in algorithms.py and msprime.c. From the code in algorithms.py I derive that the ploidy is implicitly defined as 2. From my attempts at generating sweeps using algorithms.py it seems to me that the best approach here would be to define a set of parameters that generates a clear sweep pattern. I think it is key here to take a small enough time step for the frequency trajectory simulation on which the coalescent is conditioned. Once you have such a set of parameters you should be able to modify the algorithm to accommodate for population structure. Forwards in time frequency simulations are done by an instance of the TrajectorySimulator class. Let me know if I missed something, or if you might have further questions.