simonhmartin / twisst

Topology weighting by iterative sampling of sub-trees
GNU General Public License v3.0

Any statistical methods to infer "significant" signals of introgression via local regions? #40

Closed Callithrix-omics closed 1 year ago

Callithrix-omics commented 1 year ago

Would anyone be able to give insight into whether there is a statistical method for Twisst results to infer "significant" signals of introgression via local regions of the genome where the alternative topology is greater than the true topology?

thank you

simonhmartin commented 1 year ago

This is not a straightforward problem.

First, we need to clarify what you mean by significant. In classic tests for introgression, like the ABBA-BABA test, "significant" means that the genome-wide skew in the number of ABBA and BABA sites (which represent two discordant topologies) cannot be explained by a model without gene flow. In other words, given enough of an excess of ABBA over BABA sites, we can reject the null model that there is no gene flow. However, this genome-wide measure cannot tell us which of those ABBA sites are caused by gene flow and which are caused by ancestral incomplete lineage sorting (ILS). It can only tell us that some of them are caused by gene flow.
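To make the genome-wide logic concrete, here is a minimal sketch of the D statistic with a leave-one-block-out jackknife. It is not part of Twisst. The function names and the per-block counts are made up for illustration, and it assumes the genome has already been split into roughly equal-sized blocks with ABBA and BABA sites counted in each.

```python
import math

def d_statistic(abba, baba):
    """D = (ABBA - BABA) / (ABBA + BABA), summed over blocks."""
    total_abba, total_baba = sum(abba), sum(baba)
    return (total_abba - total_baba) / (total_abba + total_baba)

def jackknife_z(abba, baba):
    """Return D and its Z-score from a leave-one-block-out jackknife.

    Assumes equal-sized blocks (an unweighted jackknife).
    """
    n = len(abba)
    d_full = d_statistic(abba, baba)
    # Recompute D with each block removed in turn
    leave_one_out = [
        d_statistic(abba[:i] + abba[i + 1:], baba[:i] + baba[i + 1:])
        for i in range(n)
    ]
    mean_loo = sum(leave_one_out) / n
    variance = (n - 1) / n * sum((d - mean_loo) ** 2 for d in leave_one_out)
    return d_full, d_full / math.sqrt(variance)

# Hypothetical per-block site counts, invented for this example
abba = [30, 28, 35, 31, 29, 33, 27, 32]
baba = [20, 22, 18, 21, 19, 23, 20, 17]
d, z = jackknife_z(abba, baba)
print(f"D = {d:.3f}, Z = {z:.1f}")
```

A large Z here only says that the genome-wide excess is unlikely without gene flow. It says nothing about which individual sites or blocks are introgressed.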

When we think about local genealogies, we can make a similar argument. The multispecies coalescent model tells us that, under any given species tree, even without gene flow, every conceivable local genealogy (and therefore every possible set of local Twisst weightings) can occur. Of course some are much more likely than others, but we usually see that weightings for all possible topologies are > 0 genome-wide, which is not surprising. We can quite easily identify topologies that have higher weightings on average than others that would be expected to occur at the same frequency. For example, in Figure 2 of this paper Topology 6 and Topology 9 should have equivalent weightings in the absence of gene flow, so the massive excess of topology 6 is consistent with genome-wide introgression, but this cannot tell us whether any specific localised region with a high weighting for topology 6 is caused by introgression rather than ILS.
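One way to put a number on such a genome-wide asymmetry is a block bootstrap of the per-window difference between the two weightings. This is only a sketch under stated assumptions. The weightings below are simulated rather than real Twisst output, the function name and block size are arbitrary choices, and resampling contiguous blocks is just one simple way to allow for linkage between neighbouring windows.

```python
import random

def block_bootstrap_ci(diffs, block_size, n_boot=2000, seed=1):
    """Approximate 95% interval for the mean per-window difference.

    Resamples contiguous blocks of windows with replacement, so that
    correlated neighbouring windows are kept together.
    """
    rng = random.Random(seed)
    blocks = [diffs[i:i + block_size] for i in range(0, len(diffs), block_size)]
    means = []
    for _ in range(n_boot):
        # Draw as many blocks as there are, with replacement
        sample = [x for _b in blocks for x in rng.choice(blocks)]
        means.append(sum(sample) / len(sample))
    means.sort()
    return means[int(0.025 * n_boot)], means[int(0.975 * n_boot) - 1]

# Simulated weightings for two topologies that should be equally
# frequent without gene flow (values invented for this example)
data_rng = random.Random(42)
w_a = [0.3 + 0.1 * data_rng.random() for _ in range(200)]
w_b = [0.1 + 0.1 * data_rng.random() for _ in range(200)]
diffs = [a - b for a, b in zip(w_a, w_b)]

low, high = block_bootstrap_ci(diffs, block_size=10)
print(f"mean difference 95% interval: {low:.3f} to {high:.3f}")
```

An interval that excludes zero supports a genome-wide excess of one topology. As with D, it does not identify which windows are responsible.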

So again we need to ask what we mean by "significant" for a local genealogy. Do we mean that it is highly unlikely to have seen such a weighting anywhere in the genome by chance under a model without gene flow? This is a very high bar, because it requires an extremely low expected weighting for that topology - essentially zero. We also usually don't know the true expected weighting for a given topology, because computing it would require knowledge of the full species history: split times, population sizes of ancestral populations, and rates of gene flow.

However, with Twisst there is additional information, because the higher the local weighting for a given topology, the more monophyletic it is. Theoretically, we might be able to identify statistical outliers: windows that have very high local weightings despite a low genome-wide average. I would not say that this is statistical evidence for introgression per se, but it could provide statistical support that the introgression was adaptive. An example of this can be seen in Figure 3 of this paper, where topologies 11, 14 and 6 have relatively low average weightings, but the boxplot shows that they have numerous local outliers that appear to exceed what is expected in the absence of selection. Currently I am not aware of any theoretical proofs of these expectations, but I think one can make a compelling verbal argument if the data appear as they do in this Figure 3.
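As a rough illustration of that kind of outlier screen, the sketch below flags windows whose weighting for one topology lies above the upper boxplot whisker (third quartile plus 1.5 times the interquartile range). The function name and the weightings are made up, and the whisker rule is just the usual boxplot convention. It is a descriptive filter, not a formal test for introgression or selection.

```python
import statistics

def boxplot_outlier_windows(weights, k=1.5):
    """Return indices of windows above the upper boxplot whisker."""
    q1, _median, q3 = statistics.quantiles(weights, n=4)
    upper = q3 + k * (q3 - q1)
    return [i for i, w in enumerate(weights) if w > upper]

# Hypothetical normalised weightings for one topology across 12 windows:
# low on average, with two windows where it is nearly fixed
weights = [0.02, 0.03, 0.01, 0.04, 0.02, 0.95,
           0.03, 0.02, 0.05, 0.01, 0.88, 0.03]

outliers = boxplot_outlier_windows(weights)
print(outliers)  # [5, 10]
```

Whether such windows exceed what ILS alone could produce still depends on the unknown species history, so they are best treated as candidates for follow-up.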

I hope this helps, and I'm sorry there is no simple answer.

Simon

Callithrix-omics commented 1 year ago

thank you @simonhmartin for your detailed and thoughtful answer.