Protocol Advice: Establishing a t hresholds for the number of allowed admixture events

First off, I just want to say a huge thanks for creating such an awesome tool. I thoroughly enjoyed the ADMIXTOOLS paper published in eLife and found it very informative.

My query pertains specifically to the protocol detailed in the paper, particularly the first point on Initial Scanning and Complexity Class Determination. The paper mentions: "The smallest number of admixture events that yields models with a (negative) LL score or an f-statistic residual lower than a certain threshold should be further investigated by running additional iterations of findGraphs." I'm interested in understanding how one should determine this threshold. Could you offer any advice or guidance on setting an effective threshold for the log-likelihood (LL) score to ascertain the minimum number of admixture events?

As an example, in the attached plot, I have executed find_graphs with 8 populations and varying admixture events ranging from 0 to 6, where the y-axis represents the score for each graph.

plot

I appreciate any insights you could provide.

I'm glad that you find it useful!

I'm interested in understanding how one should determine this threshold.

That's a good question, and unfortunately I don't have a great answer to it, but I can say a few thing that are hopefully helpful:

The qpgraph score is calculated as $r^T \Sigma^{-1} r$ where $r$ are the residuals (difference between estimated and fitted f3-stats), and $\Sigma$ is the (regularized) f3-statistic covariance matrix. This should be proportional to the negative log-likelihood of the model. So there is no need to take the log of the score, unless you want it to be double logarithmic.
The score isn't the same as the log-likelihood because there is no good estimate of the error variance, or of the degrees of freedom in the model.
This makes it difficult to interpret the score in isolation; it's more useful when comparing the scores of different models with the same populations and same number of admixture events.
A major problem with fitting admixture graphs is that it's very easy to overfit to the data. Every increase in the number of admixture events (and the number of populations) greatly increases the number of possible models. It usually only takes a small number of allowed admixture events to reach a point where wildly different models fit the data equally well. That is the reason why it's a good idea to start with very simple models.
One model-fit metric that is easier to interpret than the score is the worst f-statistic residual z-score: Depending on who you ask, and how many populations are in your model, an absolute z-score greater than 3 to 5 or so is probably a good indication that the model is missing something. (On the other hand, lower z-scores don't mean that the model is good. It's easy to get low z-scores or qpgraph scores with bad data or complex models.)

To get the most out of these methods, it is necessary to develop a lot of experience by applying them to many different kinds of data, and to approach the model fitting process with a very critical mindset. My concern about any static protocol with fixed thresholds is that it might give people the false impression that following the protocol can substitute for looking at the data from 100 different angles with a very critical lens. At the same time, I see it as a big problem that the model fitting process is not a lot more objective, transparent, and accessible.

To get back to your question, instead of using the qpgraph LL-score to decide how many admixture events to include, you could use the more interpretable worst residual z-score.

Hi there,

To add to the advice requested above, how should the worst residual scores at each complexity class be compared? More specifically, should the mean scores be compared across complexity classes, or the scores of the best fitting graphs only?

My intuition would suggest that the scores of the best fitting graphs (lowest worst residuals) at each complexity class is more interpretable as it demonstrates the possibility for a graph topology to fit the data to a certain level.

In addition, how should a complexity class be interpretated if a large number of graphs equally achieve the best fit (and therefore same topology)? Would this suggest a complexity class that fits the data well, or perhaps that this complexity class is constraining potentially better fitting topologies?

Any advice would be appreciated. Thanks for the amazing tool!

should the mean scores be compared across complexity classes, or the scores of the best fitting graphs only

It makes more sense to look at the scores of the best fitting graphs only. The majority of the returned models are just random stepping stones along to way to topologies with a better fit.

how should a complexity class be interpretated if a large number of graphs equally achieve the best fit?

If a large number of different graph (with different topologies) achieve similar fits, it is an indication that there is not enough data to disambiguate between these different models. In that case it could help to fit simpler models (with fewer admixture events), until you get to a point where the best-fitting models all share a similar topology.

But it sounds like you have something else in mind since you ask about a large number of graphs with the same topology. Are you running multiple iterations of find_graphs(), and you get the same best graph (with the same topology) in each iteration? If so, that would make it more likely that this graph is a good model, in particular if the next-best graphs with a different topology have a significantly worse fit.

If the best graphs all have a similar score or worst residual, but the scores or worst residuals indicate that the fit is poor, then you could increase the number of admixture events.

I haven't seen examples with more than 3 or 4 admixture events where there is a clearly best fitting model, and no substantially different model can be found that has a similar score or worst residual, so I think it's a good idea to stick to fewer admixture events. However, it's possible, and probably common, that for the populations that you study, any model with a small number of admixture events is oversimplifying, which may result in a poor fit.

As you increase the number of admixture events, there isn't always a point where you go from no well-fitting model, to getting a single well-fitting model. Instead you get multiple, very different models with a good fit. This suggests that there isn't enough data to fit an accurate admixture graph to these populations.

uqrmaie1 / admixtools

Protocol Advice: Establishing a t hresholds for the number of allowed admixture events #56