Inconsistent bootstrap significance testing

theo-atkinson commented 4 months ago

Hi there,

I am currently trying to perform bootstrap significance testing between the best topologies produced from multiple find_graphs iterations. I have noticed that the p_emp value it not necessarily consistent with the score of the graphs.

Please take the following as an example, having run multiple iterations of findgraphs with 1 admixture event for simplicity:

The single graph with the lowest score from each find_graphs iteration are concatenated into a tibble.
Visual inpection of the graphs, and the use of summarize_numadmix and summarize_eventorder shows that multiple graphs have the exact same topology and admixture pattern.
As expected, the scores of such graphs are also almost identical i.e. graph 1a score = 13.1951076092863
graph 1b score = 13.1951076093899.
Another graph with a different topology to graph 1a and 1b is also identified: graph 2 score = 14.8503609038694
Performing bootstrap resampling for graph 1a against graph 2 using qpgraph_resample_multi(nboot = 1000) and compare_fits produces: diff = -1.798 se = 7.751 z = -0.231 p = 0.816 p_emp = 0.656 p_emp_nocorr = 0.656 ci_low = -17.646 ci_high = 17.091
However, performing bootstrap resampling for graph 1b against graph 2 (using the same seed) produces: diff = 49.852 (note this is positive) se = 26.077 z = 1.911 p = 0.055 p_emp = 0.004 p_emp_nocorr = 0.004 ci_low = 14.554 ci_high = 114.796
Upon further inspection it would appear that graphs worst residual scores seem to correlate with the inconsistency in bootstrap results.
Worst residual scores were obtained using qpgraph(return_fstats = TRUE) graph 1a worst residual = 2.29609734687341 graph 1b worst residual = 2.69210878573867 graph 2 worst residual = 2.5461742278267
Multiple graphs with the same score and worst residual as graphs 1a and 1b resulted in the same bootstrap results respectively.
Plotting scores against worst residuals shows that despite graphs being identical, their worst residuals seem to be clustered into 2 groups:

This begs the following questions:

Why is it that graphs with identical topologies, admixture patterns and scores have different worst residuals?
What is the relationship between worst residuals and bootstrap results?
What can be done, if anything, to obtain more consistent results?

For context 1000 iterations of find_graphs was run with 8 populations (1 to 4 samples per population) including an outgroup, using ~80,000 SNPs with no missingness at the population level and the following parameters: numadmix = 1 stop_gen = 10000 stop_gen2 = 30 numgraphs = 10 plusminus_generations = 10

Thanks a lot!

uqrmaie1 commented 4 months ago

Thanks for sharing this example! I'm surprised by it, and I'm not sure what's causing those results. Could you share the graphs 1a, 1b, and 2, as well as the f-statistics with me? That should allow me to figure out what's going on.

I suspect that graphs 1a and 1b don't actually have the same topology. There is a function called isomorphism_classes2() which you can use to find out which graphs in a list of graphs are identical to each other.

What might also be happening is that with your graphs and data, the qpgraph() function isn't consistently able to find the graph parameters that minimize the score. By default, qpgraph() optimizes the set of graph parameters starting from 10 different random sets of initial parameters (argument numstart), and it returns results from the best fit. find_graphs() only uses a single set of initial parameters for each graph by default, which can make topologies look worse than they really are, if some initial parameters don't find the global optimum. On the other hand, find_graphs() evaluates many graphs, which makes it less likely that the best reported graph got stuck in a local optimum.

If the root of the issue is that qpgraph() only sometimes finds the best parameters for those graphs, you should be able to get more consistent results by setting numstart to higher values. You should be able to use that argument also in find_graphs() and qpgraph_resample_multi(). It is passed to qpgraph() via ....

theo-atkinson commented 4 months ago

Hi there,

Here are the graphs and f2 stats (let me know if you need f3 and f4) Graph 1a: graph-1a-f2.txt graph-1a

Graph 1b: graph-1b-f2.txt graph-1b

Graph 2: graph-2-f2.txt graph-2

Yes you are right, graphs 1a and 1b are considered different graphs having used the isomorphism_classes2() function. In fact, there are far more topologies than I had previously realised. How does the isomorphism_classes2() function identify identical graphs?

Apologies if I do not follow correctly, but regarding the numstart function, are you suggesting that as find_graphs() only uses a single set of initial parameters for each graph, in some iterations the optimum fit is not always found, inflating the worst residual score?

Thanks!

uqrmaie1 commented 4 months ago

Thanks for sharing the data! What I would need to reproduce the behavior is the data in the form in which you pass it to the find_graphs() and qpgraph_resample_multi() functions. You could save the graphs and f-stats (the per-block statistics) using the function saveRDS().

If you don't want to share the population names, you can rename them like this: vertex_attr(graph) = list(name = new_vertex_names). You would also have to rename the f-statistics. If you use f2-statistics read into R, you can rename them like this: dimnames(f2_blocks)[[1]] = dimnames(f2_blocks)[[2]] = new_population_names.

Apologies if I do not follow correctly, but regarding the numstart function, are you suggesting that as find_graphs() only uses a single set of initial parameters for each graph, in some iterations the optimum fit is not always found, inflating the worst residual score?

Yes that's correct (except that numstart is the name of an argument in the qpgraph() function). For most graphs and f-statistics, this isn't a big problem, since most sets of initial parameters will lead to a very similar best fit score and worst residual, and similar graph parameters. But sometimes the parameters that lead to the globally best fit of a certain topology are only found under some initial parameters but not others. You could check if that's the case by taking the best graph found via find_graphs() and evaluate it with results = qpgraph(f2stats, bestgraph, numstart = 1000). results$opt will show you a data frame where each row lists the resulting parameters and score (column value) of one set of initial parameters.

How does the isomorphism_classes2() function identify identical graphs?

It starts by renaming the internal nodes of a graph so that graphs with the same topology and leaf node names, but different internal node names are considered identical. But looking at your graphs 1a and 1b, I wonder if the function has a bug, since these graphs look identical to me. If you don't want to share the labelled graphs here, you can send them to me via email. Or you could rename the graphs as I described above!

theo-atkinson commented 4 months ago

Hi,

I have run qpgraph() again on these graphs using numstart=1000 but have now noticed that the score for graph 1b is actually 18.7658 despite find_graphs() output showing a score of 13.2. This would explain the inflated worst residual but not the identical graph and topology. This seems to be the case for the other graphs with inflated worst residuals that I have checked. It sounds like some kind of error when saving the best graph from each find_graphs() iteration but I've looked at my script and can't seem to find the cause.

I have sent you an email with the fstats and script.

Thanks!

uqrmaie1 commented 4 months ago

For anyone else who comes across this thread:

The different scores and inconsistent results for graphs 1a and 1b between qpgraph_resample_multi() and compare_fits() were due to the fact that the f3-statistics for both graphs were calculated with respect to different base populations. By default, the f3 base population is chosen based on the edge order of a graph, which can be different even for graphs with identical topology. In cases like this one, where the fit of a graph is affected by the choice of f3-base-population, it can be manually set with the f3basepop argument: qpgraph(f2_stats, graph, f3basepop = "chimp"). In the latest release (v2.0.2), find_graphs() evaluates all graphs in the same run using the same f3basepop, so graphs with identical topology should have more similar fits (different fits for the same graph and f-statistics are still not always exactly identical because different random starting parameters can lead to different results).

Graphs 1a and 1b were considered non-identical due to a bug in isomorphism_classes2(), which is now fixed.

uqrmaie1 / admixtools

Inconsistent bootstrap significance testing #65