uqrmaie1 / admixtools

https://uqrmaie1.github.io/admixtools
71 stars 14 forks source link

running find_graphs with the initgraph parameter #43

Open fangdm opened 1 year ago

fangdm commented 1 year ago

Hi, first of all, many thanks for developing and maintaining this software. I have been using the find_graphs command in AdmixTools2 for analyzing population data. However, I have noticed significant score differences between repeated runs of find_graphs, and it appears that some results with lower scores align better with my expectations. I recently discovered the initgraph parameter, which I would like to use when adding additional populations to the previous best graph result. However, I encountered difficulties in saving the previous graph as a graph file. Could you please provide guidance on the command line for accomplishing this task?

Thanks.

uqrmaie1 commented 1 year ago

many thanks for developing and maintaining this software

Happy to take credit for developing it

I have noticed significant score differences between repeated runs of find_graphs

To get identical results for repeated runs of find_graphs, you can run set.seed() just before find_graphs()

it appears that some results with lower scores align better with my expectations

Lower scores correspond to graphs with better fits (the scores represent log likelihoods and should be negative, but the minus sign is omitted, so lower is better)

I recently discovered the initgraph parameter

This is how the initgraph argument can be used:

run1 = find_graphs(example_f2_blocks)
bestgraph = run1 %>% slice_min(score) %>% pluck('graph', 1)
run2 = find_graphs(example_f2_blocks, initgraph = bestgraph)

The graph should be passed as an R igraph object, and the initgraph argument has to be named (the documentation has an outdated example where initgraph is a positional argument).

Functions for reading/writing graphs from/to disk are described here.

... align better with my expectations ... which I would like to use when adding additional populations to the previous best graph result

When graph fitting depends a lot on prior expectations, and when it is used in this kind of supervised, stepwise manner, it's very easy to overfit, and it becomes hard to tell whether any conclusions can be drawn from the fitting models. I realize that not constraining the models or using prior expectations tends to lead to useless results, but often this just means that there is not enough information in the data to draw any strong conclusions.

fangdm commented 1 year ago

Thank you very much.

I had a question: based on the following command line, we did add some new groups in add_data, why don't the newly added groups in add_data appear in the new graph? Should the individuals/groups in the initgraph object (referred to as bestgraph) had to be consistent with add_data?

run2 = find_graphs(add_data, initgraph = bestgraph)

In our project, we had a lot of groups with complex relations. To address this, we propose a gradual augmentation of the previous bestgraph by incorporating new groups, and determined the final graph, whether this approach is feasible?

Thanks again.

uqrmaie1 commented 1 year ago

Yes, the groups in initgraph should have the same groups as the groups in add_data. I think it's a good idea to let find_graphs() add additional groups to the initgraph model, but it's not set up in that way, and I probably won't have time to make that change.

What you could do instead is to add a new group to all possible positions in bestgraph using graph_addleaf() and evaluate all resulting graphs:

bestgraph %>%
  graph_addleaf('newgroup') %>%
  rowwise %>%
  mutate(score = qpgraph(add_data, graph)$score) %>%
  ungroup

To address this, we propose a gradual augmentation of the previous bestgraph by incorporating new groups, and determined the final graph, whether this approach is feasible?

It's possible in principle, but there are things to look out for:

  1. The order in which the groups are added can have a big impact on the final best topologies, so it might be a good idea to repeat the analysis while changing the order in which the groups are added.

  2. It's very easy to overfit. One simple way to guard against it is to only use even chromosomes for the whole analysis. Then at the very end, repeat the analysis with only odd chromosomes. If the results for even and odd chromosomes are different, it's probably due to overfitting. This is overly conservative because it only uses half the data each time. On the other hand, if the results are the same or very similar for even and odd chromosomes, it's strong evidence that the resulting models are not overfitted.

In my experience, it is rarely possible to fit models with more than 6 populations and more than 2 admixture events without some degree of overfitting.