uqrmaie1 / admixtools

https://uqrmaie1.github.io/admixtools

How to constrain find_graphs search? #33

Open dosshra opened 1 year ago

dosshra commented 1 year ago

Hello, I have a set of populations related to wheat; some are domesticated and some are not. Therefore, I am almost certain which populations should split early and which should be donors of admixture. However, when I run opt_results = find_graphs('./MANCH', initgraph = g, max_admix = 3, stop_gen = 100), using a graph (g) that I constructed manually based on prior knowledge (image: original_graph), the graph with the lowest score places the populations that are ancestral (according to other solid data) at the bottom of the graph (image: find_graph_result).

  1. Is there a way I can constrain find_graphs to fit prior knowledge?
  2. Do the results that I am getting now suggest that my data may have some bias?

Wishing you happy holidays, Hanan

uqrmaie1 commented 1 year ago
  1. There are a couple of arguments that can be passed to find_graphs() to constrain the search space to fit prior knowledge:

    • initgraph sets a starting model. This is not really a constraint, but it can help with getting to good models faster.
    • outpop specifies an outgroup population
    • admix_constraints constrains the degree of admixture for different populations
    • event_constraints constrains which splits or admixture events should precede which other events.
    The last two options are briefly described here, but they were not tested much and are somewhat experimental.
  2. I would recommend looking not only at the graph with the lowest score, but comparing multiple graphs with low scores from multiple independent runs of find_graphs(). If these graphs are very different from one another, it indicates that there isn't enough signal in the data to distinguish between these models. This is unfortunately not uncommon, especially when you avoid overfitting to any one set of SNPs (like all available SNPs). An alternative scenario is that the best-fitting model has a much better score than all alternative models. If that is the case, and you are confident that this model is less accurate than another model with a worse score, then there could be bias in the data. But in my experience the first scenario (not enough signal relative to the model complexity) is the more common problem.
    Another thing to keep in mind is that the plotting function doesn't scale the length of the edges by their weight, which can be misleading in that it makes similar models look very different. Internal edges with low weights indicate that the topology around these edges is not strongly supported by the data.
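    As a rough sketch of how the arguments above might be combined with several independent runs: the population names ('Outgroup', 'Wild1', 'Wild2', 'Dom1', 'Dom2') are hypothetical placeholders, './MANCH' is the data location from the question, and the score and edges columns are assumed to be present in the find_graphs() output. The constraint options are experimental, so treat this as a starting point rather than a tested recipe.

```r
library(admixtools)
library(tidyverse)

# Hypothetical population labels; replace with the labels in your own data.
admix_constraints = tribble(
  ~pop, ~min, ~max,
  'Wild1', 0, 0,    # assumed unadmixed
  'Dom1', 1, NA)    # at least one admixture event, no upper bound

# Assumed meaning: the Wild1/Wild2 split precedes the Dom1/Dom2 split.
event_constraints = tribble(
  ~earlier1, ~earlier2, ~later1, ~later2,
  'Wild1', 'Wild2', 'Dom1', 'Dom2')

# Five independent runs of the search with the same settings.
runs = map(1:5, ~ find_graphs('./MANCH',
                              outpop = 'Outgroup',
                              max_admix = 3,
                              stop_gen = 100,
                              admix_constraints = admix_constraints,
                              event_constraints = event_constraints))

# Keep the best-scoring graph of each run so they can be compared side by side.
best = bind_rows(runs, .id = 'run') %>%
  group_by(run) %>%
  slice_min(score, with_ties = FALSE) %>%
  ungroup()

best %>% select(run, score)

# Plot the best graph of the first run; edge lengths are not scaled by weight.
plot_graph(best$edges[[1]])
```

    If the best graphs from the different runs have similar scores but very different topologies, that points to the first scenario (not enough signal) rather than to bias in the data.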

rossibarra commented 1 year ago

the graph and admix constraints work as advertised as far as I can see, which is great! what would be wonderful -- and perhaps i'm just missing how to do this! -- is a way to constrain that an admix donor occurs after some split. so i want all possible nodes that contribute admixture to pop C to have an origin in the graph after pops A and B split. Is this possible?

uqrmaie1 commented 1 year ago

It wasn't possible so far, but to some extent it should be possible now, using an updated version of the satisfies_eventorder() function. This function tests the order of population split events. The order in which populations split is not unambiguous in every graph. The default behavior of satisfies_eventorder() uses one definition, and that definition isn't useful in the example you gave. I added another, stricter definition, which can be triggered by setting the optional type column in the constraint data frame to 2.

Here is an example of how you should be able to use it to generate graphs where pop C is admixed with a source that originates in the graph after pops A and B split:

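library(admixtools)
library(tibble)  # provides tribble()

# pop C must be admixed: at least one admixture event, no upper bound
constrain_admix = tribble(
  ~pop, ~min, ~max,
  'C', 1, NA)

# the source of admixture into C must originate after A and B split;
# type = 2 selects the stricter definition of event order
constrain_events = tribble(
  ~earlier1, ~earlier2, ~later1, ~later2, ~type,
  'A', 'B', 'C', NA, 2)

# random graph with 5 leaves and 1 admixture event that satisfies both constraints
graph = random_admixturegraph(5, 1, admix_constraints = constrain_admix, event_order = constrain_events)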

This works by creating random graphs and filtering out those for which satisfies_eventorder(graph, constrain_events) or satisfies_numadmix(graph, constrain_admix) evaluates to FALSE. find_graphs() uses these functions in a similar way.
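
The filtering step can also be written out by hand. The sketch below is only an illustration of that idea, not the internal implementation. It assumes that satisfies_eventorder() and satisfies_numadmix() are exported by the installed version of admixtools and return a single TRUE or FALSE, and that random_admixturegraph() accepts a vector of leaf names.

```r
library(admixtools)
library(tibble)

# Same constraints as in the example above, repeated so this block runs on its own.
constrain_admix = tribble(
  ~pop, ~min, ~max,
  'C', 1, NA)

constrain_events = tribble(
  ~earlier1, ~earlier2, ~later1, ~later2, ~type,
  'A', 'B', 'C', NA, 2)

# Generate unconstrained random graphs with explicit leaf names and one admixture event.
leaves = c('A', 'B', 'C', 'D', 'E')
candidates = replicate(50, random_admixturegraph(leaves, 1), simplify = FALSE)

# Keep only the graphs that pass both checks.
ok = vapply(candidates, function(gr) {
  isTRUE(satisfies_eventorder(gr, constrain_events)) &&
    isTRUE(satisfies_numadmix(gr, constrain_admix))
}, logical(1))

accepted = candidates[ok]
length(accepted)  # number of random graphs that satisfy both constraints
```

The fraction of accepted graphs gives a rough sense of how restrictive a set of constraints is; if almost nothing passes, the constrained search space may be too small for find_graphs() to explore well.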