lutteropp / NetRAX

Phylogenetic Network Inference without ILS
GNU General Public License v3.0

Simulation protocol #3

Open celinescornavacca opened 3 years ago

celinescornavacca commented 3 years ago

We are going to simulate our networks under the birth-hybridization process of Zhang et al. 2018. In this process, we have a speciation rate λ (lambda), a hybridisation rate ν (nu), and a time of origin t_0 from which we run the process.

From their paper: "The simulator starts from the time of origin (t_0) with one species. A species splits into two (speciation) with rate λ, and two species merge into one (hybridization) with rate ν. At the moment of k branches, the total rate of change is r_tot = k·λ + C(k,2)·ν. We generate a waiting time ~ Exp(r_tot) and a random variable u ~ U(0,1). If u < k·λ / r_tot, we randomly select a branch to split; otherwise, we randomly select two branches to join and generate an inheritance probability γ ~ U(0,1). The simulator stops at the present (time 0)."
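
For concreteness, here is a minimal Python sketch of this process (illustrative only; the actual simulator is the attached RandomNetwork.txt script, which may differ in the details):

import random

def simulate_birth_hybridization(lam, nu, t0):
    # Minimal sketch of the birth-hybridization process described above.
    # Time runs from the origin t0 down to the present (time 0).
    branches = [0]          # start with a single species
    next_id = 1
    events = []
    t = t0
    while True:
        k = len(branches)
        r_tot = k * lam + (k * (k - 1) / 2) * nu   # k*lambda + C(k,2)*nu
        t -= random.expovariate(r_tot)             # waiting time ~ Exp(r_tot)
        if t <= 0:
            break                                  # stop at the present
        if random.random() < k * lam / r_tot:
            # speciation: a randomly selected branch splits into two
            b = random.choice(branches)
            branches.remove(b)
            branches += [next_id, next_id + 1]
            events.append(("speciation", t, b))
            next_id += 2
        else:
            # hybridization: two randomly selected branches merge into one,
            # with an inheritance probability gamma ~ U(0,1)
            b1, b2 = random.sample(branches, 2)
            branches.remove(b1)
            branches.remove(b2)
            branches.append(next_id)
            events.append(("hybridization", t, (b1, b2), random.random()))
            next_id += 1
    return events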

Zhang et al. - 2018 - Bayesian Inference of Species Networks from Multil.pdf

No transfers for now, because branches with no length cannot be recovered.

Given such a generated network, we extract a random tree using the inheritance probabilities at the reticulation nodes: for each reticulation node x, we draw a number y ~ U(0,1) and choose the first parent p_x if y < the inheritance probability of the edge (p_x, x). If we do this, we obtain a tree, but not a binary one. We then clean the tree of dead nodes and of indegree-1 outdegree-1 nodes, as Sarah explained in her document, and we obtain a binary tree. On this binary tree, we simulate N sequences with Seq-Gen-1.3.4 (currently under the model -mHKY -t3.0 -f0.3,0.2,0.2,0.3; this can be changed). We do this K times, obtaining an alignment of length K*N in which each group of N sites has its own tree.
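
A rough sketch of the tree-extraction step, assuming the network is stored as a parent map with inheritance probabilities attached to the reticulation edges (the names here are illustrative, not the ones used in the actual scripts):

import random

def extract_random_displayed_tree(parents, inheritance_prob):
    # parents[x]: list of parents of node x (two parents for a reticulation node).
    # inheritance_prob[(p, x)]: inheritance probability of the edge (p, x).
    # Returns, for every non-root node, the single parent kept in the sampled tree.
    chosen_parent = {}
    for x, plist in parents.items():
        if len(plist) == 2:                          # reticulation node
            p1, p2 = plist
            y = random.random()                      # y ~ U(0,1)
            chosen_parent[x] = p1 if y < inheritance_prob[(p1, x)] else p2
        elif len(plist) == 1:                        # tree node
            chosen_parent[x] = plist[0]
    return chosen_parent

# The resulting tree still has to be cleaned: remove dead nodes (nodes without any
# surviving leaf below them) and suppress indegree-1 outdegree-1 nodes, so that the
# final tree handed to Seq-Gen is binary.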

celinescornavacca commented 3 years ago

In their paper, they use these parameters in the simulations:

  • λ − ν ~ Exp(0.1), with mean 10
  • τ_0 ~ Exp(10), with mean 0.1
  • ν / λ ~ Beta(α = 1, β = 1) (the same as a uniform distribution between 0 and 1)
  • γ ~ Uniform(0, 1)
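
For illustration, one parameter set could be drawn from the priors listed above roughly like this (a sketch only):

import random

diff = random.expovariate(0.1)        # lambda - nu ~ Exp(0.1), mean 10
tau0 = random.expovariate(10)         # tau_0 ~ Exp(10), mean 0.1
ratio = random.betavariate(1, 1)      # nu / lambda ~ Beta(1, 1), i.e. U(0,1)
lam = diff / (1.0 - ratio)            # from lambda - nu = diff and nu = ratio * lambda
nu = ratio * lam
gamma = random.uniform(0, 1)          # inheritance probability gamma ~ U(0,1), drawn per reticulation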

As I said in a Slack message, this process gives Yule-type networks, which behave like Yule-type trees; see the example below for trees:

[example image: Yule-type trees]

Once they start having a certain number of nodes, things get crazy and they start speciating and hybridising a lot. It is not the focus of our paper to correctly simulate networks (we are not in a Bayesian framework, we do not need priors), and this is the best we can do for now (for the next paper, we will do better; working on it). So we can simply throw away the networks we do not like, as done in lines 143-147 of this simulator, which only simulates networks and not trees or sequences.

RandomNetwork.txt (in txt because py is not supported)

Sarah, you will find the parameters they use in the paper on lines 8-17; you can copy and paste them into your simulator, add a line similar to line 143, and you should be good.

celinescornavacca commented 3 years ago

I did play with the simulator for a while, and I think I found a way to set the parameters that gives, on average, networks with enough leaves and not too many reticulations.

I also suggest using the ratio no_of_hybrids/no_of_leaves to make our choices (line 148). Note that the choice of parameters has to be done inside the "while simulateNetwork:" loop (lines 46-48), not before, as done in my previous file.

RandomNetwork.txt

lutteropp commented 3 years ago

Great, it works! :+1: It is very slow, but it works! (The slightly adapted version is in https://github.com/lutteropp/NetRAX/blob/master/simulator/src/network/logic/celine_simulator.py)

Due to runtime issues, it is not realistic to exactly pre-specify the wanted number of taxa and reticulations with this simulator. Thus, we are instead categorizing the simulated datasets (accepting every network with at least 30 taxa and up to 10% reticulations) into low/medium/large buckets. For generating a set of acceptable datasets, we use https://github.com/lutteropp/NetRAX/blob/master/simulator/src/network/logic/simulate_celine_endless.py
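
In other words, the acceptance-and-bucketing step boils down to something like this (a sketch; the exact bucket boundaries used in simulate_celine_endless.py may differ):

def accept_network(n_taxa, n_reticulations):
    # accept every network with at least 30 taxa and up to 10% reticulations
    return n_taxa >= 30 and n_reticulations <= 0.1 * n_taxa

def bucket(n_taxa, n_reticulations):
    # illustrative bucket boundaries in terms of #taxa/#reticulations
    if n_taxa < 50 and n_reticulations <= 2:
        return "low"
    if n_taxa < 100 and n_reticulations <= 5:
        return "medium"
    return "large"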

Our current simulation setup is:

  • categorize simulated networks into low/medium/high buckets (in terms of #taxa/#reticulations)
  • use 5 random starting points in NetRAX
  • number of trees to simulate: each displayed tree
  • MSA partition size: 500 or 1000 sites per tree
  • use both likelihood functions (average + best)

As a first step (to test the pipeline), we will also do the simulations on tiny networks (up to 10 taxa, up to 2 reticulations).

Slightly later, we will also run NetRAX with raxml-ng best tree (to check performance implications).

Much later (for the second paper):

  • 3 simulators (celine, sarah, empirical datasets)
  • identifiability issue: different networks can have the same displayed trees --> graph isomorphism on rooted DAGs for checking network topology

The entire simulation pipeline can currently be found here: https://github.com/lutteropp/NetRAX/tree/master/simulator/src/network/logic


lutteropp commented 3 years ago

There is one problem with our simulation setup from above - it ignores the displayed tree probabilities.

I implemented the following sampling types:

class SamplingType(Enum):
    STANDARD = 1  # randomly choose which tree to sample, then sample an equal number of sites for each sampled tree - this is the only mode that uses the n_trees or m parameter for sampling
    PERFECT_SAMPLING = 2  # sample each displayed tree, with as many sites as expected from the tree probability
    PERFECT_UNIFORM_SAMPLING = 3  # sample each displayed tree, with the same number of sites per tree (ignoring reticulation probabilities)
    SINGLE_SITE_SAMPLING = 4  # sample each site individually, taking the reticulation probabilities into account

What I wrote above would be PERFECT_UNIFORM_SAMPLING. I am running the small dataset simulations with this for now, but I expect NetRAX to not infer the correct reticulation probabilities in this setting. Instead, I believe PERFECT_SAMPLING is better. What are your thoughts on this?
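
To make the difference concrete, the site allocation per displayed tree in the two "perfect" modes is roughly this (a sketch using the SamplingType enum above, not the actual simulator code; tree_probs are the displayed-tree probabilities derived from the reticulation probabilities):

def sites_per_tree(sampling_type, tree_probs, msa_size):
    # Illustrative site allocation; rounding/remainder handling is simplified here.
    n_trees = len(tree_probs)
    if sampling_type == SamplingType.PERFECT_UNIFORM_SAMPLING:
        # every displayed tree gets the same number of sites
        return [msa_size // n_trees] * n_trees
    if sampling_type == SamplingType.PERFECT_SAMPLING:
        # every displayed tree gets a number of sites proportional to its probability
        return [round(p * msa_size) for p in tree_probs]
    raise ValueError("only the two 'perfect' modes are sketched here")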

stamatak commented 3 years ago

I'd tend to agree that perfect_sampling is better

alexis


celinescornavacca commented 3 years ago

Same here!

lutteropp commented 3 years ago

Okay, switched to PERFECT_SAMPLING as we all agree on that. Done in https://github.com/lutteropp/NetRAX/commit/addb906a4041f26dbba2e50be87fc2546e13d7b3.

(EDIT: Moved starting networks question to new issue)

stamatak commented 3 years ago

Well if that's the way it is, then let's add more starting networks ...

An additional interesting number here would be the difference between the LnL of the true network and of the inferred networks, to see how far off we are.

Alexis

On 25.11.20 17:23, Sarah Lutteropp wrote:

Okay, switching to PERFECT_SAMPLING as we all agree on that.

Second problem: 5 random starting networks are not enough, even if the "true" network is a tree. RAxML-NG uses by default 10 parsimony + 10 random starting trees. I tried NetRAX (with only 5 random start networks) on a small (10 taxa) tree dataset and it got stuck in a local optimum.

We currently have no parsimony networks in NetRAX, but we definitely need more than five random start networks with the current network search algorithm (I am adding the comparison to RAxML-NG and a start from the best RAxML tree right now).


lutteropp commented 3 years ago

@stamatak Already in there ;-) I copy-pasted the struct I use for the results, as this is the shortest code part where one can see what is currently evaluated

class Result:
    def __init__(self, dataset):
        self.dataset = dataset
        self.bic_true = 0
        self.logl_true = 0
        self.bic_inferred = 0
        self.logl_inferred = 0
        self.bic_inferred_with_raxml = 0
        self.logl_inferred_with_raxml = 0
        self.bic_raxml = 0
        self.logl_raxml = 0
        self.n_reticulations_inferred = 0
        self.n_reticulations_inferred_with_raxml = 0
        self.topological_distances = {}
        self.topological_distances_with_raxml = {}

(inferred_with_raxml means that RAxML best tree was used as single starting network)
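
For example, the LnL/BIC comparison asked for above then falls directly out of these fields (illustrative only):

def score_deltas(result):
    # How far off are we from the true network, in log-likelihood and BIC?
    return {
        "logl_delta": result.logl_inferred - result.logl_true,
        "bic_delta": result.bic_inferred - result.bic_true,
        "logl_delta_with_raxml": result.logl_inferred_with_raxml - result.logl_true,
        "bic_delta_with_raxml": result.bic_inferred_with_raxml - result.bic_true,
    }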

celinescornavacca commented 3 years ago

Looking good!

lutteropp commented 3 years ago

This is what we currently measure: https://github.com/lutteropp/NetRAX/issues/14

lutteropp commented 3 years ago

I just realized that we may need both PERFECT_SAMPLING and PERFECT_UNIFORM_SAMPLING in our experiments:

  • PERFECT_SAMPLING is better for comparing loglikelihood and BIC score.
  • PERFECT_UNIFORM_SAMPLING is better for comparing network topology, because the Dendroscope topological distances do not take reticulation probabilities into account.

stamatak commented 3 years ago

sounds good,


lutteropp commented 3 years ago

There are three different settings on how to handle branch lengths across partitions:

  • PLLMOD_COMMON_BRLEN_UNLINKED: Each partition has its own independent branch lengths and reticulation probs.
  • PLLMOD_COMMON_BRLEN_LINKED: All partitions share the same branch lengths and reticulation probs.
  • PLLMOD_COMMON_BRLEN_SCALED: All partitions share the same branch lengths and reticulation probs. In addition, each partition has its own scaling factor by which it is allowed to multiply the branch lengths.

It appears to me that we need to include all three possibilities in our experiments, as soon as we have more than one partition in the MSA.
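
Conceptually, the three settings differ in how a per-partition branch length is derived from the stored parameters (a rough sketch, not NetRAX code):

def effective_brlen(mode, partition, edge, shared_brlens, per_partition_brlens, scalers):
    # Illustrative only: how an effective branch length could be looked up per mode.
    if mode == "LINKED":
        # one shared set of branch lengths (and reticulation probs) for all partitions
        return shared_brlens[edge]
    if mode == "SCALED":
        # shared branch lengths, multiplied by one free scaling factor per partition
        return scalers[partition] * shared_brlens[edge]
    if mode == "UNLINKED":
        # every partition has its own fully independent branch lengths
        return per_partition_brlens[partition][edge]
    raise ValueError("unknown mode")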

stamatak commented 3 years ago

Hm yes, we do, but maybe linked is not so easy to implement for a network (and one also already needs to be careful when implementing this for standard ML); the easiest is probably UNLINKED.


lutteropp commented 3 years ago

@stamatak umm... all 3 versions are fully implemented in NetRAX already

stamatak commented 3 years ago

fine with me ... just commented


lutteropp commented 3 years ago

should I switch the experiments to run with UNLINKED then? So far I used LINKED.

stamatak commented 3 years ago

I believe linked/scaled make more sense; with unlinked there's always the danger of over-parametrizing, as each partition adds a whole lot of additional free branch length parameters ...
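
Rough parameter-count arithmetic behind this concern, for the plain unrooted binary tree case (a network with reticulations has correspondingly more edges):

def extra_free_brlen_parameters(n_taxa, n_partitions, mode):
    # Rough count of additional free branch-length parameters compared to LINKED,
    # for an unrooted binary tree with 2n-3 branches.
    brlens_per_tree = 2 * n_taxa - 3
    if mode == "UNLINKED":
        return (n_partitions - 1) * brlens_per_tree   # a full extra set per extra partition
    if mode == "SCALED":
        return n_partitions - 1                       # one extra scaler per extra partition
    return 0                                          # LINKED

# e.g. 30 taxa, 10 partitions: UNLINKED adds 513 free parameters, SCALED adds 9.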


celinescornavacca commented 3 years ago

I would go with scaled, but in the experiments linked should be enough, since the data are linked (TODO for the second paper: use a scaled framework to simulate?).

PERFECT_UNIFORM_SAMPLING is better for scoring well when comparing network topology, because those Dendroscope topological distances do not take reticulation probabilities into account...

I am not sure that I would do that, because in any case we need a way to check the gammas when reconstructing the network in the PERFECT_SAMPLING setting.

lutteropp commented 3 years ago

Cool, then let's stick to scaled (and if we under-estimate reticulations, switch to linked later) and only PERFECT_SAMPLING first :)

It is easier (in terms of total runtime, number of plots to look at, and interpretation) to not try every possible combination of parameters one could think of in the initial experiments...

stamatak commented 3 years ago

sounds good; still, unlinked should also be tested, as we don't understand the effects of reticulations on branch length variance/differences yet


celinescornavacca commented 3 years ago

We can test it after the scaled one, @stamatak, to avoid the burden of testing them at the same time?

lutteropp commented 3 years ago

Yes, let's make a separate experiment that only looks at the effect of scaled vs. unlinked brlens :) EDIT: I created an issue for it: https://github.com/lutteropp/NetRAX/issues/28

lutteropp commented 3 years ago

(I can do the runs with tons of different parameters on the cluster, but my problem is interpreting, plotting, and summarizing all these results for so many different parameter combinations at once, while still seeing the connections there... it means making tons of plots)

lutteropp commented 3 years ago

(also, having to entirely re-run some broken experiment is more painful when it was on a lot of parameter combinations)

stamatak commented 3 years ago

I'd do it right away and would just push everything onto the cluster. As I don't have an intuition for how this will behave, it's better to just conduct all runs at once.
