lutteropp / NetRAX

Phylogenetic Network Inference without ILS
GNU General Public License v3.0

Comparison with PhyLiNC, PhyloDAG, SNAQ, PhyloNET MPL, and PhyloNET ML on simulated data #83

Open lutteropp opened 3 years ago

lutteropp commented 3 years ago

The complete PhyLiNC output for a simulated dataset with 10 taxa, 1 reticulation, and 2000 MSA sites, run on the PhD laptop. I set the maximum number of reticulations it should try to 2, and it turns out that PhyLiNC overshot and inferred a 2-reticulation network. Since they use the original NEPAL likelihood model with unlinked sites, this is expected; we had the same problem with that model back then. PhyLiNC also ran into some further issues and errors down the line. phylinc_output.txt

lutteropp commented 3 years ago

The simulated dataset, the RAxML-NG best ML tree, the PhyLiNC inferred network, and the networks inferred by several NetRAX variants: datasets_phylinc_exp_smaller.zip

lutteropp commented 3 years ago

PhyLiNC result on the PhD laptop, with max_reticulations set to 2, starting from the RAxML-NG best ML tree: Total inference runtime: 38365.49 seconds (roughly 10.7 hours). Inferred a network with 2 reticulations. Printed multiple error messages (ERROR found on PhyLiNC for run 5 seed 17293: │ RootMismatch: non-leaf node 22 had 0 children. │ Could be a hybrid whose parents' direction conflicts with the root. │ isChild1 and containRoot were updated for a subset of edges in the network only.)


NetRAX results on the PhD laptop for the simulated 10-taxon 1-reticulation dataset:

lutteropp commented 3 years ago

I am also including PhyloDAG in this comparison. Here the data to run PhyloDAG on the dataset: data_for_phylodag.zip

lutteropp commented 3 years ago

The PhyloDAG inference already finished. It took 3.308089 mins, ran only single-threaded, and inferred this network, with 1 reticulation and loglikelihood -17771.85: Screenshot from 2021-08-14 14-32-09

lutteropp commented 3 years ago

We need to also compare NetRAX and PhyloDAG on a larger dataset. Let's say 30 taxa, 3 reticulations. I am using the dataset from experiment D (the scrambling one) for it.

lutteropp commented 3 years ago

In this archive, we have:

data_for_phylodag_2.zip

lutteropp commented 3 years ago

I aborted the 30 taxa 3 reticulations run on PhyloDAG since it kept running for ages. Trying with a newly simulated 20 taxa 2 reticulations 4k MSA sites dataset now:

phylodag_data_20t2r.zip

lutteropp commented 3 years ago

Very interesting! PhyloDAG finished on the 20 taxa 2 reticulations dataset, and its result is really bad. Total runtime: 20.42073 mins

Inferred network picture: 20t2r_phylodag_network

lutteropp commented 3 years ago

NetRAX results on the PhD laptop for the simulated 20-taxon 2-reticulation dataset:

lutteropp commented 3 years ago

I retried PhyloDAG with its default parameters (previously, I had used the parameters stated in their example file). This time, I got:

lutteropp commented 3 years ago

RAxML-NG best tree ML inference runtime (starting from 10 random + 10 parsimony trees) on the PhD laptop was:

lutteropp commented 3 years ago

I hand-wrote the Extended NEWICK for the PhyloDAG network, for the 10 taxa 1 reticulation dataset, using its default parameters: phylodag_10t1r_inferred_network.txt

lutteropp commented 3 years ago

Re-running PhyloDAG with the same parameters gives me totally different networks every time.

lutteropp commented 3 years ago

I also started yet another PhyLiNC inference run on the simulated 10 taxa 1 reticulation dataset, this time telling it that the maximum number of reticulations to try is 1. It is still running; I expect it to take multiple hours, but less than a day, on the PhD laptop.

lutteropp commented 3 years ago

The PhyLiNC output with maximum number of reticulations set to 1, this time without any weird error messages: phylinc_output_maxret_1.txt

lutteropp commented 3 years ago

The 10 taxa, 1 reticulation dataset: 0_0_msa.txt 0_0_partitions.txt 0_0.raxml.bestTree.txt 0_0.raxml.startTree.unique.txt

Here are just the network files for the 10 taxa, 1 reticulation dataset: 0_0_true_network.txt 0_0_single_best_inferred_network.txt 0_0_single_average_inferred_network.txt 0_0_multi_best_inferred_network.txt 0_0_multi_average_inferred_network.txt 0_0_phylinc_inferred_network.txt 0_0_phylinc_default_inferred_network.txt phylodag_10t1r_inferred_network.txt

The 20 taxa, 2 reticulation dataset: 20t_2r_msa.txt 20t_2r_partitions.txt 20t_2r.raxml.bestTree.txt 20t_2r.raxml.startTree.unique.txt

Here are just the network files for the 20 taxa, 2 reticulations dataset: 20t_2r_true_network.txt 20t_2r_single_best_inferred_network.txt 20t_2r_single_average_inferred_network.txt 20t_2r_multi_best_inferred_network.txt 20t_2r_multi_average_inferred_network.txt (I refuse to hand-write the 14 reticulation network inferred by PhyloDAG)

lutteropp commented 3 years ago

I hate extra work, but it would be awesome if we also compared with SNAQ, PhyloNet ML, and PhyloNet PseudoML on our simulated data. Instead of an MSA, these tools require a set of gene trees. Since we have very few "genes" here (just 2^num_reticulations), I expect the tools to be pretty fast.

As a first step for these inferences, I am inferring the "gene trees" with RAxML-NG, using the PhD laptop.

lutteropp commented 3 years ago

First, the per-gene MSAs, built through variations of this very nice and useful command: awk '{if(/^>/)print $0; else print substr($0,1,1000)}' 20t_2r_msa.txt > 20t_2r_gene1_msa.txt

For the 10 taxa, 1 reticulation dataset: 10t_1r_gene2_msa.txt 10t_1r_gene1_msa.txt

For the 20 taxa, 2 reticulations dataset: 20t_2r_gene4_msa.txt 20t_2r_gene3_msa.txt 20t_2r_gene2_msa.txt 20t_2r_gene1_msa.txt
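For reference, the awk one-liner above can be generalized to produce all gene blocks in one go. The following is only a minimal Python sketch, assuming the "genes" are consecutive, equal-length column blocks of the MSA (1000 sites each in the command above) and that the MSA is in FASTA format. The function name `split_fasta_into_genes` is hypothetical and not part of NetRAX or any of the compared tools.

```python
def split_fasta_into_genes(fasta_text, gene_length):
    """Split a FASTA alignment into consecutive column blocks of gene_length sites.

    Returns a list of FASTA-formatted strings, one per block.
    Assumption: every sequence in the alignment has the same length.
    """
    records = []  # list of (header_line, full_sequence)
    header, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: store the previous one, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line, []
        else:
            # Sequence lines may be wrapped, so collect and join them.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))

    n_sites = len(records[0][1]) if records else 0
    genes = []
    for start in range(0, n_sites, gene_length):
        block = [f"{h}\n{seq[start:start + gene_length]}" for h, seq in records]
        genes.append("\n".join(block) + "\n")
    return genes
```

Unlike the awk command, this joins wrapped sequence lines before slicing, so it does not rely on each sequence being on a single line. Each returned string can then be written to its own per-gene MSA file.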

lutteropp commented 3 years ago

These are the "gene trees" inferred by RAxML-NG, and the logfiles:

For the 10 taxa, 1 reticulation dataset: 10t_1r_gene2_msa.txt.raxml.bestTree.txt 10t_1r_gene1_msa.txt.raxml.bestTree.txt

10t_1r_gene2_msa.txt.raxml.log.txt 10t_1r_gene1_msa.txt.raxml.log.txt

For the 20 taxa, 2 reticulations dataset: 20t_2r_gene4_msa.txt.raxml.bestTree.txt 20t_2r_gene3_msa.txt.raxml.bestTree.txt 20t_2r_gene2_msa.txt.raxml.bestTree.txt 20t_2r_gene1_msa.txt.raxml.bestTree.txt

20t_2r_gene4_msa.txt.raxml.log.txt 20t_2r_gene3_msa.txt.raxml.log.txt 20t_2r_gene2_msa.txt.raxml.log.txt 20t_2r_gene1_msa.txt.raxml.log.txt

Total RAxML inference runtimes for the "gene trees":

lutteropp commented 3 years ago

Apparently SNAQ requires a set of gene trees in 1 file, and 1 start tree. So here's the input data for SNAQ:
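The single gene tree file can be assembled from the per-gene RAxML-NG best trees along these lines. This is a hedged sketch, assuming that one Newick string per line is an acceptable layout for the gene tree file; the helper name `combine_gene_trees` and the output file name in the usage note are hypothetical.

```python
from pathlib import Path


def combine_gene_trees(tree_paths, out_path):
    """Concatenate single-tree Newick files into one file, one tree per line.

    Returns the number of trees written. Empty input files are skipped.
    """
    trees = []
    for path in tree_paths:
        newick = Path(path).read_text().strip()
        if newick:
            trees.append(newick)
    Path(out_path).write_text("\n".join(trees) + "\n")
    return len(trees)
```

For the 10 taxa, 1 reticulation dataset, this would be called with the two `*.raxml.bestTree.txt` files listed above and an output name of your choice, for example `combine_gene_trees(paths, "10t_1r_gene_trees.txt")`.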

lutteropp commented 3 years ago

Here are the SNAQ results for the 10 taxa 1 reticulation dataset.

lutteropp commented 3 years ago

Turns out both SNAQ and PhyLiNC overestimate the number of reticulations if I tell them to try for at most 2 reticulations.
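A quick way to check how many reticulations an inferred network has is to count the distinct hybrid node labels in its Extended Newick string. The snippet below is a minimal sketch, assuming the usual Extended Newick convention where each reticulation node appears twice under the same `#`-prefixed label (such as `#H1`), and assuming that taxon names contain no `#` character. The function name `count_reticulations` and the example networks are made up for illustration.

```python
import re


def count_reticulations(extended_newick):
    """Count distinct hybrid node labels (e.g. #H1, #H2) in an Extended Newick string."""
    # A hybrid label is '#', an optional type tag (letters), and an integer id.
    labels = set(re.findall(r"#([A-Za-z]*\d+)", extended_newick))
    return len(labels)


# Toy examples (not networks from this issue):
one_ret = "((A,(B)#H1:1.0::0.7),(#H1:1.0::0.3,C));"
two_ret = "((A,(B)#H1),((#H1,(C)#H2),(#H2,D)));"
print(count_reticulations(one_ret), count_reticulations(two_ret))
```

Comparing this count against the number of reticulations in the true network makes an overshoot like the one described above easy to spot across many output files.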

lutteropp commented 3 years ago

NEXUS Submission files for PhyloNET, for the 10 taxa 1 reticulation dataset: 10_1r_phylonet_submission_files.zip

lutteropp commented 3 years ago

PhyloNET MPL (Maximum Pseudolikelihood) results for the 10 taxa 1 reticulation dataset:

PhyloNET ML (Maximum Likelihood) results for the 10 taxa 1 reticulation dataset:

lutteropp commented 3 years ago

The judge results, for all networks on the 10 taxa 1 reticulation dataset we have so far: (with SNAQ, I had to manually fix the inferred 2-reticulation network that had one reticulation with probability 0/1) judge_phylonet_ml_maxret_1.txt judge_phylonet_mpl_maxret_2.txt judge_phylonet_mpl_maxret_1.txt judge_snaq_maxret_2.txt judge_snaq_maxret_1.txt judge_phylodag.txt judge_phylinc_maxret_2.txt judge_phylinc_maxret_1.txt judge_netrax_single_average.txt judge_netrax_multi_average.txt judge_netrax_multi_best.txt judge_netrax_single_best.txt

lutteropp commented 3 years ago

PhyloNET ML with 2 reticulations max on the 10 taxa 1 reticulation dataset finished the first of its 5 runs on the PhD laptop (it inferred a 2-reticulation network). That single run took 3 hours, even though it was already running in parallel with 4 threads! With 4 runs remaining, this inference will likely finish in about 12 hours from now.

lutteropp commented 3 years ago

No more progress on the PhyloNET ML run with 2 reticulations max. I cannot tell whether it got stuck in an endless loop, since it does not print any progress output to the command line.

lutteropp commented 3 years ago

Output and inferred network for PhyloNET MP

lutteropp commented 3 years ago

And the judge results for PhyloNET MP: judge_phylonet_mp_maxret_1.txt judge_phylonet_mp_maxret_2.txt

lutteropp commented 3 years ago

Output and inferred network for PhyloNET ML, with max_reticulations = 2: 10t_1r_phylonet_maxret_2_ml_output.txt 10t_1r_phylonet_maxret_2_ml_inferred_network.txt

lutteropp commented 3 years ago

The judge result for PhyloNET ML, with max_reticulations = 2: judge_phylonet_ml_maxret_2.txt

lutteropp commented 3 years ago

PhyloNET MP on the simulated 20 taxa 2 reticulations dataset, with max reticulations set to 2, nonetheless inferred only a single reticulation: 20_2r_phylonet_maxret_2_output.txt 20_2r_phylonet_maxret_2_inferred_network.txt

lutteropp commented 2 years ago

Complete judge output for the 20t_2r data: 20t_2r_judge_output.txt

TL;DR: