Open lutteropp opened 3 years ago
Detailed RAxML-NG output on the high fraction of invariant sites per gene attached. The genes have between twenty-something and one-hundred-something MSA patterns. raxml_snakes_sites_report.txt
This is the partitioned MSA I created for use with NetRAX, using this and this quick little script I wrote: snakes_for_netrax.zip
I want to run NetRAX on this dataset, using the phobos lab cluster. I am doing different search variants:
TODOs listed by @celinescornavacca in the Slack channel:
find out the kind of sequences they use
From this paper: "Anchored hybrid enrichment data were generated and aligned in Chen et al. (2017) following the procedures of Lemmon et al. (2012). We generated hundreds of long loci for 23 species of Lampropeltis and the outgroup Cemophora coccinea (Chen et al. 2017; Supplementary data available on Dryad at https://datadryad.org//resource/doi:10.5061/dryad.4qs50)."
NetRAX results on the phobos lab cluster for the snakes dataset:
Start from RAxML-NG best ML tree, with LikelihoodModel.AVERAGE: Total inference runtime: 844.0 seconds. Best inferred network has 1 reticulations, logl = -727145.753, bic = 1499276.693 snakes_single_average_result.txt
Start from all unique trees with 10 random and 10 parsimony, with LikelihoodModel.AVERAGE: 14 unique start tree topologies. Total inference runtime: 15615.0 seconds. Best inferred network has 1 reticulations, logl = -727209.282, bic = 1499403.751 snakes_multi_average_result.txt
Start from RAxML-NG best ML tree, with LikelihoodModel.BEST: Total inference runtime: 193.0 seconds. Best inferred network has 1 reticulations, logl = -727228.5853, bic = 1499442.358 snakes_single_best_result.txt
Start from all unique trees with 10 random and 10 parsimony, with LikelihoodModel.BEST: 14 unique start tree topologies. Total inference runtime: 4182.0 seconds. Best inferred network has 2 reticulations, logl = -726893.6198, bic = 1498837.061 snakes_multi_best_result.txt
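To make the four results above easier to compare, here is a minimal Python sketch that recovers the complexity penalty implied by each reported logl/BIC pair. It assumes the textbook definition BIC = k*ln(n) - 2*logl; the thread does not state NetRAX's exact formula, so treat this as a consistency check on the reported numbers only. The run labels and the helper name `bic_penalty` are made up for illustration.

```python
# (label, reticulations, logl, bic) copied from the four run summaries above.
# The labels are illustrative shorthand, not NetRAX output.
runs = [
    ("single start, AVERAGE", 1, -727145.753, 1499276.693),
    ("multi start, AVERAGE", 1, -727209.282, 1499403.751),
    ("single start, BEST", 1, -727228.5853, 1499442.358),
    ("multi start, BEST", 2, -726893.6198, 1498837.061),
]


def bic_penalty(logl, bic):
    """Return the k*ln(n) term implied by BIC = k*ln(n) - 2*logl (assumed form)."""
    return bic + 2.0 * logl


# Penalty term per run; a larger value means more free parameters.
penalties = {label: bic_penalty(logl, bic) for label, _, logl, bic in runs}

# Lower BIC is better.
best = min(runs, key=lambda run: run[3])

for label, n_ret, logl, bic in runs:
    print(f"{label}: reticulations={n_ret}, bic={bic}, penalty={penalties[label]:.3f}")
print("lowest BIC:", best[0])
```

Under that assumed formula, the three 1-reticulation results share the same penalty, while the 2-reticulation result carries a slightly larger one and still reaches the lowest BIC of the four.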
@celinescornavacca I figured out where this 2-reticulation network from the thesis comes from. It says it comes from the SNAQ inference from the paper, but the paper clearly says that SNAQ inferred either 1 or 6 reticulations, without being able to say which one is better according to its model. Then the paper goes on and decides that there is 1 reticulation in this snakes dataset. The paper then also goes on and does some stuff with neural networks which I don't understand.
However, here's some interesting quotes from the paper:
I found the 2-reticulation network in this figure, where we have networks spanning from 2 to 10 reticulations in there. FigS3_H2-H10.pdf
Combined with the remaining information from the paper, it appears to be just one of many networks with different reticulation counts inferred by SNAQ. It's also clear what's going on: how much gene flow or likelihood improvement do you require before something counts as a reticulation? It's a standard model complexity problem. We solve this problem in NetRAX by using BIC.
---> I hereby conclude that the 1-reticulation network makes the most sense. But keep in mind that these networks all came from SNAQ (an ILS-aware network inference tool that uses pseudolikelihood). It's still not a "true" network in any sense...
TL;DR: The 2-reticulation network from the PhyLiNC thesis is just one of many networks with different reticulation counts proposed by the SNAQ tool. It is not the network that "wins" the SNAQ inference.
I've got an idea: maybe I can simply redo the SNAQ analysis on the phobos lab server (the same one I used for running NetRAX on the snakes dataset). Then we would get the NEWICK for the networks shown in the pictures, and we could also compare NetRAX runtime with SNAQ runtime.
However, there is a problem with this idea: SNAQ was used in the snakes paper with some weird concordance factor table. The authors had a complicated multi-step pipeline calling multiple tools, they did not upload all the data, and thus I cannot properly reproduce their results.
If we ran SNAQ on gene trees inferred by RAxML-NG instead, we would likely end up with yet another network...
Hand-written Extended NEWICK files for the snakes network from the paper (1 reticulation) and from the dissertation (2 reticulations): snakes_network_from_paper.txt snakes_network_from_dissertation.txt
Here are just the network files for the snakes dataset: snakes_network_from_paper.txt snakes_network_from_dissertation.txt snakes_multi_average_inferred_network.txt snakes_single_average_inferred_network.txt snakes_single_best_inferred_network.txt snakes_multi_best_inferred_network.txt
I don't believe that we should compute the distance to the network from the dissertation, as it is just one of many intermediate SNAQ results. Thus, I will only report distances to the network from the paper.
Dendroscope pictures for all these networks:
Command line output when comparing BIC and topological distances. Turns out NetRAX found a better network in all cases, regarding BIC score. The relative unrooted softwired cluster distance to the network from the paper is near-zero. judge_output_multi_average.txt judge_output_single_average.txt judge_output_multi_best.txt judge_output_single_best.txt
Done. I added the results table and evaluation to the paper draft, with more detailed results table and network pictures in the supplement.
snakes.pdf The five networks in an image (I removed the dissertation one)
The snakes MSA and partitions snakes_msa.fasta.txt snakes_partitions.txt
Turns out the 2-reticulations network also scores better under LikelihoodModel.AVERAGE
sarah@gram-3:~/code-workspace/NetRAX/experiments/assemble_snakes$ /home/sarah/code-workspace/NetRAX/bin/netrax --msa snakes_network_files/snakes_msa.fasta --model snakes_network_files/snakes_partitions.txt --judge_only --start_network snakes_network_files/snakes_multi_best_inferred_network.txt --judge snakes_network_files/snakes_single_average_inferred_network.txt --average_displayed_tree_variant
optimizing model, reticulation probs, and branch lengths (slow mode)...
BIC score after model optimization: 1498686.027
BIC score after updating reticulation probs: 1498686.027
BIC score after branch length optimization: 1498570.568
improved bic: 1498570.568
BIC score after updating reticulation probs: 1498567.941
BIC score after model optimization: 1498565.853
BIC score after updating reticulation probs: 1498565.853
BIC score after branch length optimization: 1498548.965
improved bic: 1498548.965
BIC score after updating reticulation probs: 1498548.965
optimizing model, reticulation probs, and branch lengths (slow mode)...
BIC score after model optimization: 1499278.756
BIC score after updating reticulation probs: 1499278.756
BIC score after branch length optimization: 1499278.754
improved bic: 1499278.754
BIC score after updating reticulation probs: 1499278.754
BIC score after branch length optimization: 1499278.754
BIC score after model optimization: 1499276.908
BIC score after updating reticulation probs: 1499276.908
BIC score after branch length optimization: 1499276.906
improved bic: 1499276.906
BIC score after updating reticulation probs: 1499276.906
BIC score after branch length optimization: 1499276.906
Evaluation of inference results:
logl_inferred: -726749.5719
logl_true: -727145.8595
bic_inferred: 1498548.965
bic_true: 1499276.906
Inferred a better BIC.
Relative BIC difference (>0 means better): 0.0004855282632
n_reticulations inferred: 2
n_reticulations true: 1
Inferred more reticulations.
Unrooted softwired network distance: 0.275862069
Unrooted hardwired network distance: 0.4074074074
Unrooted displayed trees distance: 1
Rooted softwired network distance: 0.5945945946
Rooted hardwired network distance: 0.625
Rooted displayed trees distance: 1
Rooted tripartition distance: 0.7058823529
Rooted path multiplicity distance: 0.3965517241
Rooted nested labels distance: 0.7368421053
Total runtime: 82 seconds.
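The "Relative BIC difference" line in the log above can be reproduced from the two final BIC values. The sketch below assumes the difference is normalized by the reference ("true") BIC. That normalization is inferred from the logged numbers, not taken from the NetRAX source, and the function name is hypothetical.

```python
def relative_bic_difference(bic_reference, bic_inferred):
    """Positive when the inferred network has the lower (better) BIC.

    Assumption: normalization by the reference BIC, chosen because it
    matches the value printed in the log up to rounding.
    """
    return (bic_reference - bic_inferred) / bic_reference


# Values copied from the log: bic_true and bic_inferred.
bic_true = 1499276.906
bic_inferred = 1498548.965

rel = relative_bic_difference(bic_true, bic_inferred)
print(f"relative BIC difference: {rel:.10f}")
```

The result agrees with the logged 0.0004855282632 up to the rounding of the printed BIC values. Note that in this judge run "true" only means the network passed via `--judge` (the single-start AVERAGE result), not a known ground truth.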
I have found an empirical snakes dataset mentioned in this dissertation, on page 56. Allen-Savietta compares her model for evolutionary rate estimation with the evolutionary rates on this dataset. She does not run PhyLiNC on it.
The dataset was sequenced in this paper, and people inferred a network for it in that paper.
What we know about the dataset:
In her thesis, Allen-Savietta shows this network with 2 reticulations, inferred using the ILS-aware SnaQ tool:
But when I take a look at the paper her thesis cites when saying where the network comes from, I see a different network there:
I cannot find networks in Extended Newick Format for any of these networks. Only these pictures.