lutteropp / NetRAX

Phylogenetic Network Inference without ILS
GNU General Public License v3.0

Snakes Empirical Dataset #82

Open lutteropp opened 3 years ago

lutteropp commented 3 years ago

I have found an empirical snakes dataset mentioned in this dissertation, on page 56. Allen-Savietta compares her model for evolutionary rate estimation with the evolutionary rates on this dataset. She does not run PhyLiNC on it.

The dataset was sequenced in this paper, and people inferred a network for it in that paper.

What we know about the dataset:

In her thesis, Allen-Savietta shows this network with 2 reticulations, inferred using the ILS-aware SNaQ tool: Screenshot from 2021-08-08 14-49-18

But when I look at the paper her thesis cites as the source of the network, I see a different network there: Screenshot from 2021-08-08 14-51-59

I cannot find Extended Newick files for either of these networks, only these pictures.

lutteropp commented 3 years ago

Detailed RAxML-NG output on the high fraction of invariant sites per gene attached. The genes have between twenty-something and one-hundred-something MSA patterns. raxml_snakes_sites_report.txt

lutteropp commented 3 years ago

This is the partitioned MSA I created for use with NetRAX, using this and this quick little script I wrote: snakes_for_netrax.zip

lutteropp commented 3 years ago

I want to run NetRAX on this dataset, using the phobos lab cluster. I am doing different search variants:

lutteropp commented 3 years ago

TODOs listed by @celinescornavacca in the Slack channel:

lutteropp commented 3 years ago

find out the kind of sequences they use

From this paper: "Anchored hybrid enrichment data were generated and aligned in Chen et al. (2017) following the procedures of Lemmon et al. (2012). We generated hundreds of long loci for 23 species of Lampropeltis and the outgroup Cemophora coccinea (Chen et al. 2017; Supplementary data available on Dryad at https://datadryad.org//resource/doi:10.5061/dryad.4qs50."

lutteropp commented 3 years ago

NetRAX results on the phobos lab cluster for the snakes dataset:

lutteropp commented 3 years ago

@celinescornavacca I figured out where the 2-reticulation network from the thesis comes from. The thesis says it comes from the SNaQ inference in the paper, but the paper clearly states that SNaQ inferred either 1 or 6 reticulations, without being able to say which one is better according to its model. The paper then settles on 1 reticulation for this snakes dataset. It also goes on to do some analysis with neural networks, which I don't understand.

However, here are some interesting quotes from the paper:

I found the 2-reticulation network in this figure, which shows networks with 2 to 10 reticulations: FigS3_H2-H10.pdf

Combined with the remaining information from the paper, it appears to be just one of many networks with different reticulation counts inferred by SNaQ. It's also clear what's going on: how much gene flow or likelihood improvement do you require before something counts as a reticulation? It's a standard model complexity problem. We solve this problem in NetRAX by using BIC.
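
As a rough illustration of that trade-off, here is a minimal sketch using the textbook BIC definition (k * ln(n) - 2 * logL, lower is better). How NetRAX counts free parameters and the sample size is not stated in this thread, so `n_params`, `sample_size`, and all numbers below are hypothetical placeholders, not values from the snakes dataset.

```python
import math


def bic(logl: float, n_params: int, sample_size: int) -> float:
    """Textbook BIC; lower is better. Parameter counting is an assumption here."""
    return n_params * math.log(sample_size) - 2.0 * logl


def min_logl_gain(extra_params: int, sample_size: int) -> float:
    """Log-likelihood gain needed before the extra parameters lower the BIC."""
    return 0.5 * extra_params * math.log(sample_size)


# Hypothetical numbers, for illustration only (not from the snakes dataset).
n = 10_000
observed_gain = 10.0
base = bic(logl=-5000.0, n_params=50, sample_size=n)
richer = bic(logl=-5000.0 + observed_gain, n_params=53, sample_size=n)
threshold = min_logl_gain(extra_params=3, sample_size=n)

print(f"logl gain needed: {threshold:.2f}, observed: {observed_gain:.2f}")
print("keep extra reticulation" if richer < base else "reject extra reticulation")
```

Under this definition, an additional reticulation only pays off if the likelihood improves by more than half of the added penalty, which is the sense in which BIC answers the "how much improvement is enough" question.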

---> I hereby conclude that the 1-reticulation network makes the most sense. But keep in mind that these networks all came from SNaQ (an ILS-aware network inference tool that uses pseudolikelihood). It's still not a "true" network of any kind...

lutteropp commented 3 years ago

TL;DR: The 2-reticulation network from the PhyLiNC thesis is just one of many networks with different reticulation counts proposed by the SNaQ tool. It is not the network that "wins" the SNaQ inference.

lutteropp commented 3 years ago

I've got an idea: maybe I can simply redo the SNaQ analysis on the phobos lab server (the same one I used for running NetRAX on the snakes dataset). Then we would get the Newick strings for the networks in the pictures, and we could also compare NetRAX runtime with SNaQ runtime.

However, there is a problem with this idea: SNaQ was used in the snakes paper with some weird concordance factor table. The authors had a complicated multi-step pipeline calling multiple tools, they did not upload all the data, and thus I cannot properly reproduce their results.

If we ran SNaQ on gene trees inferred by RAxML-NG instead, we would likely end up with yet another network...

lutteropp commented 3 years ago

Hand-written Extended NEWICK files for the snakes network from the paper (1 reticulation) and from the dissertation (2 reticulations): snakes_network_from_paper.txt snakes_network_from_dissertation.txt

lutteropp commented 3 years ago

Here are just the network files for the snakes dataset: snakes_network_from_paper.txt snakes_network_from_dissertation.txt snakes_multi_average_inferred_network.txt snakes_single_average_inferred_network.txt snakes_single_best_inferred_network.txt snakes_multi_best_inferred_network.txt

lutteropp commented 3 years ago

I don't believe that we should compute the distance to the network from the dissertation, as it is just one of many intermediate SNaQ results. Thus, I will only report distances to the network from the paper.

lutteropp commented 3 years ago

Dendroscope pictures for all these networks:

lutteropp commented 3 years ago

Command line output when comparing BIC and topological distances. Turns out NetRAX found a better network in all cases in terms of BIC score. The relative unrooted softwired cluster distance to the network from the paper is near-zero. judge_output_multi_average.txt judge_output_single_average.txt judge_output_multi_best.txt judge_output_single_best.txt

lutteropp commented 3 years ago

Done. I added the results table and evaluation to the paper draft, with a more detailed results table and network pictures in the supplement.

celinescornavacca commented 3 years ago

snakes.pdf The five networks in an image (I removed the dissertation one)

lutteropp commented 3 years ago

The snakes MSA and partitions: snakes_msa.fasta.txt snakes_partitions.txt

lutteropp commented 3 years ago

Turns out the 2-reticulation network also scores better under LikelihoodModel.AVERAGE:

sarah@gram-3:~/code-workspace/NetRAX/experiments/assemble_snakes$ /home/sarah/code-workspace/NetRAX/bin/netrax --msa snakes_network_files/snakes_msa.fasta --model snakes_network_files/snakes_partitions.txt --judge_only --start_network snakes_network_files/snakes_multi_best_inferred_network.txt --judge snakes_network_files/snakes_single_average_inferred_network.txt --average_displayed_tree_variant
optimizing model, reticulation probs, and branch lengths (slow mode)...
BIC score after model optimization: 1498686.027
BIC score after updating reticulation probs: 1498686.027
BIC score after branch length optimization: 1498570.568
improved bic: 1498570.568
BIC score after updating reticulation probs: 1498567.941
BIC score after model optimization: 1498565.853
BIC score after updating reticulation probs: 1498565.853
BIC score after branch length optimization: 1498548.965
improved bic: 1498548.965
BIC score after updating reticulation probs: 1498548.965
optimizing model, reticulation probs, and branch lengths (slow mode)...
BIC score after model optimization: 1499278.756
BIC score after updating reticulation probs: 1499278.756
BIC score after branch length optimization: 1499278.754
improved bic: 1499278.754
BIC score after updating reticulation probs: 1499278.754
BIC score after branch length optimization: 1499278.754
BIC score after model optimization: 1499276.908
BIC score after updating reticulation probs: 1499276.908
BIC score after branch length optimization: 1499276.906
improved bic: 1499276.906
BIC score after updating reticulation probs: 1499276.906
BIC score after branch length optimization: 1499276.906

Evaluation of inference results:
logl_inferred: -726749.5719
logl_true: -727145.8595
bic_inferred: 1498548.965
bic_true: 1499276.906
Inferred a better BIC.
Relative BIC difference (>0 means better): 0.0004855282632
n_reticulations inferred: 2
n_reticulations true: 1
Inferred more reticulations.
Unrooted softwired network distance: 0.275862069
Unrooted hardwired network distance: 0.4074074074
Unrooted displayed trees distance: 1
Rooted softwired network distance: 0.5945945946
Rooted hardwired network distance: 0.625
Rooted displayed trees distance: 1
Rooted tripartition distance: 0.7058823529
Rooted path multiplicity distance: 0.3965517241
Rooted nested labels distance: 0.7368421053

Total runtime: 82 seconds.
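
The summary numbers in the output above can be cross-checked with a few lines of Python. This is only a sanity-check sketch: the formula for the relative BIC difference, (bic_true - bic_inferred) / bic_true, is inferred from the printed values rather than taken from the NetRAX source, and the penalty terms assume the textbook definition BIC = penalty - 2 * logL.

```python
# Values copied from the judge output above.
logl_inferred = -726749.5719
logl_true = -727145.8595
bic_inferred = 1498548.965
bic_true = 1499276.906

# Assumed formula: positive means the inferred network has the lower (better) BIC.
relative_bic_diff = (bic_true - bic_inferred) / bic_true

# Assuming BIC = penalty - 2 * logL, recover the complexity penalty of each network.
penalty_inferred = bic_inferred + 2.0 * logl_inferred
penalty_true = bic_true + 2.0 * logl_true
extra_penalty = penalty_inferred - penalty_true

print(f"relative BIC difference: {relative_bic_diff:.10f}")
print(f"penalty (2 reticulations): {penalty_inferred:.3f}")
print(f"penalty (1 reticulation):  {penalty_true:.3f}")
print(f"extra penalty for the second reticulation: {extra_penalty:.3f}")
```

Under these assumptions the second reticulation costs roughly 64.6 BIC units in penalty, while twice the log-likelihood gain is about 792.6, so the 2-reticulation network still ends up with the lower BIC.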