Snakes Empirical Dataset

lutteropp commented 3 years ago

I have found an empirical snakes dataset mentioned in this dissertation, on page 56. Allen-Savietta compares her model for evolutionary rate estimation with the evolutionary rates on this dataset. She does not run PhyLiNC on it.

The dataset was sequenced in this paper, and people inferred a network for it in that paper.

What we know about the dataset:

From her thesis: "In 2017, Chen et al. aligned sequences for 23 species, resulting in 304 long loci, concatenated into 453 kilobase pairs. When summarized, the concatenated alignment contains 6737 unique site patterns."
From concatenating the per-gene MSAs from this Dryad repo into a partitioned MSA NetRAX can use: 304 genes, some sequences have gap characters in them, the genes have very few MSA patterns each.

In her thesis, Allen-Savietta shares shows this network with 2 reticulations, inferred using the ILS-aware SnaQ tool: Screenshot from 2021-08-08 14-49-18

But when I take a look at the paper her thesis cites when saying where the network comes from, I see a different network there: Screenshot from 2021-08-08 14-51-59

I cannot find networks in Extended Newick Format for any of these networks. Only these pictures.

lutteropp commented 3 years ago

Detailed RAxML-NG output on the high fraction of invariant sites per gene attached. The genes have between twenty-something and one-hundred-something MSA patterns. raxml_snakes_sites_report.txt

lutteropp commented 3 years ago

This is the partitioned MSA I created for use with NetRAX, using this and this quick little script I wrote: snakes_for_netrax.zip

lutteropp commented 3 years ago

I want to run NetRAX on this dataset, using the phobos lab cluster. I am doing different search variants:

Start from RAxML-NG best ML tree, with LikelihoodModel.BEST
Start from RAxML-NG best ML tree, with LikelihoodModel.AVERAGE
Start from all unique trees with 10 random and 10 parsimony, with LikelihoodModel.BEST
Start from all unique trees with 10 random and 10 parsimony, with LikelihoodModel.AVERAGE

lutteropp commented 3 years ago

TODOs listed by @celinescornavacca in the Slack channel:

[x] find out the kind of sequences they use
[x] write the newick for their network
[x] write the newick from the thesis
[x] infer your networks
[x] compare your networks with these 2 networks
[x] try to understand which network makes more sense between the one in 2) and the one in 3)

lutteropp commented 3 years ago

find out the kind of sequences they use

From this paper: "Anchored hybrid enrichment data were generated and aligned in Chen et al. (2017) following the procedures of Lemmon et al. (2012). We generated hundreds of long loci for 23 species of Lampropeltis and the outgroup Cemophora coccinea (Chen et al. 2017; Supplementary data available on Dryad at https://datadryad.org//resource/doi:10.5061/dryad.4qs50."

lutteropp commented 3 years ago

NetRAX results on the phobos lab cluster for the snakes dataset:

Start from RAxML-NG best ML tree, with LikelihoodModel.AVERAGE: Total inference runtime: 844.0 seconds. Best inferred network has 1 reticulations, logl = -727145.753, bic = 1499276.693 snakes_single_average_result.txt
Start from all unique trees with 10 random and 10 parsimony, with LikelihoodModel.AVERAGE: 14 unique start tree topologies. Total inference runtime: 15615.0 seconds. Best inferred network has 1 reticulations, logl = -727209.282, bic = 1499403.751 snakes_multi_average_result.txt
Start from RAxML-NG best ML tree, with LikelihoodModel.BEST Total inference runtime: 193.0 seconds. Best inferred network has 1 reticulations, logl = -727228.5853, bic = 1499442.358 snakes_single_best_result.txt
Start from all unique trees with 10 random and 10 parsimony, with LikelihoodModel.BEST: 14 unique start tree topologies. Total inference runtime: 4182.0 seconds. Best inferred network has 2 reticulations, logl = -726893.6198, bic = 1498837.061 snakes_multi_best_result.txt

lutteropp commented 3 years ago

@celinescornavacca I figured out where this 2-reticulation network from the thesis comes from. It says it comes from the SNAQ inference from the paper, but the paper clearly says that SNAQ inferred either 1 or 6 reticulations, without being able to say which one is better according to its model. Then the paper goes on and decides that there is 1 reticulation in this snakes dataset. The paper then also goes on and does some stuff with neural networks which I don't understand.

However, here's some interesting quotes from the paper:

"While we strongly detect only a single reticulating event during the late Neogene/Early Pleistocene among taxa distributed in southwestern US and northwestern Mexico, it is possible that other events with a lower fraction of genomic introgression may have occurred (see other trees in Supplemental Fig. S3 available on Dryad)."
"The most accurate representation of the tree of life should depict any minor reticulation, however, practical considerations might only include those where the reticulation results in more substantial transfer of both genes and phenotype. "
" it is probable that smaller fractions of the genome can be detected as reticulating. For instance, here with a typical Lampropeltis genome size of 2.72 GBp (De Smet 1981), even 0.1% horizontal gene transfer would still result in the introgression of 270 Mbp."
"The lack of balance poses an essential yet unresolved problem for phylogenetic representation: how much of the ancestral genome must be involved in reticulation for that area of the tree to not be considered bifurcating?"

I found the 2-reticulation network in this figure, where we have networks spanning from 2 to 10 reticulations in there. FigS3_H2-H10.pdf

Combined with the remaining information from the paper, it appears to be just one out of many networks with different reticulation count inferred by SNAQ. It's also clear what's going on: How much gene flow/ likelihood improvement do you require for something to be considered a reticulation? It's a standard model complexity problem. We solve this problem in NetRAX by using BIC.

---> I hereby conclude that the 1-reticulation network makes the most sense. But keep in mind that these networks all just came from SNAQ (an ILS-aware network inference tool that uses Pseudolikeihood). It's still not a "true" network in any kind...

lutteropp commented 3 years ago

TL;DR: The 2-reticulation network from the PhyLiNC thesis is just one of many networks with different reticulation counts proposed by the SNAQ tool. It is not the network that "wins" the SNAQ inference.

lutteropp commented 3 years ago

I've got an idea: Maybe I can simply redo the SNAQ analysis on the phobos lab server (the same I used for running NetRAX on the snakes dataset), then we will get the NEWICK from the pictures and also we will then be able to compare NetRAX runtime with SNAQ runtime.

However, there is a problem with this idea: SNAQ was used in the snakes paper with some weird concordance factor table. The authors had a complicated multi-step pipeline calling multiple tools, they did not upload all the data, and thus I cannot properly reproduce their results.

If we would run SNAQ from gene trees inferred by RAxML-NG instead, we would likely end up with yet another network...

lutteropp commented 3 years ago

Hand-written Extended NEWICK files for the snakes network from the paper (1 reticulation) and from the dissertation (2 reticulations): snakes_network_from_paper.txt snakes_network_from_dissertation.txt

lutteropp commented 3 years ago

Here are just the network files for the snakes dataset: snakes_network_from_paper.txt snakes_network_from_dissertation.txt snakes_multi_average_inferred_network.txt snakes_single_average_inferred_network.txt snakes_single_best_inferred_network.txt snakes_multi_best_inferred_network.txt

lutteropp commented 3 years ago

I don't believe that we should compute distance to the network from the dissertation, as it is just one of many intermediary SNAQ results.Thus, I will only report distances to the network from the paper.

lutteropp commented 3 years ago

Dendroscope pictures for all these networks:

from_paper:
from_dissertation:
single_average:
single_best:
multi_average:
multi_best:

lutteropp commented 3 years ago

Command line output when comparing BIC and topological distances. Turns out NetRAX found a better network in all cases, regarding BIC score. The relative unrooted softwired cluster distance to the network from the paper is near-zero. judge_output_multi_average.txt judge_output_single_average.txt judge_output_multi_best.txt judge_output_single_best.txt

lutteropp commented 3 years ago

Done. I added the results table and evaluation to the paper draft, with more detailed results table and network pictures in the supplement.

celinescornavacca commented 3 years ago

snakes.pdf The five networks in an image (I removed the dissertation one)

lutteropp commented 3 years ago

The snakes MSA and partitions snakes_msa.fasta.txt snakes_partitions.txt

lutteropp commented 3 years ago

Turns out the 2-reticulations network also scores better under LikelihoodModel.AVERAGE

sarah@gram-3:~/code-workspace/NetRAX/experiments/assemble_snakes$ /home/sarah/code-workspace/NetRAX/bin/netrax --msa snakes_network_files/snakes_msa.fasta --model snakes_network_files/snakes_partitions.txt --judge_only --start_network snakes_network_files/snakes_multi_best_inferred_network.txt --judge snakes_network_files/snakes_single_average_inferred_network.txt --average_displayed_tree_variant
optimizing model, reticulation probs, and branch lengths (slow mode)...
BIC score after model optimization: 1498686.027
BIC score after updating reticulation probs: 1498686.027
BIC score after branch length optimization: 1498570.568
improved bic: 1498570.568
BIC score after updating reticulation probs: 1498567.941
BIC score after model optimization: 1498565.853
BIC score after updating reticulation probs: 1498565.853
BIC score after branch length optimization: 1498548.965
improved bic: 1498548.965
BIC score after updating reticulation probs: 1498548.965
optimizing model, reticulation probs, and branch lengths (slow mode)...
BIC score after model optimization: 1499278.756
BIC score after updating reticulation probs: 1499278.756
BIC score after branch length optimization: 1499278.754
improved bic: 1499278.754
BIC score after updating reticulation probs: 1499278.754
BIC score after branch length optimization: 1499278.754
BIC score after model optimization: 1499276.908
BIC score after updating reticulation probs: 1499276.908
BIC score after branch length optimization: 1499276.906
improved bic: 1499276.906
BIC score after updating reticulation probs: 1499276.906
BIC score after branch length optimization: 1499276.906

Evaluation of inference results:
logl_inferred: -726749.5719
logl_true: -727145.8595
bic_inferred: 1498548.965
bic_true: 1499276.906
Inferred a better BIC.
Relative BIC difference (>0 means better): 0.0004855282632
n_reticulations inferred: 2
n_reticulations true: 1
Inferred more reticulations.
Unrooted softwired network distance: 0.275862069
Unrooted hardwired network distance: 0.4074074074
Unrooted displayed trees distance: 1
Rooted softwired network distance: 0.5945945946
Rooted hardwired network distance: 0.625
Rooted displayed trees distance: 1
Rooted tripartition distance: 0.7058823529
Rooted path multiplicity distance: 0.3965517241
Rooted nested labels distance: 0.7368421053

Total runtime: 82 seconds.

lutteropp / NetRAX

Snakes Empirical Dataset #82