lutteropp / NetRAX

Phylogenetic Network Inference without ILS
GNU General Public License v3.0

Explanation of the current Experimental Results CSV Header #14

Open lutteropp opened 3 years ago

lutteropp commented 3 years ago

The current results CSV header consists of:

name | n_taxa | n_trees | n_reticulations | msa_size | sampling_type | simulation_type | likelihood_type | timeout | n_random_start_networks | n_parsimony_start_networks | start_from_raxml | celine_params | n_reticulations_inferred | bic_true | logl_true | bic_inferred | logl_inferred | bic_raxml | logl_raxml | rf_absolute_raxml | rf_relative_raxml | rf_absolute_inferred | rf_relative_inferred | near_zero_branches_raxml | hardwired_cluster_distance | softwired_cluster_distance | displayed_trees_distance | tripartition_distance | nested_labels_distance | path_multiplicity_distance | runtime_inference


- timeout: If no start network and no number of random/parsimony start networks is specified, a value larger than 0 means that NetRAX keeps searching from new random start networks until $(timeout) seconds have passed.
- n_random_start_networks: Number of random start trees for the NetRAX network search
- n_parsimony_start_networks: Number of parsimony start trees for the NetRAX network search
- start_from_raxml: If TRUE, run NetRAX search only from best ML tree inferred by raxml-ng. If FALSE, run NetRAX search from some random/parsimony start trees.
- celine_params: The parameters used by Celine's simulator, or empty if Celine's simulator was not used.
- n_reticulations_inferred: Number of reticulations in the network inferred by NetRAX.
- bic_true: BIC score of the simulated network (using the specified likelihood_type).
- logl_true: Network loglikelihood score of the simulated network (using the specified likelihood_type).
- bic_inferred: BIC score of the network inferred by NetRAX (using the specified likelihood_type).
- logl_inferred: Network loglikelihood score of the network inferred by NetRAX (using the specified likelihood_type).
- bic_raxml: BIC score of the maximum likelihood tree inferred by raxml-ng (using the specified likelihood_type).
- logl_raxml: Network loglikelihood score of the maximum likelihood tree inferred by raxml-ng (using the specified likelihood_type).
- rf_absolute_raxml: Absolute RF distance between the maximum likelihood tree inferred by raxml-ng and the simulated network, if the simulated network has zero reticulations. Otherwise, this value is -1.
- rf_relative_raxml: Relative RF distance between the maximum likelihood tree inferred by raxml-ng and the simulated network, if the simulated network has zero reticulations. Otherwise, this value is -1.
- rf_absolute_inferred: Absolute RF distance between the network inferred by NetRAX and the simulated network, if both the simulated network and the network inferred by NetRAX have zero reticulations. Otherwise, this value is -1.
- rf_relative_inferred: Relative RF distance between the network inferred by NetRAX and the simulated network, if both the simulated network and the network inferred by NetRAX have zero reticulations. Otherwise, this value is -1.
- near_zero_branches_raxml: Number of near-zero branches in the maximum likelihood tree inferred by raxml-ng.
- runtime_inference: Elapsed runtime in seconds for the network inference with NetRAX.
- hardwired_cluster_distance: Hardwired cluster distance between the simulated network and the network inferred by NetRAX (computed via Dendroscope).
- softwired_cluster_distance: Softwired cluster distance between the simulated network and the network inferred by NetRAX (computed via Dendroscope).
- displayed_trees_distance: Displayed trees distance between the simulated network and the network inferred by NetRAX (computed via Dendroscope).
- tripartition_distance: Tripartition distance between the simulated network and the network inferred by NetRAX (computed via Dendroscope).
- nested_labels_distance: Nested labels distance between the simulated network and the network inferred by NetRAX (computed via Dendroscope).
- path_multiplicity_distance: Path multiplicity distance between the simulated network and the network inferred by NetRAX (computed via Dendroscope).
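
When post-processing the results, the -1 placeholder in the RF columns is easy to mistake for a real distance. The following is a minimal sketch of how the file could be read so that -1 becomes a missing value. It assumes a comma-separated file; `load_results` and the two sample rows are made up for illustration and contain only a few of the columns.

```python
import csv
import io

# Columns that use -1 to signal "not applicable" (see the list above).
RF_COLUMNS = (
    "rf_absolute_raxml",
    "rf_relative_raxml",
    "rf_absolute_inferred",
    "rf_relative_inferred",
)


def load_results(handle):
    """Read result rows and map the -1 sentinel in the RF columns to None."""
    rows = []
    for row in csv.DictReader(handle):
        for col in RF_COLUMNS:
            value = float(row[col])
            row[col] = None if value == -1 else value
        rows.append(row)
    return rows


# Made-up sample rows (illustration only, not real results).
sample = io.StringIO(
    "name,n_reticulations,n_reticulations_inferred,"
    "rf_absolute_raxml,rf_relative_raxml,"
    "rf_absolute_inferred,rf_relative_inferred\n"
    "dataset_a,0,0,2,0.1,0,0.0\n"
    "dataset_b,1,1,-1,-1,-1,-1\n"
)
rows = load_results(sample)

# Only rows with a defined RF distance should enter averages or plots.
tree_rows = [r for r in rows if r["rf_absolute_inferred"] is not None]
```

Filtering on `None` instead of comparing against -1 keeps the sentinel from leaking into means or plots.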

**Do we need to report anything else?**

lutteropp commented 3 years ago

Do we also need to compute these Dendroscope distances to the simulated network for the maximum-likelihood tree inferred by raxml-ng? If so, how do we solve the rooting issue, since the tree inferred by raxml-ng is unrooted? One idea would be rooting at the longest branch...
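
As a rough sketch of the "root at the longest branch" idea: the function below takes an unrooted tree as a list of `(node_u, node_v, length)` edges and splits the longest edge with a new root node. The edge-list representation, the function name, the `ROOT` label, and placing the root at the middle of the branch are all assumptions for illustration; this is neither what raxml-ng nor what NetRAX does internally.

```python
def root_at_longest_branch(edges, root_name="ROOT"):
    """Insert a root node in the middle of the longest branch.

    edges: list of (node_u, node_v, length) tuples of an unrooted tree.
    Returns the new edge list, in which the longest edge is replaced by
    two edges leading from the new root to its former endpoints.
    """
    if not edges:
        raise ValueError("tree has no branches")
    longest = max(edges, key=lambda edge: edge[2])
    u, v, length = longest
    # Keep every edge except the one being split.
    rooted = [edge for edge in edges if edge is not longest]
    # Assumption: the root sits at the midpoint of the longest branch.
    rooted.append((root_name, u, length / 2.0))
    rooted.append((root_name, v, length / 2.0))
    return rooted


# Star tree on three taxa with inner node X (hypothetical branch lengths).
unrooted = [("X", "A", 0.1), ("X", "B", 0.2), ("X", "C", 0.9)]
rooted = root_at_longest_branch(unrooted)
```

With ties in branch length, `max` picks the first longest edge, so the rooting would depend on edge order.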

stamatak commented 3 years ago

Or use RootDigger? Or the rooting that yields the lowest score?

Alexis

On 30.11.20 02:06, Sarah Lutteropp wrote:

Do we also need these Dendroscope distances for the maximum-likelihood tree inferred by raxml-ng? If so, how to solve the rooting issue, since the tree inferred by raxml-ng is unrooted? One idea would be rooting at the longest branch...


-- Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology

www.exelixis-lab.org

stamatak commented 3 years ago

What about some of the key simulation parameters used for generating the respective dataset? This might help to better interpret and discuss the results.

alexis

On 30.11.20 01:58, Sarah Lutteropp wrote:

The current results CSV header consists of:

name | n_taxa | n_trees | n_reticulations | msa_size | sampling_type | simulation_type | likelihood_type | timeout | n_random_start_networks | n_parsimony_start_networks | start_from_raxml | n_reticulations_inferred | bic_true | logl_true | bic_inferred | logl_inferred | bic_raxml | logl_raxml | rf_absolute_raxml | rf_relative_raxml | rf_absolute_inferred | rf_relative_inferred | near_zero_branches_raxml | hardwired_cluster_distance | softwired_cluster_distance | displayed_trees_distance | tripartition_distance | nested_labels_distance | path_multiplicity_distance | runtime_inference

  • name: The name of the dataset.
  • n_taxa: Number of taxa in the simulated network.
  • n_trees: Number of displayed trees in the simulated network (it is 2^n_reticulations).
  • n_reticulations: Number of reticulations in the simulated network.
  • msa_size: Total MSA size (there might be up to n_trees sites more due to rounding issues, depending on the chosen sampling type).
  • sampling_type: The sampling type used. It is one of

      from enum import Enum

      class SamplingType(Enum):
          # Randomly choose which tree to sample, then sample an equal number
          # of sites for each sampled tree. This is the only mode that uses
          # the n_trees or m parameter for sampling.
          STANDARD = 1
          # Sample each displayed tree, with as many sites as expected from
          # the tree probability.
          PERFECT_SAMPLING = 2
          # Sample each displayed tree, with the same number of sites per
          # tree (ignoring reticulation probabilities).
          PERFECT_UNIFORM_SAMPLING = 3
          # Sample each site individually, with the reticulation
          # probabilities in mind.
          SINGLE_SITE_SAMPLING = 4

  • simulation_type: It is one of

      class SimulationType(Enum):
          CELINE = 1  # use Celine's network topology simulator
          SARAH = 2   # use Sarah's ad-hoc network topology generator

  • likelihood_type: It is one of

      class LikelihoodType(Enum):
          AVERAGE = 1  # use weighted average of displayed trees
          BEST = 2     # use best displayed tree


Do we need to report anything else?
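
To illustrate the n_trees and msa_size remarks above (n_trees = 2^n_reticulations, and up to n_trees extra sites from rounding), here is a small sketch for the PERFECT_SAMPLING case. It assumes that the reticulations are independent, that a displayed tree's probability is the product of the chosen reticulation probabilities, and that each tree's expected share of sites is rounded up. The rounding rule and both function names are assumptions, not taken from the simulation scripts.

```python
import itertools
import math


def displayed_tree_probs(reticulation_probs):
    """Probability of each displayed tree, assuming independent reticulations.

    reticulation_probs[i] is the probability of following the first parent
    at reticulation i; the second parent has probability 1 - that value.
    """
    probs = []
    choices = itertools.product((True, False), repeat=len(reticulation_probs))
    for choice in choices:
        p = 1.0
        for take_first, prob in zip(choice, reticulation_probs):
            p *= prob if take_first else 1.0 - prob
        probs.append(p)
    return probs


def perfect_sampling_sites(reticulation_probs, msa_size):
    """Sites per displayed tree; assumption: expected shares are rounded up."""
    tree_probs = displayed_tree_probs(reticulation_probs)
    return [math.ceil(p * msa_size) for p in tree_probs]


# Two reticulations give 2**2 = 4 displayed trees.
requested = 1001
sites = perfect_sampling_sites([0.3, 0.5], requested)
extra_sites = sum(sites) - requested
```

Because every tree's share is rounded up by less than one site, `extra_sites` stays below the number of displayed trees, which matches the "up to n_trees sites more" remark.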


lutteropp commented 3 years ago

I decided not to compute those Dendroscope topological distances for the best raxml-ng tree. After all, our goal is to infer networks, not to look at how much worse the raxml-ng tree topology scores as we add more reticulations. And since these Dendroscope scores depend on the rooting, they would only cause confusion in the tree case. I have added the RF distance for the cases where we care about trees. Raxml-ng is not a network inference tool, after all...

And if we do extra treatment for raxml-ng to tune those Dendroscope scores, then we'd need to do the same extra treatment for NetRAX to remain comparable.
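
For reference, a minimal sketch of how the absolute and relative RF distance between two unrooted trees can be computed from their non-trivial bipartitions. It assumes both trees are binary and on the same taxon set, and that the relative distance is the absolute one divided by 2(n - 3). Whether the evaluation scripts normalize in exactly this way is an assumption; the function names are made up for illustration.

```python
def canonical_split(side, taxa):
    """Represent a bipartition by the side that lacks the reference taxon."""
    side = frozenset(side)
    reference = min(taxa)
    return frozenset(taxa) - side if reference in side else side


def rf_distance(splits_a, splits_b, n_taxa):
    """Return (absolute, relative) RF distance for two sets of splits."""
    # Count splits that occur in exactly one of the two trees.
    absolute = len(splits_a ^ splits_b)
    # An unrooted binary tree has n - 3 non-trivial splits.
    max_rf = 2 * (n_taxa - 3)
    relative = absolute / max_rf if max_rf > 0 else 0.0
    return absolute, relative


taxa = {"A", "B", "C", "D", "E"}
# Tree 1: ((A,B),C,(D,E)); tree 2: ((A,C),B,(D,E))
tree_1 = {canonical_split({"A", "B"}, taxa), canonical_split({"D", "E"}, taxa)}
tree_2 = {canonical_split({"A", "C"}, taxa), canonical_split({"D", "E"}, taxa)}
absolute, relative = rf_distance(tree_1, tree_2, len(taxa))
```

The two example trees share the split DE|ABC and differ in one split each, so the absolute distance is 2 out of a maximum of 4.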

lutteropp commented 3 years ago

I am adding the parameters used by Celine's simulator to the CSV output. But keep in mind that the exact same choice of simulation parameters shows extreme variance in the number of taxa and the number of reticulations obtained.

lutteropp commented 3 years ago

Added the simulator parameters to the results.csv in https://github.com/lutteropp/NetRAX/commit/09577b7222ffdbef51780344aa55592cd2139004, and updated the first message in this GitHub issue to list them as well.

lutteropp commented 3 years ago

Move type usage statistics (e.g., "we accepted 5 RNNI moves and 2 RSPR1 moves when doing network search from this start network") would be very helpful. NetRAX currently prints these to the console, but I don't yet have a centralized solution for collecting and summarizing this information.
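
One possible shape for such a collector, as a sketch on the Python side of the experiment scripts: accepted moves are recorded per start network and can be summarized or written out as CSV. The class name, method names, and output columns are hypothetical. How the move types would be extracted from the NetRAX console output is deliberately left open, since that format is not described here.

```python
import csv
import io
from collections import Counter


class MoveStats:
    """Accumulates accepted-move counts per start network (hypothetical)."""

    def __init__(self):
        self._counts = {}  # start network id -> Counter of move types

    def record(self, start_network, move_type):
        """Register one accepted move of the given type."""
        self._counts.setdefault(start_network, Counter())[move_type] += 1

    def totals(self):
        """Accepted moves per type, summed over all start networks."""
        total = Counter()
        for counter in self._counts.values():
            total.update(counter)
        return total

    def write_csv(self, handle):
        """Write one row per (start network, move type) pair."""
        writer = csv.writer(handle)
        writer.writerow(["start_network", "move_type", "accepted"])
        for start, counter in sorted(self._counts.items()):
            for move, count in sorted(counter.items()):
                writer.writerow([start, move, count])


# The example from the comment above: 5 RNNI and 2 RSPR1 accepted moves.
stats = MoveStats()
for _ in range(5):
    stats.record("start_0", "RNNI")
for _ in range(2):
    stats.record("start_0", "RSPR1")

out = io.StringIO()
stats.write_csv(out)
```

Keeping the counts per start network makes it possible to join them with the rows of the results CSV later.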

stamatak commented 3 years ago

That's maybe more something for the second paper, but it does make sense to think about a good way of collecting these data.

There is one paper where this was done systematically for Bayesian inference MCMC moves; it might serve as guidance:

https://academic.oup.com/sysbio/article/57/1/86/1704335
