lutteropp / NetRAX

Phylogenetic Network Inference without ILS
GNU General Public License v3.0

Detailed Future Work Ideas #85

Open lutteropp opened 3 years ago

lutteropp commented 3 years ago

Detailed future work ideas, such that we can remove them from the supplement:

\section{Detailed Future Work Ideas}

\subsection{Runtime Improvements}

\subsubsection{Improve virtual re-rooting during Branch-Length Optimization} Currently, we always reroot back to the original root before moving on to the next branch to optimize. A possible alternative solution: store which nodes were used as children when computing each DisplayedTreeData. It then makes sense to store multiple DisplayedTreeData entries at a node, with the same reticulation choices but different children sets. One then has to make sure to also check for a compatible children set when looking at a DisplayedTreeData coming from a child at the current node (it is only compatible if the current node does not appear in that children set). To get the final tree loglikelihood evaluation right, keep a flag stating whether a tree was newly added or not, and only evaluate using the newly added trees.
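A minimal data-layout sketch of this idea; the fields and the helper `isCompatible` are hypothetical stand-ins, not the actual NetRAX types:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical sketch: one displayed-tree partial result stored at a node.
struct DisplayedTreeData {
    std::vector<int> reticulation_choices; // which parent was taken at each reticulation
    std::vector<size_t> children;          // node indices used as children when this CLV was computed
    std::vector<double> clv;               // conditional likelihood vector
    bool newly_added = false;              // only newly added trees enter the final logl evaluation
};

// A child's DisplayedTreeData is only compatible with the current node if the
// current node itself does not appear in the child's stored children set.
bool isCompatible(const DisplayedTreeData& fromChild, size_t currentNode) {
    return std::find(fromChild.children.begin(), fromChild.children.end(),
                     currentNode) == fromChild.children.end();
}
```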

So far, after we finish optimizing a branch, we re-update the CLVs with respect to the original network root. This means we have virtual_root_1 $\rightarrow$ network_root $\rightarrow$ virtual_root_2 $\rightarrow$ network_root $\rightarrow$ virtual_root_3 $\rightarrow$ network_root.

To minimize the total number of CLVs to recompute, the solution proposed here implements virtual_root_1 $\rightarrow$ virtual_root_2 $\rightarrow$ virtual_root_3 $\rightarrow$ network_root instead.

\subsubsection{Speed up Branch-Length and Model Optimization} Do model optimization less often -- currently, we do it at the beginning, once at the end, and every time we accept a move. Another idea to try later: currently, NetRAX optimizes all branch lengths in the choosing phase. Maybe it is enough to only optimize the branches within a certain radius around the move. The speedup from this should not be very large though, as we typically have fewer than 10 candidate networks to score in the choosing phase. But maybe it also suffices to only optimize branches within a radius when accepting a move (currently, we do full model optimization, reticulation probability optimization, and full branch-length optimization whenever we accept a move).

While not used anywhere yet, NetRAX already offers functionality to compute the loglikelihood on just a subset of displayed trees. Explore whether we can use it in non-topology parameter optimization. When doing minor local optimizations, maybe it suffices to evaluate the likelihood over only the most probable displayed trees.
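As a sketch of what such a subsampling could look like, assuming we have the per-tree probabilities at hand (`selectTopTrees` is not an existing NetRAX function):

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical sketch: pick the indices of the most probable displayed trees
// until their cumulative probability reaches the given threshold.
std::vector<size_t> selectTopTrees(const std::vector<double>& treeProbs,
                                   double threshold = 0.95) {
    std::vector<size_t> order(treeProbs.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](size_t a, size_t b) { return treeProbs[a] > treeProbs[b]; });
    std::vector<size_t> chosen;
    double cumulative = 0.0;
    for (size_t idx : order) {
        chosen.push_back(idx);
        cumulative += treeProbs[idx];
        if (cumulative >= threshold) break;
    }
    return chosen;
}
```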

\subsubsection{Faster Candidate Filtering -- Pseudolikelihood} When scoring potential move candidates in the filtering phases of the network search algorithm, NetRAX frequently computes the phylogenetic network loglikelihood. We can speed up the initial candidate filtering by using a faster likelihood function, such as pseudolikelihood~\cite{nguyen2015likelihood}, as a proxy for the real network likelihood. One can also investigate how the (theoretically wrong) NEPAL likelihood (where blob optimization works and where we can define a network CLV) performs when we are in a horizontal move search round.

Alternatively: When enumerating all possible candidates for the given move type, use parsimony scores to decide which candidates to test.
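A possible shape for such a parsimony-based prefilter, sketched generically since the actual NetRAX move and scoring types may differ:

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical sketch: keep only the k candidates with the best (lowest)
// parsimony score for subsequent full likelihood-based scoring.
template <typename Move>
std::vector<Move> prefilterByParsimony(
        std::vector<Move> candidates,
        const std::function<double(const Move&)>& parsimonyScore,
        size_t k) {
    std::sort(candidates.begin(), candidates.end(),
              [&](const Move& a, const Move& b) {
                  return parsimonyScore(a) < parsimonyScore(b);
              });
    if (candidates.size() > k) {
        candidates.resize(k); // discard the less promising candidates
    }
    return candidates;
}
```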

\subsubsection{Using Ancestral States} Instead of enumerating all candidate networks for a given move type, we can compute ancestral states for each internal node in the network. Using these ancestral states, we can make informed predictions about which moves would lead to a higher-scoring network. We expect ancestral states to be especially useful for pre-identifying promising arc insertion move candidates. Moreover, when doing branch length optimization, we can use the genetic distance between ancestral states in the network before applying a move to provide an educated initial guess for newly inserted or redirected branch lengths.

Pre-identify the most promising move candidates this way, and try them first. If we have already found a highly promising, BIC-improving network this way, we may skip scoring the remaining move candidates.

We can either find a way to define ancestral states on a network directly, or make use of the ancestral states of its displayed trees. For computing the ancestral state genetic distance between two nodes in a network, we can then use, e.g., the minimum genetic distance between the corresponding ancestral states over the displayed trees, or a distance weighted by displayed tree probability.
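A small sketch of both aggregation variants (minimum vs. probability-weighted), with hypothetical types:

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// Hypothetical sketch: per displayed tree, the genetic distance between the
// ancestral states of two network nodes, plus the tree's probability.
struct TreeDistance {
    double distance;
    double probability;
};

// Option A (weighted == false): minimum distance over all displayed trees.
// Option B (weighted == true):  probability-weighted average distance.
double ancestralDistance(const std::vector<TreeDistance>& perTree, bool weighted) {
    if (weighted) {
        double num = 0.0, denom = 0.0;
        for (const auto& t : perTree) {
            num += t.probability * t.distance;
            denom += t.probability;
        }
        return denom > 0.0 ? num / denom : 0.0;
    }
    double best = std::numeric_limits<double>::infinity();
    for (const auto& t : perTree) {
        best = std::min(best, t.distance);
    }
    return best;
}
```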

\subsubsection{Hash already encountered Networks} In order to not revisit the same network more than once, we suggest keeping all accepted and already encountered network topologies during the search in a hash set (use an LRU cache to bound memory?).
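A minimal sketch, assuming we have some canonical string encoding of a network topology, so that isomorphic networks map to the same string (such an encoding is an assumption, not existing NetRAX functionality):

```cpp
#include <string>
#include <unordered_set>

// Hypothetical sketch: remember every topology we have already scored.
// The caller is assumed to pass a canonical encoding (e.g., a canonicalized
// Extended Newick string) that is identical for isomorphic rooted networks.
class SeenNetworks {
public:
    // Returns true if the topology was not seen before, and records it.
    bool insertIfNew(const std::string& canonicalForm) {
        return seen_.insert(canonicalForm).second;
    }

private:
    std::unordered_set<std::string> seen_;
};
```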

\subsubsection{Avoid isomorphic Networks -- Identifiability Issues} \begin{itemize} \item Discuss identifiability issues: \url{https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4388854/} \item Identifiability issue: different networks can have the same set of displayed trees $\rightarrow$ use graph isomorphism on rooted DAGs for checking network topology equality. \item Skip move candidates that would lead to an indistinguishable network. \end{itemize}

\subsubsection{Inferring large networks -- Divide and Conquer} The current NetRAX software is still too slow to perform network inference directly on large datasets with a potentially high number of reticulations. The number of candidate moves to evaluate in each filtering step grows with the number of taxa in the dataset. Thus, we propose the following pipeline: \begin{enumerate} \item Infer a maximum-likelihood phylogenetic tree on the dataset, using RAxML-NG. \item Delimit species on this tree using mPTP.\@ \item Collapse each delimited species in the tree, replacing each species by its most recent common ancestor. Use RAxML-NG to compute ancestral state sequences for the species roots, and use them in the reduced MSA.\@ \item Using NetRAX, infer a phylogenetic network on this reduced dataset. \item Uncollapse the species in the inferred reduced network, replacing each species root with its species subtree. \item Perform a last NetRAX search on the uncollapsed network, in order to refine the final network. \end{enumerate} We expect that the network does not change much in the last pipeline step. Intuitively, we expect the placement of reticulations to undergo only minor changes. Thus, we propose restricting the set of candidate moves in the final refinement NetRAX run to moves that relocate a reticulation parent or child within a small radius. The underlying assumption of our proposed solution is that reticulation events are unlikely to have split up individuals of the same species.

An alternative implementation of the above idea: \begin{enumerate} \item Infer the RAxML-NG best tree on the entire dataset. \item Run mPTP species delimitation on it. \item Call NetRAX with both the RAxML-NG best tree and the mPTP species delimitation output file. \item In the NetRAX run, do the following: \begin{enumerate} \item For each delimited species root, mark all nodes in the subtree rooted at the species root as FORBIDDEN.\@ \item Do a network inference search where we do not allow moves to include any of the FORBIDDEN nodes. \item Remove the FORBIDDEN mark from all nodes. \item Do a second network inference search run, starting from the best network inferred so far. \end{enumerate} \end{enumerate}
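A sketch of the FORBIDDEN marking step, assuming a simple adjacency-list representation of the network (the real NetRAX data structures will differ):

```cpp
#include <cstddef>
#include <stack>
#include <vector>

// Hypothetical sketch: mark every node in the subtree below each delimited
// species root as FORBIDDEN. The `forbidden` vector is assumed to be
// pre-sized to the number of nodes and initialized to false; move candidates
// touching a FORBIDDEN node are then rejected during candidate enumeration.
void markForbidden(const std::vector<std::vector<size_t>>& children,
                   const std::vector<size_t>& speciesRoots,
                   std::vector<bool>& forbidden) {
    for (size_t root : speciesRoots) {
        std::stack<size_t> todo;
        todo.push(root);
        while (!todo.empty()) {
            size_t node = todo.top();
            todo.pop();
            forbidden[node] = true;
            for (size_t child : children[node]) {
                todo.push(child);
            }
        }
    }
}
```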

\subsubsection{Actually collapse simple paths instead of using fake nodes} If we want to reduce the number of CLV update operations and are willing to increase code complexity, we can collapse simple paths on a per-displayed-tree basis.

\subsubsection{Improve Dealing with Numerical Issues} Avoid using the slow MPFR multiprecision library: implement Kahan's summation algorithm (\url{https://en.wikipedia.org/wiki/Kahan_summation_algorithm}) and compute faster logarithms.
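Kahan summation itself is straightforward; a self-contained version that could, for instance, replace a naive per-site loglikelihood accumulation looks like this:

```cpp
#include <vector>

// Kahan compensated summation: tracks the running rounding error in a
// separate compensation term, greatly reducing the accumulated error
// compared to naively summing many small values of differing magnitude.
double kahanSum(const std::vector<double>& values) {
    double sum = 0.0;
    double compensation = 0.0; // running lost low-order bits
    for (double v : values) {
        double y = v - compensation;
        double t = sum + y;
        compensation = (t - sum) - y; // recovers what was lost in (sum + y)
        sum = t;
    }
    return sum;
}
```

Note that aggressive compiler flags such as `-ffast-math` may optimize the compensation away, so the summation routine needs to be compiled with strict floating-point semantics.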

\subsubsection{Improve Scalability} In our current NetRAX MPI parallelization, we parallelize over MSA sites, as raxml-ng does. This is bad, as we need at least 1000 sites per node. At the same time, we evaluate lots of moves, and our only goal is to decide on the single best move we are going to accept and commit to.

A better parallelization scheme (allowing us to use more machines) would be to parallelize over the move candidates \textbf{and} the sites. In this setup, we partition the set of machines into subsets. Each subset operates on a different set of move candidates; within the machines belonging to the same subset, we parallelize over sites. The amount of data to send around via MPI remains low: just $O(1)$ numbers, plus telling each machine which move we globally accepted, so that it can update its network. For the subsets, we need dynamic load balancing, as subset assignments change depending on the number of move candidates we are currently evaluating. If we combine this with a better search algorithm (such as starting parallel searches that commit to alternative promising moves), the number of machines we can productively use becomes nearly unlimited.
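A sketch of how the two-level scheme could be set up with standard MPI communicator splitting; the round-robin color assignment is just a placeholder for proper dynamic load balancing:

```cpp
#include <mpi.h>

// Hypothetical sketch of the two-level scheme: split all ranks into
// numSubsets groups; each group scores a different slice of the move
// candidates, and per-site likelihood parallelism happens within the group.
MPI_Comm makeCandidateGroup(int numSubsets) {
    int worldRank, worldSize;
    MPI_Comm_rank(MPI_COMM_WORLD, &worldRank);
    MPI_Comm_size(MPI_COMM_WORLD, &worldSize);

    // Placeholder assignment: rank i works on candidate subset i mod numSubsets.
    int color = worldRank % numSubsets;

    MPI_Comm groupComm;
    MPI_Comm_split(MPI_COMM_WORLD, color, worldRank, &groupComm);
    return groupComm; // site-level parallelism runs inside groupComm
}
```

The globally accepted move can then be broadcast over `MPI_COMM_WORLD`, while all likelihood reductions stay within each group communicator.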

\subsection{Improving Inference Quality, Escaping Local Optima}

\subsubsection{Avoid local optima} Replace the simple greedy network search algorithm with reversible-jump MCMC or a simulated annealing approach.
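For illustration, a simulated-annealing acceptance rule on BIC could look like this (the temperature schedule and all names are assumptions, not existing NetRAX code):

```cpp
#include <cmath>
#include <random>

// Hypothetical sketch of a Metropolis-style acceptance rule on BIC:
// always accept improvements; accept worsening moves with a probability
// that decays with the BIC penalty and the current temperature.
bool acceptMove(double oldBic, double newBic, double temperature,
                std::mt19937& rng) {
    if (newBic < oldBic) return true; // lower BIC is better
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    return unif(rng) < std::exp((oldBic - newBic) / temperature);
}
```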

\subsubsection{Improving Inference Quality -- Promising State Queue} NetRAX currently does not provide checkpointing. After adding checkpointing functionality that allows us to restart the search from any intermediate search state, we propose the following approach to improve inference quality: keep alternative choices in a PromisingStateQueue, implemented as a priority queue that scores each promising state by its BIC. When we have to choose 1 out of 5 highly promising candidates in the choosing phase, why discard the others? In order to not revisit the same network more than once, we again keep all accepted and already encountered network topologies in a hash set (see above). This way, when restarting the search from a promising state other than the one taken in the initial search phase, we avoid revisiting the same network topology more than once and can cut short the redundant repeated search steps that would follow from such a network. We expect the PromisingStateQueue approach to help NetRAX avoid getting stuck in local optima, while at the same time leading to a faster total inference time compared to entirely restarting the search from multiple random or parsimony trees.
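A minimal sketch of the PromisingStateQueue as a BIC-ordered priority queue (the stored state representation is hypothetical; a Newick string is just one option that checkpointing could restart from):

```cpp
#include <queue>
#include <string>
#include <vector>

// Hypothetical sketch: a search state we may want to come back to later.
struct PromisingState {
    double bic;
    std::string networkNewick; // enough to restart the search from this state
};

// Comparator so that the state with the lowest (best) BIC is on top.
struct WorseBic {
    bool operator()(const PromisingState& a, const PromisingState& b) const {
        return a.bic > b.bic; // min-heap on BIC
    }
};

using PromisingStateQueue =
    std::priority_queue<PromisingState, std::vector<PromisingState>, WorseBic>;
```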

\subsubsection{Improve Network Root Choice} ... Midpoint rooting on the start tree?

\subsubsection{Scrambling} Try to escape local optima by scrambling the best found network.

\subsubsection{Better start networks} Start from maximum parsimony networks.

\subsubsection{Delay BIC Decision} I figured out why NetRAX sometimes infers one reticulation less than the true number of reticulations: it is a design flaw in the search algorithm! In the current version, the search stops if the best arc insertion move found led to a worse BIC than the network we had before -- even if that arc insertion move led to a slightly better loglikelihood! Instead, what we should do is perform an additional search round where we take the arc insertion move that worsened the BIC the least, and do a full round of horizontal moves on it. Only after that should we decide whether to stop the search or not. This change would require some medium-sized code changes though, so I am just adding it to the Future Work section.
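The proposed control-flow change, sketched generically (all helper functions are stand-ins for the actual NetRAX search routines):

```cpp
#include <functional>

// Hypothetical sketch: before stopping, give the least-BIC-worsening arc
// insertion a full horizontal-move round, and only then compare BIC scores.
template <typename Network>
Network delayedBicDecision(
        const Network& best,
        const Network& afterLeastWorseningInsertion,
        const std::function<Network(const Network&)>& horizontalMoveRound,
        const std::function<double(const Network&)>& bic) {
    Network refined = horizontalMoveRound(afterLeastWorseningInsertion);
    // Keep the refined network only if the detour actually paid off.
    return bic(refined) < bic(best) ? refined : best;
}
```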

\subsection{Additional Features}

\subsubsection{More Network Likelihood Definitions} Support more network likelihood definitions. Keep in mind that there is work which uses HMMs to model site dependency: \url{https://pdfs.semanticscholar.org/592f/40608fb1fe5b642d790b054551e1e7ef6135.pdf}

\subsubsection{Improve Simulation of Phylogenetic Networks} Do more simulation experiments, using three data sources (Celine's simulator, Sarah's simulator, and empirical datasets).

\subsubsection{Support scaled branch lengths model} Most of the NetRAX codebase already supports the scaled branch lengths model (where partitions share branch lengths, but each partition has its own scaling factor). Only the function that computes displayed tree likelihood derivatives still contains an unresolved TODO.

\subsubsection{Improve General Software Quality and User-Friendliness} \begin{itemize} \item Implement checkpointing. \item Add CI/CD and better unit tests. \item Improve software quality (more unit tests, check with softwipe, test whether it works on a Mac). \item Add support for PTHREADS.\@ Currently, only the MPI version works; the problem is likely somewhere in the CMakeLists.txt. \item Provide a graphical user interface, showing BIC progress plots and automatically drawing the network after each accepted move -- maybe even animate the current network, showing how the moves get performed. \item Improve user friendliness: write a user manual, make a website for NetRAX, offer a webservice, support more data formats, add Python and R bindings, and distribute NetRAX where biologists can find it (Bioconda etc.). \end{itemize}

lutteropp commented 2 years ago

I learned about an empirical mouse genomes dataset here: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1277-1 ... apparently mouse ancestors also didn't care about species boundaries and happily exchanged genetic information. Still less crazy than wheat, though. I don't know the dataset size yet, but from the paper it sounded like only 2-3 reticulations. ... Might be interesting future work.