lutteropp / NetRAX

Phylogenetic Network Inference without ILS
GNU General Public License v3.0
17 stars 1 forks source link

Profiling a NetRAX run on a tree dataset #12

Open lutteropp opened 3 years ago

lutteropp commented 3 years ago

I did a profiling run with NetRAX on a 1000-site MSA (belonging to a simulated 8-taxon tree), with 1 random and 1 parsimony starting tree. It turns out that trying arc insertion moves is pretty slow (simply because the 1-arc-insertion-move neighborhood of a network is quite large), but most of the runtime is spent in branch-length optimization using the Brent method (because it does tons of likelihood recomputations).

[Attached image: netrax_callgraph]
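For illustration only, a minimal, self-contained sketch (not NetRAX code) of why Brent-style single-branch optimization racks up likelihood evaluations: it fits a single Jukes-Cantor branch length to made-up mismatch counts with SciPy's Brent implementation and counts the likelihood calls. In NetRAX, every such call corresponds to a (partial) likelihood recomputation on the network.

```python
# Sketch only: a toy single-branch JC69 likelihood optimized with Brent's
# method, to show how many likelihood evaluations one brlen pass needs.
# The numbers (1000 sites, 120 mismatches) are invented for the example.
import math
from scipy.optimize import minimize_scalar

n_sites, n_mismatches = 1000, 120
calls = 0

def neg_log_likelihood(x):
    """Negative JC69 log-likelihood of one branch; x = log(branch length)."""
    global calls
    calls += 1
    t = math.exp(x)                                  # optimize on log scale so t stays > 0
    p_same = 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)  # JC69 P(identical state)
    p_diff = 0.25 - 0.25 * math.exp(-4.0 * t / 3.0)  # JC69 P(one specific other state)
    return -((n_sites - n_mismatches) * math.log(0.25 * p_same)
             + n_mismatches * math.log(0.25 * p_diff))

res = minimize_scalar(neg_log_likelihood, bracket=(-5.0, 0.0), method="brent")
print(f"MLE branch length ~ {math.exp(res.x):.4f} after {calls} likelihood evaluations")
```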

lutteropp commented 3 years ago

Given that ArcInsertion moves slow things down a lot (because the 1-arc-insertion-move neighborhood is pretty large), it might make sense to deactivate them in the search and only use the more restrictive subset of DeltaPlus moves.

Alternatively, I also had the idea of doing a search in waves, with a fixed number of reticulations in each wave...
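A rough sketch of how such a wave-based loop could look. The callables `improve_fixed` and `add_reticulation` are placeholders, not existing NetRAX functions, and lower scores are assumed to be better (e.g. BIC or negative log-likelihood):

```python
# Sketch of a wave-based search, assuming lower scores are better.
# `improve_fixed` improves the topology without changing the reticulation
# count; `add_reticulation` inserts one reticulation. Both are placeholders.
def search_in_waves(start_tree, improve_fixed, add_reticulation, max_reticulations):
    best, best_score = improve_fixed(start_tree)     # wave 0: plain tree search
    for _ in range(max_reticulations):
        candidate = add_reticulation(best)           # open the next wave
        candidate, candidate_score = improve_fixed(candidate)
        if candidate_score >= best_score:
            break                                    # the extra reticulation did not pay off
        best, best_score = candidate, candidate_score
    return best, best_score
```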

lutteropp commented 3 years ago

I wonder if there is a way to build a much smaller, but still somehow similar, MSA from the one given to NetRAX. Something like: "Given this original MSA, reduce it to an MSA with only 100 sites. Do low-resolution likelihood computations on this massively shrunken MSA during intermediate branch-length optimization." One could even do this hierarchically, with different amounts of compression...
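For example, a small, generic column-subsampling sketch (plain Python, not part of NetRAX) that shrinks a FASTA MSA to a fixed number of randomly chosen sites; the file name and the target size of 100 sites are just example values:

```python
# Sketch only: shrink a FASTA MSA to a fixed number of randomly chosen columns.
import random

def read_fasta(path):
    """Minimal FASTA reader: returns a dict taxon -> aligned sequence."""
    seqs, name = {}, None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:].split()[0]
                seqs[name] = []
            elif name is not None:
                seqs[name].append(line)
    return {taxon: "".join(parts) for taxon, parts in seqs.items()}

def subsample_msa(msa, n_sites, seed=42):
    """Randomly keep n_sites alignment columns (without replacement)."""
    total = len(next(iter(msa.values())))
    columns = sorted(random.Random(seed).sample(range(total), n_sites))
    return {taxon: "".join(seq[c] for c in columns) for taxon, seq in msa.items()}

# msa = read_fasta("original.fasta")
# small = subsample_msa(msa, 100)   # low-resolution MSA for intermediate brlen optimization
```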

lutteropp commented 3 years ago

Other improvement ideas include:

lutteropp commented 3 years ago

Yet another optimization idea: If adding reticulations didn't work out with a previous random starting tree, don't try adding them again with the next one...

stamatak commented 3 years ago

Those ideas all sound good, but I would really just try to focus on running some brute-force experiments now; we can worry about this later on. I would also not deactivate the arc insertion moves for the time being, although they might not be required for such small trees ...


celinescornavacca commented 3 years ago

I agree with Alexis. We are now testing whether the method recovers networks; we will improve the speed later.

lutteropp commented 3 years ago

I agree that these optimizations are for the second paper. I wanted to post them as a discoverable GitHub issue nonetheless.

lutteropp commented 3 years ago

About the arc insertion moves: the 1-arc-insertion-move neighborhood is extremely large (O(n^3) neighbors). Combined with running brute force single-threaded and always re-optimizing branch lengths, this will take weeks on the cluster...

At least one of the following would bring the runtime required for the experiments down to a much more reasonable value: not trying more arc insertions than worked well for the previous start network, replacing the general arc insertion moves with the more restrictive DeltaPlus moves (these add reticulations as well), doing fewer brlen-optimization calls, or running each network inference in its own thread.

Just for comparison: A single NetRAX run (with 20 starting trees) on a tiny simulated tree dataset with 10 taxa and 500 MSA sites took 120 seconds!!!
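A back-of-envelope sketch of why the O(n^3) neighborhood is the dominant factor; every constant below (likelihood evaluations per brlen pass, seconds per evaluation) is an assumed illustration value, not a measured NetRAX number:

```python
# Back-of-envelope only: all constants are assumptions for illustration.
def seconds_per_step(n_taxa, lik_evals_per_brlen_opt=50, seconds_per_lik_eval=1e-3):
    """Estimated cost of scoring the full ~n^3 ArcInsertion neighborhood,
    with a branch-length re-optimization after every tried move."""
    neighbors = n_taxa ** 3
    return neighbors * lik_evals_per_brlen_opt * seconds_per_lik_eval

for n in (10, 20, 40, 80):
    print(f"{n:3d} taxa: ~{seconds_per_step(n) / 3600:6.1f} hours per search step")
```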

lutteropp commented 3 years ago

At the very least, I need to parallelize the NetRAX calls from within the Python script used for the experiments...
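One possible way to do that from the experiment script, sketched with a thread pool around subprocess calls. The actual NetRAX command lines are omitted here because they depend on the experiment setup:

```python
# Sketch: run several independent NetRAX inferences concurrently from the
# experiment script. Threads suffice because each job only waits on an
# external process. The command lines themselves are placeholders.
import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_netrax(cmd):
    """Run one NetRAX inference as an external process."""
    return subprocess.run(cmd, capture_output=True, text=True)

commands = [
    # ["./netrax", ...],  # one entry per dataset / starting tree, filled in by the script
]

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(run_netrax, cmd): cmd for cmd in commands}
    for future in as_completed(futures):
        result = future.result()
        status = "ok" if result.returncode == 0 else f"failed ({result.returncode})"
        print(" ".join(futures[future]), "->", status)
```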

lutteropp commented 3 years ago

This commit (https://github.com/lutteropp/NetRAX/commit/44a603acb5077826a462e52ef33fc318ac67e536) reduced the runtime required for a NetRAX search a lot. It redefines the maximum number of reticulations to be the number of reticulations found in the best network inferred from the first start tree.

This makes perfect sense for inferring trees, and I believe it also makes perfect sense for inferring networks.
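A hypothetical driver loop illustrating the idea; the helpers `infer_network` and `count_reticulations` are placeholders for the corresponding NetRAX steps, not actual API, and lower scores are assumed to be better:

```python
# Hypothetical driver mirroring the idea in commit 44a603a: after the first
# start tree, cap the reticulation budget at what that search actually used.
def infer_from_start_trees(start_trees, infer_network, count_reticulations,
                           initial_max_reticulations):
    max_ret = initial_max_reticulations
    best, best_score = None, float("inf")
    for i, tree in enumerate(start_trees):
        network, score = infer_network(tree, max_reticulations=max_ret)
        if i == 0:
            max_ret = count_reticulations(network)   # tighten the budget for later start trees
        if score < best_score:
            best, best_score = network, score
    return best, best_score
```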

stamatak commented 3 years ago

that's not a bad execution time


stamatak commented 3 years ago

I don't understand why this makes sense for networks; aren't we restricting the number of reticulations too much? Maybe I am missing something ...


lutteropp commented 3 years ago

120 seconds for a tiny 10-taxon tree with 500 MSA sites is one thing. But as soon as we have more taxa, or even more reticulations, the runtime will explode with the current version (especially due to the O(n^3) 1-move neighborhood size of the very general ArcInsertion move type, combined with tons of brlen-optimization calls after trying each move).

Regarding the "restrict maximum number of reticulations to try based on inferred network from first starting tree" idea: This is using the assumption that different start trees don't lead to different number of reticulations in the end result - only to otherwise differing topology. Of course, this assumption needs to be empirically validated. For now, I have abandoned this improvement idea. But it could make sense for the second paper to try it again.

stamatak commented 3 years ago

> 120 seconds for a tiny 10-taxon tree with 500 MSA sites is one thing. But as soon as we have more taxa, or even more reticulations, the runtime will explode with the current version.

Well, okay, I expected much worse.

> Regarding the "restrict the maximum number of reticulations to try based on the network inferred from the first starting tree" idea: this relies on the assumption that different start trees don't lead to different numbers of reticulations in the end result, only to an otherwise differing topology.

Okay, now it makes sense. I was just not aware of the assumption you made; I just had the impression that there must be one.


celinescornavacca commented 3 years ago

For sure, we will need to get smarter for the next paper, e.g. by trying to score moves with parsimony etc. (more fun!). For now, let's just focus on the recovery performance.
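One cheap way to use such a proxy score would be a two-stage move evaluation, sketched below with placeholder callables (`cheap_score`, e.g. a parsimony score, and `full_score`, the expensive likelihood plus brlen optimization); this is only an illustration of the idea, not an implemented NetRAX feature:

```python
# Sketch of two-stage move evaluation; `cheap_score` and `full_score` are
# placeholder callables for a parsimony proxy and the full likelihood cost.
def prefilter_moves(moves, cheap_score, full_score, keep_fraction=0.05):
    ranked = sorted(moves, key=cheap_score)                 # lower proxy score = more promising
    shortlist = ranked[:max(1, int(len(ranked) * keep_fraction))]
    return min(shortlist, key=full_score)                   # pick the best under the real objective
```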