lutteropp / NetRAX

Phylogenetic Network Inference without ILS
GNU General Public License v3.0
17 stars 1 forks source link

Comments of Sarah's document #2

Open celinescornavacca opened 3 years ago

celinescornavacca commented 3 years ago

In the text, you will find my comments. Please use Adobe Reader to see them

NetRAX___Phylogenetic_Network_Inference.pdf

I want to stress here that our definition 3 correspond to Eq 4 in the NEPAL paper, but we could also use Eq 5 and not use the weighted average but the best tree as in their Eq 5. I wonder if you code this too (it should be easy) and see what performs the best. I actually do not know what will perform the best. I guess it is a modelling choice.

An idea for a next version: I think that the blob traversal can still be used if we consider a few (to define what is few with simulations) columns of a same partition at once, and not a single character. If you have a lot of reticulations, it may save times, because you have to consider more than one column at once, but you can you the blob optimisation on it.

lutteropp commented 3 years ago

Just for cross-referencing: The document can be found and edited here: https://www.overleaf.com/8389195984mmncnkcgrfwh

I added all comments in red text color, to be removed one-by-one after I went through them/ changed the write-up accordingly. Your comments did also show in Evince Document Viewer.

celinescornavacca commented 3 years ago

Thanks a lot for the draft. Sections 6.1, 6.2, 8.2, 8.3 in my option still need some work for the reader to follow all the details of your hard work. netrax_paper_210709_181201.pdf

lutteropp commented 3 years ago

Hi Celine,

Thank you for your very detailed and helpful comments.

There are some comments where I do not understand how to respond to them. The remaining comments are clear to me.

I wrote: The probability P(T|N) of a displayed tree T in a phylogenetic network N is the product of the probabilities of the hybridization edges that display T. Your comment says: It is not clear that you want to compute the probability wrt. to an alignment. My confusion is: But the product of reticulation probabilities has nothing to do with the alignment?? It's just a branch probability we have in addition to a branch length... Do you want me to state the complete network likelihood function definition already in the motivation section? I think the comment mixed up displayed tree probability (where we just take the product of reticulation probabilities) and displayed tree likelihood (where we use the MSA). My suggested action is: Emphasize that we mean the probability of displaying the displayed tree here.

Your comment says: What about scaled? Why don't we support it? See answer here: https://github.com/lutteropp/NetRAX/issues/46#issuecomment-865705453 (It is nearly fully implemented, but one implementation question remained open. @stamatak said I should just kick scaled branches out of the current NetRAX version) My suggested action is: Do nothing about it.

Your comment says: It is not clear how these CLVs are computed on the network and how they are used to compute the actual likelihood [...]. An example on a small network is neccessary. My confusion is: Are you sure I should explain the Felsenstein algorithm and how phylogenetic tree likelihood is computed by using CLVs? This is not my work, it is fully done in libpll. I also added a link to it to the manuscript already, where it is explained in the libpll-wiki. Because we do not have network CLVs. The CLVs are just for the displayed trees. My suggested action is: Remove the CLV picture and try to avoid mentioning CLVs, just say that these are an internal data structure used by libpll to compute loglikelihood of a phylogenetic tree via a post-order traversal of the tree.

I wrote: All displayed trees of a network share the same branch lengths. Your comment says: Remind that we use the linked model here. My confusion is: But regardless of the branch lengths model, the branch lengths are always the same in the network, no matter at which displayed trees we look at. The branch lengths belong to the partitions! My suggested action is: Add "Regardless of branch length model choice," to the sentence.

I wrote: Analogously, for networks, we need to recompute the CLVs which lie on any path between the old and new virtual root (both ends included). Your comment says: do not describe this using the past tense. My confusion is: I did not use past tense in this sentence. Maybe it didn't get clear that by "both ends included", I mean that we also need to recompute the CLVs for the old and new virtual root? My suggested action is: Replace "both ends included" by "This includes the CLVs for the old and new virtual root."

I wrote: 8.2 Virtually Rerooting a Phylogenetic Network. Your comment says: I couldn't follow this section. My confusion is: What can I do for making this section more accessible? Would it help if I add another subsection about how phylogenetic trees are virtually rerooted when it comes to branch length optimization, in order to reduce the number of CLVs to update? It is very specific to RAxML-NG and has to do with how the Felsenstein algorithm works (it does a bottom-up traversal on the tree for updating the CLVs, always using the child CLVs to compute the parent CLV). My suggested action is: Write a short paragraph in the beginning of the section, explaining that libpll computes the CLV for a node by combining the CLVs for its children nodes and we thus do virtual re-rooting to minimize the number of CLV updates, like is done for phylogenetic trees in RAxML-NG. Then say that when dealing with phylogenetic networks, we need to handle tricky situations due to changes in edge direction in the displayed trees after virtual re-rooting.

I wrote: So far, NetRAX cannot directly start a RAxML-NG tree search in order to initiate the network search from the best tree found by RAxML-NG. Your comment says: Why? I am just being curious. My answer is: It was just difficult to implement, and lots of work for very little benefit. RAxML-NG does not offer a library or so. My suggested action is: Leave it as it is.

Your comment is: [There are some question marks for the pseudocode algorithms I put in the draft for explaining how the candidate filtering and deciding for the move we want to apply and commit to works] My confusion is: What did you mean by the question marks? My suggested action is: Ignore the question marks, these pseudocodes will go to the Appendix anyway.

In general, I agree that all the pseudocodes will have to go to the Appendix. I am really unsure about whether we should explain Felsenstein algorithm and CLVs in detail in the paper. After all, I already linked to a really nice libpll-wiki post that explains them perfectly, and I did not change anything in there as they operate on a single phylogenetic tree.

lutteropp commented 3 years ago

I have updated my previous message with My suggested action is statements, to make it more actionable. Please open the message in the GitHub issue instead of replying to the e-mail that still has the outdated version.

... Do you agree with my suggested actions?

celinescornavacca commented 3 years ago

Here are my answers:

I wrote: The probability P(T|N) of a displayed tree T in a phylogenetic network N is the product of the probabilities of the hybridization edges that display T.
Your comment says: It is not clear that you want to compute the probability wrt. to an alignment.
My confusion is: But the product of reticulation probabilities has nothing to do with the alignment?? It's just a branch probability we have in addition to a branch length... Do you want me to state the complete network likelihood function definition already in the motivation section? I think the comment mixed up displayed tree probability (where we just take the product of reticulation probabilities) and displayed tree likelihood (where we use the MSA).
My suggested action is: Emphasize that we mean the probability of displaying the displayed tree here.

OK, but you have to say before this section what is the final goal of your tool. (Maybe in the introdution?) You have to walk the reader throw the subections in a paragraph before subsection 2.1

Your comment says: What about scaled? Why don't we support it?
See answer here: #46 (comment) (It is nearly fully implemented, but one implementation question remained open. @stamatak said I should just kick scaled branches out of the current NetRAX version)
My suggested action is: Do nothing about it.

My suggestion: mention it as future work, I think it is an important feature.

Your comment says: It is not clear how these CLVs are computed on the network and how they are used to compute the actual likelihood [...]. An example on a small network is neccessary.
My confusion is: Are you sure I should explain the Felsenstein algorithm and how phylogenetic tree likelihood is computed by using CLVs? This is not my work, it is fully done in libpll. I also added a link to it to the manuscript already, where it is explained in the libpll-wiki. Because we do not have network CLVs. The CLVs are just for the displayed trees.
My suggested action is: Remove the CLV picture and try to avoid mentioning CLVs, just say that these are an internal data structure used by libpll to compute loglikelihood of a phylogenetic tree via a post-order traversal of the tree.

I think you should explain what CLVs do without explaining how in details.

I wrote: All displayed trees of a network share the same branch lengths.
Your comment says: Remind that we use the linked model here.
My confusion is: But regardless of the branch lengths model, the branch lengths are always the same in the network, no matter at which displayed trees we look at. The branch lengths belong to the partitions!
My suggested action is: Add "Regardless of branch length model choice," to the sentence.

"The branch lengths belong to the partitions", exactement, I would like this to show in the general formula for the unlinked model. Now you do not see any dependecy of the parameters on the partitions.

I wrote: Analogously, for networks, we need to recompute the CLVs which lie on any path between the old and new virtual root (both ends included).
Your comment says: do not describe this using the past tense.
My confusion is: I did not use past tense in this sentence. Maybe it didn't get clear that by "both ends included", I mean that we also need to recompute the CLVs for the old and new virtual root?
My suggested action is: Replace "both ends included" by "This includes the CLVs for the old and new virtual root."

I was referring to the senctence after the one you quoted "The difficulty here is that while we had the fixed network root, "

I wrote: 8.2 Virtually Rerooting a Phylogenetic Network.
Your comment says: I couldn't follow this section.
My confusion is: What can I do for making this section more accessible? Would it help if I add another subsection about how phylogenetic trees are virtually rerooted when it comes to branch length optimization, in order to reduce the number of CLVs to update? It is very specific to RAxML-NG and has to do with how the Felsenstein algorithm works (it does a bottom-up traversal on the tree for updating the CLVs, always using the child CLVs to compute the parent CLV).
My suggested action is: Write a short paragraph in the beginning of the section, explaining that libpll computes the CLV for a node by combining the CLVs for its children nodes and we thus do virtual re-rooting to minimize the number of CLV updates, like is done for phylogenetic trees in RAxML-NG. Then say that when dealing with phylogenetic networks, we need to handle tricky situations due to changes in edge direction in the displayed trees after virtual re-rooting.

Again, I think you should explain what you do without explaining how in details. The important thing to explain here is the difficulty of doing the rerooting with networks and not with trees. Stay high level, explain the main ideas. This is a general suggestion, people do not often want to see the painful details ;)

I wrote: So far, NetRAX cannot directly start a RAxML-NG tree search in order to initiate the network search from the best tree found by RAxML-NG.
Your comment says: Why? I am just being curious.
My answer is: It was just difficult to implement, and lots of work for very little benefit. RAxML-NG does not offer a library or so.
My suggested action is: Leave it as it is.

OK.

Your comment is: [There are some question marks for the pseudocode algorithms I put in the draft for explaining how the candidate filtering and deciding for the move we want to apply and commit to works]
My confusion is: What did you mean by the question marks?
My suggested action is: Ignore the question marks, these pseudocodes will go to the Appendix anyway.

I find strange to write a pseudocode with functions that are not defined anywhere in the text. So my question marks were "where are the funtions?" If you do not want to detail them, just write a line saying what the do. But use words and not the function syntax, if you do not define them before their usage.

Thanks!

lutteropp commented 3 years ago

Thank you @celinescornavacca, this helps a lot. Now I know exactly how to proceed! :)

lutteropp commented 3 years ago

Thanks, I have taken all your comments into account and revised the write-up accordingly.

celinescornavacca commented 3 years ago

Thanks! Internet was down for 48h! Can you send me the overleaf link here? I lost it in Slack :(

lutteropp commented 3 years ago

Internet was down for 48h!

48 hours? Wow!

Can you send me the overleaf link here?

Link to the paper: https://www.overleaf.com/2575832168cydzfwcpjkvj Link to the supplement: https://www.overleaf.com/5323862377sdgjfsqvqscx

@stamatak Already commented on the parts of the draft that are ready yet, I am still working on improving the write-up according to his Overleaf comments. It probably makes no sense to re-iterate on reviewing the write-up before I am done with improving the text with regard to his comments.

I was still busy a lot with re-running some experiments, and resolving some counter-intuitive mystery observations. But now I believe the simulated data experiments are all finished. Thus, next step for me is to focus all my work on the paper draft. Improving the sections @stamatak commented on, and then continuing writing up the missing sections (largest chunk of work there is to summarize the 40 pages of experimental result plots and tables I produced and put into the supplement for now).

celinescornavacca commented 3 years ago

Thanks for the links, I am in the countryside and someone cut a cable while doing some construction work.

I will have a look but I will wait for the new version to make detailed comments, ok?

On July 30, 2021 12:39:02 PM GMT+02:00, Sarah Lutteropp @.***> wrote:

Internet was down for 48h!

48 hours? Wow!

Can you send me the overleaf link here?

Link to the paper: https://www.overleaf.com/2575832168cydzfwcpjkvj Link to the supplement: https://www.overleaf.com/5323862377sdgjfsqvqscx

@.*** Already commented on the parts of the draft that are ready yet, I am still working on improving the write-up according to his Overleaf comments. It probably makes no sense to re-iterate on reviewing the write-up before I am done with improving the text with regard to his comments.

I was still busy a lot with re-running some experiments, and resolving some counter-intuitive mystery observations. But now I believe the simulated data experiments are all finished. Thus, next step for me is to focus all my work on the paper draft. Improving the sections @.*** commented on, and then continuing writing up the missing sections (largest chunk of work there is to summarize the 40 pages of experimental result plots and tables I produced and put into the supplement for now).

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/lutteropp/NetRAX/issues/2#issuecomment-889805627

-- Sent from my Android device with K-9 Mail. Please excuse my brevity.

lutteropp commented 3 years ago

I will have a look but I will wait for the new version to make detailed comments, ok?

Okay! I am nearly done with remodeling the text following the comments by @stamatak. I will finish this today and continue working on the remaining sections over the weekend.

celinescornavacca commented 3 years ago

OK let me know when you are done. I will read it on Monday

celinescornavacca commented 3 years ago

Comments until page 8 (after the draft seems to be still under construction ;) The paper is getting ways nicer! I will rewrite the simulation part tomorrow morning, is it OK for you? NetRAX_Bioinformatics_Paper_210803_180711.pdf

lutteropp commented 3 years ago

Thanks! Please check out the newest version of the document on Overleaf before rewriting the simulation part. Quite a lot of things changed since your commented version (I cited the new tool, added a flowchart describing the search algorithm, shortened and improved the text about the simulator, ...).

What do you mean by those squiggly lines? That we need better language?

lutteropp commented 3 years ago

I'm not sure you saw the comments @stamatak made in the Overleaf document. I'm directly adding both your and Alexi's comments in red text to the write-up, because it appears to me that the comments get lost otherwise.

celinescornavacca commented 3 years ago

That we need better language?

Yes.

I will check the newer version then.

On August 3, 2021 11:56:22 PM GMT+02:00, Sarah Lutteropp @.***> wrote:

Thanks! Please check out the newest version of the document before rewriting the simulation part. Quite a lot of things changed since your commented version (I cited the new tool, added a flowchart describing the search algorithm, shortened and improved the text about the simulator).

What do you mean by those squiggly lines? That we need better language?

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/lutteropp/NetRAX/issues/2#issuecomment-892192514

-- Sent from my Android device with K-9 Mail. Please excuse my brevity.

lutteropp commented 3 years ago

I disagree with removing the "no arcs blabla present" annotations from the move pictures and moving them to the captions, as this wastes space in the writeup. But I anyway created new alternative images (_nolabel) without these annotations and put those infos into the captions, as you suggested. This way, we can decide on this later, depending on how much we need to shorten the document.

lutteropp commented 3 years ago

FYI: I am prioritizing finishing the remaining sections on the paper draft for now, because @stamatak needs a complete version of the draft Thursday morning.

celinescornavacca commented 3 years ago

@stamatak @lutteropp Are we aiming at 7 pages, bibliography included? And for the supp mat? How many pages should we keep?

stamatak commented 3 years ago

we aim for 8 pages, bib included,

for the supplement as many as required I would say,

alexis

On 04.08.21 11:34, celinescornavacca wrote:

@stamatak https://github.com/stamatak @lutteropp https://github.com/lutteropp Are we aiming at 7 pages, bibliography included? And for the supp mat? How many pages should we keep?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/lutteropp/NetRAX/issues/2#issuecomment-892473936, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGXB6UEZUIJT5CPF6ZAHBTT3D3STANCNFSM4TRAF6OQ. Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email.

-- Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology Affiliated Scientist, Evolutionary Genetics and Paleogenomics (EGP) lab, Institute of Molecular Biology and Biotechnology, Foundation for Research and Technology Hellas

www.exelixis-lab.org

celinescornavacca commented 3 years ago

OK, thanks. I will then shorten section 9 to the minimum.

celinescornavacca commented 3 years ago

@lutteropp can you post the sim script here such that I can check the values you cite in the paper? Thanks.

celinescornavacca commented 3 years ago

Also, did you change the values by yourself, or you kept the ones I gave you? (sorry but I forgot)

lutteropp commented 3 years ago

I copied and pasted the values I added to the paper from your simulation script. The script is here: https://github.com/lutteropp/NetRAX/blob/master/experiments/src/celine_simulator.py

Also, did you change the values by yourself, or you kept the ones I gave you? (sorry but I forgot)

I am pretty sure I kept the ones you gave me.

I have an additional outside script, where I discard unrecoverable networks returned by the simulator (https://github.com/lutteropp/NetRAX/blob/master/experiments/src/run_experiments.py#L23). Also, I overwrite the first-parent probabilities of the simulated network (https://github.com/lutteropp/NetRAX/blob/master/experiments/src/run_experiments.py#L123) for the experiments (for most experiments, I set all to 0.5).

lutteropp commented 3 years ago

we aim for 8 pages, bib included,

Only 8 pages? Oh my, I thought a paper has 15 pages... Okay, I will keep the experimental results and discussion as short as possible.

lutteropp commented 3 years ago

@stamatak @celinescornavacca Can we really submit to the Bioinformatics journal? I just saw this in their instructions for authors (https://academic.oup.com/bioinformatics/pages/instructions_for_authors):

Original Papers (up to 7 pages; this is approx. 5,000 words. excluding figures): Original papers that describe new research developments in computational molecular biology, for example: models, algorithms, software involving new methods, biological databases and network information services, and their impact on molecular biology or computer science. Actual biological data, as opposed to purely simulated data, must be used.

We will likely not have the empirical dataset inference finished on time. Those wheat genomes are crazy.

stamatak commented 3 years ago

On 04.08.21 12:51, Sarah Lutteropp wrote:

@stamatak https://github.com/stamatak @celinescornavacca https://github.com/celinescornavacca Can we really submit to the Bioinformatics journal? I just saw this in their instructions for authors:

I had assumed that you had read them already.

Original Papers (up to 7 pages; this is approx. 5,000 words. excluding figures): Original papers that describe new research developments in computational molecular biology, for example: models, algorithms, software involving new methods, biological databases and network information services, and their impact on molecular biology or computer science. Actual biological data, as opposed to purely simulated data, must be used.

We will likely not have the empirical dataset inference finished on time. Those wheat genomes are crazy.

Well, they are not too strict about this, but didn't we also have a smaller empirical dataset to experiment with?

Alexis

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/lutteropp/NetRAX/issues/2#issuecomment-892522866, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGXB6UU7I6UQAAQ5TSCKADT3EEQNANCNFSM4TRAF6OQ. Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email.

-- Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology Affiliated Scientist, Evolutionary Genetics and Paleogenomics (EGP) lab, Institute of Molecular Biology and Biotechnology, Foundation for Research and Technology Hellas

www.exelixis-lab.org

celinescornavacca commented 3 years ago

We will put what we have on the real data set at the time of the submission to show what the tool can run on real data set. It should be fine.

For the length, a lot of things can be shorten.

On August 4, 2021 11:51:02 AM GMT+02:00, Sarah Lutteropp @.> wrote: @. @celinescornavacca Can we really submit to the Bioinformatics

journal? I just saw this in their instructions for authors:

Original Papers (up to 7 pages; this is approx. 5,000 words. excluding figures): Original papers that describe new research developments in computational molecular biology, for example: models, algorithms, software involving new methods, biological databases and network information services, and their impact on molecular biology or computer science. Actual biological data, as opposed to purely simulated data, must be used.

We will likely not have the empirical dataset inference finished on time. Those wheat genomes are crazy.

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/lutteropp/NetRAX/issues/2#issuecomment-892522866

-- Sent from my Android device with K-9 Mail. Please excuse my brevity.

lutteropp commented 3 years ago

I had assumed that you had read them already.

I read them a long time ago (after all, they don't say many memorable things), no idea why I had 15 pages in memory. Maybe because some papers I read have 15 pages.