interaction tree / clade: summary and plots

caterinap commented 7 years ago

Now interaction_clade_tree_phylm is functional, but I still need to find a good way to summarize it. If we follow what happens in tree, we should give mean, CI_low and CI_high, but this leads to a long summary. Maybe we just give the standard error (or deviation) for each parameter? Still, that almost doubles the number of columns...

For the plots I was thinking to keep the same as the clade ones but add confidence intervals due to phylo variation around lines in both the scatterplots and the histograms. The histograms would include data on all trees as well (and have intervals around lines). We could also add std.errors (vertical and horizontal) on each dot in the scatterplot (but maybe too messy).

paternogbc commented 7 years ago

Hey Cat, I think a good way to solve these output problems is to focus on the original approaches from the non tree/intra functions. For example, in clade we are interested in checking whether clade removal causes a change different from that of a random removal of species. Thus in tree_clade we just want to do the same, but considering the variation in phylogeny (or in data, for intra_clade).

For influ, we want to check the most influential species considering the phylogenetic or data uncertainty, and so on for the others.

What do you think?

caterinap commented 7 years ago

Yes, ok, makes sense. In any case, people can still run the individual functions if they want the phylo/intra details. But the "considering the variation on phylogeny or data" is the tricky part. The simplest solution that I can think of is to give a mean + deviation over all trees (as in the tree function). So, for example, for the clade functions you would have mean.slope, se.slope, DFslope, se.DFslope, etc. Or, another possibility: in the summary we just show the means over all trees, and the parameter that actually accounts for uncertainty in tree/data would be the % change. And we can still show the variation in the plots.
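As a rough illustration of the "mean + deviation over all trees" idea, here is a minimal base R sketch. The per-tree estimates are simulated, and the names `estimates` and `summarise_across_trees` are hypothetical; this is not the sensiPhy summary code, only one way such a table could be built.

```r
# Hypothetical per-tree estimates: one row per tree (simulated, not sensiPhy output)
set.seed(1)
estimates <- data.frame(
  tree      = 1:20,
  slope     = rnorm(20, mean = 0.80, sd = 0.02),
  intercept = rnorm(20, mean = 1.50, sd = 0.05)
)

# Summarise one parameter across trees: mean, sd and a t-based 95% CI of the mean
summarise_across_trees <- function(x) {
  n  <- length(x)
  m  <- mean(x)
  s  <- sd(x)
  half_width <- qt(0.975, df = n - 1) * s / sqrt(n)
  c(mean = m, sd = s, CI_low = m - half_width, CI_high = m + half_width)
}

# One row per parameter, one column per summary statistic
summary_table <- t(sapply(estimates[c("slope", "intercept")], summarise_across_trees))
print(round(summary_table, 3))
```

Showing only `mean` and `sd` would keep the summary to two columns per parameter, whereas adding both CI bounds gives the longer table discussed above.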

caterinap commented 7 years ago

Still need to take a decision for clade_tree and clade_intra interaction: should we just plot a mean over all iterations or have intervals around the lines?

paternogbc commented 7 years ago

Not sure. We can leave it and discuss tomorrow I guess. The intervals might work, but I see a problem in calculating the randomization test across all simulations. We could just use the wider variation of the tree/intra simulations and continue with the standard approach of clade. But let's discuss this topic and decide the safest way to produce results.

caterinap commented 7 years ago

Ok, the summary for clade interactions now gives the % of times where it is not random:

(this would be the example for the paper, it shows that the results are robust to variation in tree topology, but some of the trees change the intercept (for Cebidae))

I did not modify the clade interaction plots yet. I am actually not sure about what we decided. Plotting the 95% intervals around the slopes might be a bit messy.

Now I am focussing on fixing small stuff and checking/changing summaries for all interaction functions.

paternogbc commented 7 years ago

Super nice @caterinap!!!! This result is very interesting indeed! =) It strongly suggests some biological mechanism related to this clade! Super cool! I think it will make the perfect example! Regarding the plot, let's keep the mean only and later we check if it makes sense to add the intervals, agree?

caterinap commented 7 years ago

Cannot agree more :)

gijsbertwerner commented 7 years ago

Great, really nice Cat! I will look into help functions later today :-)

caterinap commented 7 years ago

Cool, give me about an hour to fix a few things and I'll merge the branch into master! (In case it's useful, I put an interaction_functions table in the Drive to check what is done for which function.)

paternogbc commented 7 years ago

not so sure about the name "Non random (%)"... but I have no other idea.

caterinap commented 7 years ago

Yeah, me neither, but I could not find better.. feel free to change anything there!

gijsbertwerner commented 7 years ago

'Significant (%)'? Or 'Significant Change (%)'?

caterinap commented 7 years ago

I have a slight preference for 'Significant (%)' but both are better than 'Non random (%)'!!

paternogbc commented 7 years ago

In the samp method we use "% Significant"; we could use the same or change both to "Significant (%)" ;)

paternogbc commented 7 years ago

Nice, so working on the paper example. What do you think about this type of visualization for the tree_clade interaction?

It shows the estimated slopes after removing the clade Cebidae (red line) across different tree iterations. The black points represent the slope estimates for the random removal of species (null distribution) for each tree.

It clearly shows that this clade always lowers the slope well beyond the difference expected from clade size alone. Thus, it demonstrates that this effect (first shown with clade_phylm, the first paper example) still holds regardless of the phylogenetic hypothesis being considered. So there is something there!

That is why the Significant (%) = 100%! Because in all cases this clade removal deviates significantly from the null distribution.
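To make the "Significant (%)" idea concrete, here is a minimal base R sketch under stated assumptions: the slopes are simulated rather than fitted, the test is one-sided (the clade is assumed to lower the slope), and the names `obs_slope`, `null_slope` and `significant_pct` are hypothetical. It is not necessarily how sensiPhy computes the value.

```r
set.seed(42)
n_tree <- 10   # number of phylogenetic trees (hypothetical)
n_sim  <- 100  # random removals per tree (hypothetical)

# Simulated inputs standing in for real model fits:
# slope after removing the focal clade, one value per tree
obs_slope <- rnorm(n_tree, mean = 0.70, sd = 0.005)
# slopes after removing the same number of random species, n_sim values per tree
null_slope <- matrix(rnorm(n_tree * n_sim, mean = 0.80, sd = 0.01), nrow = n_tree)

# One-sided randomization p-value per tree:
# share of null slopes at or below the slope observed without the clade
p_value <- sapply(seq_len(n_tree), function(i) mean(null_slope[i, ] <= obs_slope[i]))

# Percentage of trees in which the clade removal differs from the null distribution
significant_pct <- 100 * mean(p_value < 0.05)
print(significant_pct)
```

A value of 100 would mean that, for every tree, removing the clade shifts the slope further than almost all random removals of the same size.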

caterinap commented 7 years ago

I think it is very nice! To be a bit more self-explanatory I would change the x-axis label into "Phylogenetic tree" or "Tree" and add in the legend that the black dots are the random draws (something like "dot=random").

paternogbc commented 7 years ago

New version with 100 trees and 100 n.sim!

caterinap commented 7 years ago

Super super cool!

gijsbertwerner commented 7 years ago

Wow! This is so cool! Love it :-)

paternogbc commented 7 years ago

Clade Cebidae: differs from the full data (decreases slope) and differs from the null expectation

Clade Cercopithecidae: differs from the full data (increases slope) but does not differ from the null expectation

Clade Lemuridae: differs from neither the full data nor the null expectation

paternogbc commented 7 years ago

Hey guys,

I developed this other graph to include together with the previous one.

The idea here is to show the general change in slope across all clades due to clade removal and compare it with the full data fit (black line). This would be the left graph, while the graphs above would go in the right panel.

paternogbc commented 7 years ago

PS: the different points represent runs with different trees and the big point represents the mean across trees for each clade. The idea is to get a graphical sense of the change caused by clade removal. So we can clearly see that Cebidae and Callitrichidae reduce the estimate while Cercopithecidae increases it.

paternogbc commented 7 years ago

The right panel would be this updated version. Please let me know what you think of both ;)

paternogbc commented 7 years ago

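Something like that:

Cebidae

[image] https://user-images.githubusercontent.com/9639481/29391333-f819fa70-8339-11e7-97b2-c9603e15f3b6.png

[image] https://user-images.githubusercontent.com/9639481/29391512-2b19fc58-833b-11e7-971f-4ffdadabc366.png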

paternogbc commented 7 years ago

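Maybe just leave the red point on the first graph to match with the red (without clade):

[image] https://user-images.githubusercontent.com/9639481/29391549-6fed6f22-833b-11e7-80f0-7bc724223b15.png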

paternogbc commented 7 years ago

We can also change the degree of dispersion in the jitter points

paternogbc commented 7 years ago

I would go for that:

paternogbc commented 7 years ago

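To solve the problem when too many trees are analyzed, I decided to remove the x-axis text when the number of trees is higher than 30.

[image] https://user-images.githubusercontent.com/9639481/29393204-cb4f37ec-8345-11e7-828d-64b12b7e6e27.png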
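The 30-tree rule could be sketched in base R as below. The helper name `tree_axis_labels` and its `max_labels` argument are hypothetical and only illustrate the rule; they are not part of sensi_plot().

```r
# Hypothetical helper: decide which x-axis labels to draw for a set of trees.
# Above `max_labels` trees, every tick label is replaced by an empty string.
tree_axis_labels <- function(tree_ids, max_labels = 30) {
  if (length(tree_ids) > max_labels) {
    rep("", length(tree_ids))  # too many trees: blank out the tick labels
  } else {
    as.character(tree_ids)     # few enough trees: keep their names
  }
}

tree_axis_labels(1:5)          # labels kept
tree_axis_labels(1:100)        # labels blanked
```

In a ggplot2 figure, a vector like this could presumably be passed to the `labels` argument of `scale_x_discrete()`, which keeps the tick positions but hides the text.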

Then users can explore by hand, with the raw data, which trees are changing the slope. I am developing this type of exploration for the online tutorial.

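Solution: [image] https://user-images.githubusercontent.com/9639481/29393210-dd96b3da-8345-11e7-8f56-8052ccea400b.png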

When trees are up to 30 we can keep the names:

paternogbc commented 7 years ago

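Finally, what do you think about highlighting the null distribution points that fall above or below the red line (to make it very clear how unique the clade removal is compared to the null expectation)?

Possible graph for the paper (all three together):

Clade Cebidae: differs from the full data (decreases slope) and differs from the null expectation [image] https://user-images.githubusercontent.com/9639481/29394574-974b4504-834e-11e7-92ef-9334a4448ba7.png

Clade Cercopithecidae: differs from the full data (increases slope) but does not differ from the null expectation [image] https://user-images.githubusercontent.com/9639481/29394552-66438c96-834e-11e7-90d7-ce514fb55786.png

Clade Lemuridae: differs from neither the full data nor the null expectation [image] https://user-images.githubusercontent.com/9639481/29394544-5983c098-834e-11e7-8e62-a0ba766f4973.png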

gijsbertwerner commented 7 years ago

Cool, very nice! I very much like the new figure you created :-)

Just so I understand correctly, these are two separate figures right? The top one for the Cebidae and the bottom for the Cerco’s.?

If it’s a four-panel figure, it might be a bit confusing that the (new) left figure is repeated twice?

gijsbertwerner commented 7 years ago

What would be an even clearer visual signal is if the dot plots were all in a given colour (let’s say greyish), and only the clade that is shown in more detail on the right, across all the trees, were plotted in red.

I am not sure if this is feasible, but if it is, it would give a really powerful visual signal that the one we are depicting in detail to the right is the same one as the one to the left in the same colour.

If not possible, I guess I would stick with former (i.e. different colours for each species), so people will not think there is a correspondence in terms of the colours.

gijsbertwerner commented 7 years ago

Cool, I agree! I guess that in terms of the dispersion/jitter ideally we would still want to be able to detect different trees individually (but this will be challenging with many trees). Could we set the jitter distance dynamically, depending on the number of trees plotted?
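One way a dynamic jitter distance could look, as a base R sketch: the width grows with the number of trees so individual trees stay distinguishable, but is capped so that points from neighbouring clades (assumed to sit one unit apart on the x axis) do not overlap. The function `jitter_width` and its default values are hypothetical.

```r
# Hypothetical rule: wider jitter for more trees, bounded on both sides
jitter_width <- function(n_tree, per_tree = 0.01, min_width = 0.05, max_width = 0.4) {
  min(max_width, max(min_width, per_tree * n_tree))
}

# Example: spread 50 tree estimates around clade position x = 1
set.seed(7)
n_tree <- 50
w <- jitter_width(n_tree)
x_jittered <- jitter(rep(1, n_tree), amount = w)  # uniform noise in [-w, w]
range(x_jittered)
```

The cap of 0.4 is an arbitrary choice for illustration; any value below 0.5 keeps each clade's points within its own half-unit band.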

gijsbertwerner commented 7 years ago

Yes, I like highlighting the replicates/dots that are above/below the null distribution, but worry a bit that they make the figure a bit chaotic and difficult to interpret. Perhaps it would help if they were also mentioned in the legend (as something like ‘Replicates above/below null distribution’), because now it’s the only colour that is not mentioned.

paternogbc commented 7 years ago

Hey @gijsbertwerner, thanks for the feedback.

So let's keep it simple. sensi_plot() for tree_clade will produce 2 graphs. The first compares all clade estimates with the full data estimate, which gives a general overview of each clade's influence on the estimate (left graph).

The second graph will focus on one clade, comparing the clade removal with the null distribution across all trees analyzed (right graph).

The command sensi_plot(clade_tree, clade = "Cercopithecidae") will produce this figure:

While the command sensi_plot(clade_tree, clade = "Cebidae") will produce this one:

I know this repeats the first graph for all clades, but the user can choose to print only the second graph (e.g. sensi_plot(clade_tree, graphs = 2, clade = "Cebidae")).

For the paper I think we have two options:

1: to use only the null distribution graphs for Cercopithecidae, Cebidae and Lemuridae. Something like that:

2: to use the null distribution of these clades plus the clade comparisons (first graph). Something like that:

Of course we would still polish that in Inkscape.

So let me know if you have any other suggestion regarding the function plots, then I can prepare a multiplot for the paper and we can discuss that graph in google docs ;)

gijsbertwerner commented 7 years ago

Looks good to me!

On 18 Aug 2017, at 07:39, Gustavo Paterno wrote:

Hey @gijsbertwerner, thanks for the feedback.

So lets keep it clean. =) What you think about that for the paper? [rplot pdf] https://user-images.githubusercontent.com/9639481/29445670-30c1050c-842b-11e7-973f-55fd49da0253.png

caterinap commented 7 years ago

All super nice!!! I like the second option where you see all clades + 3 clades with the trees. I am not sure how to show that the red line "significantly" differs from the black one or the null distribution (we can guess it but not be sure of it). What about the option we discussed to add a short text (in the graph) giving the % of significant or something else? Maybe too messy..

paternogbc commented 7 years ago

Ok. I had split the graphs into two figures (all clades and the blue graphs), but maybe it is better to use the option 2 approach. Regarding the red line, I can add the % of significant iterations at the top, together with the other info. I guess it is ok.

paternogbc commented 7 years ago

For example:

gijsbertwerner commented 7 years ago

Ah, sorry, I forgot to reply to the question of whether to go for option 1 or option 2 in the main text. I would certainly be in favour of option 2. I like how it combines everything in a single overview, and think that the second example should ideally have only a single figure (with panels), like the first. Definitely go for two!

Yes, let's include the percentage of significant iterations above the plot, I like it, and I think will really help clarify for the user (and in the ms)!

paternogbc commented 7 years ago

Perfect!

paternogbc / sensiPhy

interaction tree / clade: summary and plots #145
