lczech / gappa

A toolkit for analyzing and visualizing phylogenetic (placement) data
GNU General Public License v3.0
57 stars 7 forks source link

non-monotonic correlation #29

Closed blancobercial closed 3 weeks ago

blancobercial commented 5 months ago

Hello!

We are using gappa with metabarcoding of marine organisms and many of them they do not have a liner relationship with depth/density/temperature/etc., but they peak however an intermediate values (e.g., they will not be in 0-300, nor in 700-1000, but have the maxima only in the 500m range; both Pearson and Spearman will not find that significant). The available correlations assume a monotonic relationship, and I was wondering if it would be possible to add a non-monotonic correlation (Kendall, maybe?). Thank you for the software!

Best,

Leo

lczech commented 5 months ago

Hi Leo @blancobercial,

thanks for the suggestion - I wanted to implement Kendall at some point before, and now you gave me the incentive to do it :-). It was quite tricky to account for all those ties in the values that come from edges with no placements on them... But I managed. On the dev branch, there is now an additional --method kendall selection available for the command. It computes the tau-b variant; I have also implemented tau-a and tau-c, but decided to not offer them in the command interface now, to keep it simple. Please check out the dev branch and try if that works for you. It will be part of the next release version then as well.

However, as for your idea that this is truly a non-monotonic correlation: While looking into the implementation of Kendall's tau, I found some interesting sources. First, I found this thread, which states:

While Tau allows for nonmonotonic underlying curve and measures which monotonic "trend", positive or negative, prevails there overall.

So it might not be quite what you need, but might help. Second, this thread mentions the generalized additive model (GAM) as a solution. This seems a bit more involved though. If this is interesting for you, I could look into this.

Also, I wanted to mention the placement-factorization command, which uses a generalized linear model (GLM or GLiM, depending on literature) to determine relationships between placement masses and meta-data features. This is still linear though, and does not solve your issue, but it has the power to utilize multiple features at once, which might be interesting for you. I've always considered placement-factorization as a more advanced version of the edge correlation, as it basically does the same, but with multiple features, and nested factors. However, due to the use of the GLM instead of a plain correlation, the resulting values per edge are not quite as straight forward to interpret, as they express the fit of the model, instead of a correlation.

Lastly, an alternative for this whole approach that I am just now thinking about: How about transforming your meta-data feature? Maybe there is some function that makes sense to apply to your values that would "linearize" them with respect to what you want to achieve? Not quite sure if that makes sense statistically, but it could be a simple solution.

Let me know if that help, and what you think!

Cheers and so long Lucas

blancobercial commented 5 months ago

Hi Lucas

THANK YOU! SO fast!!

I am now traveling for a couple workshop + symposium, and my student is at present at sea. I will meet with her next week on the Ocean Sciences Meeting (she finishes her cruise right on time to get the plane to New Orleans) and we will give it a shot!

It is true that might not be perfect, and I was also thinking how to better represent what I look for (maybe average depth for the branches? Or how much each branch is distributed narrower?). I will give a read to the links… Our problem is that we have some lineages that are completely linked to the some specific layer… But how to get to that? Even within a class, some will be surface (great – negative with depth!) but the others are at 500 (so, in the middle of the profile) and others at 1000m (so perfect again – negative with depth). And still, the ones at surface will be zero below 200, and the ones in depth will be 0 from surface to 700 m.. And then BOOM!.

Anyhow, again _ THANK YOU. And, I will touch base again after next week. I am very interested in how to explore this!

Leo

lczech commented 5 months ago

Hi Leo @blancobercial,

so, I've thought a bit more about this, and also asked a friend statistician for some input. In your case, a GAM as mentioned before might be an overkill, as it is "too" powerful - it could fit any set of splines to your curve, which would give a good predictive fit, but make interpretation hard. Probably not what you want.

If I understand you correctly though, you assume that some form of roughly quadratic relationship exists between depth and abundance. That can be modeled with a GLM, as mentioned above. Let's say for example you have the following data:

image

where abundances follow a strict quadratic pattern with the depth (in my example, 0.4*x - 0.0004*x^2, to be exact, so that we get some intuitive numbers). Now, by using the first two columns as predictors and the third as the response in a GLM, we get a perfect fit of the model. The correlation command can only correlate abundances with a single variable though.

But as mentioned, the placement-factorization command uses a GLM, and can hence model this, by simply giving it the two first columns of the table as meta-data. (For simplicity of the command line interface, I have not implemented the usage of interaction terms and variable transformations in the command - in R, you would do abundance~depth+I(depth^2) or something like that. Here, instead, you simply have to calculate these terms beforehand and add them as columns to the meta-data table yourself.) Shameless self-advertisement: For details on Placement-Factorization, have a look at my dissertation, Chapter 7.

The output of that command is slightly different from the correlation command. What you are looking for are the tree figures called objective_values, which visualize the fit of the model for each branch of the tree. See for instance Figure 7.5 in my dissertation. As we are fitting abundances (or rather, balances thereof) to the meta-data (depth), this is thus similar to the trees produced by the correlation command. But instead of a correlation coefficient, these figures show model fit. Same idea though - you see a visualization of how strong the balance~depth+I(depth^2) model fits each branch (or whichever other columns you had in your meta-data). You won't be able to see "negative" correlations with this though, as the model fit is non-negative (the model would simple fit x with a negative coefficient then to maximize fit).

Using Placement-Factorization will furthermore also allow you to discover deeper relationships in your abundances. Correlation only gives you the single most important "factor" in your data - which would correspond to the first iteration of Placement-Factorization. But you can run the factorization further, getting nested clades that might also exhibit a relationship between abundances and depth, but that were overshadowed by a clade with a stronger relationship, or only exist nested in part of the tree, and were hence not visible in early rounds of the algorithm. Each further iteration "splits away" another clade from the tree, splitting it into smaller clades, each exhibiting a (nested) relationship with the meta-data.

If needed, I could extend the output of the Placement-Factorization to include more information on top of the model fit per branch, and for instance output the coefficients that were computed to fit GLM - that is, recover the 0.4*x - 0.0004*x^2 coefficients from above, for instance. That would be per branch, and could be interesting to analyze downstream. Not quite sure, but I think my implementation should allow for extracting that information - I'll have to check.

Let me know what you think, and very happy to hear about any experiments with that approach!

Cheers and so long Lucas

blancobercial commented 5 months ago

Hi Lucas

On a workshop Fri and Sat (I was flying today) but I will come back to you. I Was thinking a bit on the whole thing during the flight. Some distributions examples (sampling down to 1000m):

[Chart, surface chart Description automatically generated]

Meanwhile the above could be negative with depth, the middle one will appear as not correlated (similar to the example you sent). I will look at the documents you send carefully after my workshop and Ocean Sciences Meeting (starting Monday).

THANK YOU!!!

Leo

lczech commented 5 months ago

Hi Leo,

you seem to have answered via email instead of the GitHub page, and your image/chart was not uploaded. Could you please try again using the web interface (https://github.com/lczech/gappa/issues/29).

Thanks and so long Lcas

blancobercial commented 5 months ago

here you go! Yes from email :)

image

lczech commented 5 months ago

Hi Leo @blancobercial,

some news! I now implemented the computation of the GLM coefficients, and added this as output to placement-factorization. I've also finally used this as incentive to extend the documentation of that command, so that hopefully things there make more sense now.

So, what would that allow you to do? As mentioned, Placement-Factorization can be somewhat understood as an extension of the correlation command, but for multiple input meta-data features at once. This is due to the GLM being used there instead of a correlation coefficient to measure predictability of the abundances in each clade from the meta-data. In your case, you are interested in abundances ~ depth + depth^2, so you'd make a meta-data table with these two columns, and provide that to the command.

Then, as described here, the command will give you the GLM coefficients of the winning edge of each iteration of the algorithm. That means, in the first iteration (which is the most significant one), there will be one edge where the relationship between abundances and meta-data is the strongest. That's the "winner" of that iteration. The coefficients then describe how the GLM fitted your meta-data to the (balances of) the abundances of that edge. In my example from the post above, that would be 0.4*x - 0.0004*x^2, for instance. You can then use those coefficients to plot the relationship, and see how well it fits - there is a script provided there as well, see the doc page.

In your case though, I think you might want to just plot balances vs depth, and omit the depth^2 term in the plot, as otherwise, you'd see the linear fit of the model, but not the quadratic relationship you are looking for.

Hope that helps, cheers and so long Lucas

blancobercial commented 4 months ago

Hi Lucas

Back from sea, trying to catch up with life.

I will go now into this thread. I will ask the server people to update the package (we use the ASU server for all this work!).

More soon!

Leo

lczech commented 4 months ago

Okay, sounds good! It's currently only on the dev branch, and not yet a release version. Let me know how things go :-)

lczech commented 3 weeks ago

Hi @blancobercial,

I've just relased gappa v0.8.5, in order for the new Kendall's Tau variant to finally be in a release version as well. Nothing else changed, but thought I'd let you know.

I'll close the issue for now then, but if you have any more questions or suggestions, feel free to re-open or start a new one!

Cheers Lucas

blancobercial commented 3 weeks ago

Hello Lucas

THANK YOU! We are actually working on the data, and polishing some of the analyses. We have a very intense year of field sampling and laboratory work, so we cannot work on it continuously, but it is moving forward!

Best,

Leo