How to account for correlations between functions

FabianRoger commented 5 years ago

The Hill framework allows to incorporate the correlations structure between functions to down-weigh correlated functions. However, it is not obvious what a good metric of correlation would be. Here I move the mail discussion to an Issue:

@jebyrnes for correlation, we never totally agreed on a metric. I was trying to look at some matrix algebra solutions, but I’m wary of them. Other than what’s there, are there other solutions you think we should look at? Average correlation isn’t it. Average absolute value correlation gets weird. Determinants was where I was going, but I was never totally satisfied. Thoughts?

@robertbagchi I think the determinant works as long as we don't get into negative correlations - the issue I have struggled with is that perfect negative correlations and perfect positive correlations both converge on det(A) = 0, and det(A) is greatest when everything is correlated with everything else.

@jebyrnes Right, but don’t we want to be able to think about both?

@robertbagchi I think so, yes.

I have wondered about the dominant eigenvalue (tells us how much of the total variation is captured by one axis). However, it seems like the other eigenvalues should be considered too, because if the top two eigenvalues capture the majority of variation in EF then that seems a very different case from when many eigenvalues are similar.

More generally, I think there is another layer that needs to be considered. We can consider the correlations in EFs at the community scale (which is what Dooley did) or at the species scale. It seems to me that actually, increasing MF would involve increasing the correlation among different EFs at the community scale, but the potential for Biodiversity to increase that correlation would be greatest when there is little correlation among species contributions to EFs.

Lars and Fabian's paper has also convinced me that we can't ignore competition, because that affects whether increasing one EF comes at the expense of the other.

Potential for a BEMF relationship will increase as (1) the correlations among species contributions to EFs decrease - i.e. different species contribute to different functions (duh, yeah).

(2) Species that contribute to different EFs have weaker competition coefficients (because otherwise they replace each other, so a gain in one EF = a decrease in another).

BEMF might therefore be captured with an equation of the form A^(-1).F, where A is the competition matrix for the J species (dimension JxJ) and F is a matrix which describes each species contribution to EFs (dimension JxK). F could itself be a realisation of a multivariate normal distribution, with correlation matrix B (dimension KxK), which defines the trade-off between species contributions to different EFs. It seems like linking the matrices A and B would be the way to understand the potential for BEMF - perhaps by linking A to resource use.

FabianRoger commented 5 years ago

To be honest, you pretty much completely lost me.

My thoughts so far:

We want to implement the possibility to account for correlations between functions. The reason is that some functions could well be intrinsically correlated (say above and belowground BM) and we don't want to count them as two functions that are independently maximized but downweigh their contribution to overall MF. This is exactly analogous to calculation the trait / functional diversity of a species assemblage and Chao has an implementation in her 2014 paper .

One implementation of this is in the hillR package in the hill_func function. The function has a trait argument which takes a species x traits matrix. By default the trait matrix

will be transformed into a distance matrix using 'FD::gowdis(traits)'. If all traits are numeric, then it will use Euclidean distance.

but a symmetric distance matrix can be used instead.

@jebyrnes You mention that you used the vegetarian package but as far as I can see it doesn't implement the inclusion of a correlation matrix, or?

My understanding of your discussion is that we need to decide

What would be a good distance / correlation metric?
How to distinguish positive and negative correlations?
On what data the correlations should be calculated on?

and here I have to admit that I don't follow your discussion. Two thoughts though.

@robertbagchi your last point seem very interesting (but way over my head). Keep in mind thought that we are aiming for an MF metric that also works outside of the BEMF context. So the correlations among functions necessarily need to be calculated across plots in my opinion. Whether, in a BEMF experiments, that should be all plots or only the monocultures is, however, not a trivial question.
I am not sure to follow the dominant eigenvalue discussion but we had a discussion at our WS about the PCA metric and concluded that one problem with it was how to orient the axis to make biological sense. The way this is done is to look at the sign of the dominant eigenvalue and change the sign to fit the biological meaning. However, this changes also the sign of all other eigenvalues which might not be much less important, and eigenvalues for the same function on different axis could be changed in different directions. This makes things weird.

FabianRoger commented 5 years ago

Probably not in this paper but in the future the way to go might be this

FabianRoger commented 5 years ago

another thought:

For functional diversity the default in the hillR::hill_func package is euclidean distances. For our case that would be the euclidean distance between functions using plot values as input. I made a repex to compare the euclidean distance with the correlations for a matrix with both positive and negative correlations (see below). Unsurprisingly I guess there is a negative relationship with negative correlations giving the greatest euclidean distance and strong positive correlation giving the smallest ones. Is that what we want? I guess if two functions would measure the same thing but on inverted scales (say leafN and soilN ?) they should be accounted for as similar but probably it's up to the researcher to orient functions so that negative correlation really means trade-off (so leafN and soilN uptake). Then euclidean distance seems like a reasonable choice?

thoughts welcome.

repex

library(Matrix)
library(corrplot)

# number of functions
func_n <- 10

# make correlation matrix 
cormat <- func_n*(func_n-1)/2
cormat <- runif(cormat, min=-0.5, max=1)
M <- matrix(ncol = 10, nrow = 10)
M[upper.tri(M,diag=F)] <- cormat
M[lower.tri(M,diag=F)] <- t(M)[lower.tri(M, diag=F)]
diag(M) <- 1

#make covariance matrix
sd <- rnorm(10, 5 , 2)
M <- M* (sd %*% t(sd))

#find closest positive definite covarinace matrix
M <- nearPD(M)$mat

#draw function values for 100 plots and 10 functions with the specified covariance structure
Func <- MASS::mvrnorm(100, rep(0.5,10), Sigma = M)

#standardise by max
Func <- apply(Func, 2, function(x) x/max(x))

#plot corelation matrix
Func_cor <- cor(Func)
corrplot(Func_cor, method = "ellipse", type = "upper", diag = F)

#euclidian dist
Func_dist <- as.matrix(dist(t(Func), method = "euclidian"))[lower.tri(cor(Func))]

#scatterplot eucleadian distance vs corelation
plot(Func_cor[lower.tri(cor(Func))], Func_dist, xlab = "correlation", ylab = "distance")

corrplot correlation structure

correlation vs distance

jebyrnes commented 4 years ago

So, I've been going back and forth on this, and I think I have something more satisfying (finally!) using Euclidian vector distances. Here's the basic idea.

First, we want to differentiate between positive and netagive correlations. If a system is working such that ALL values are correlated - but half are positively correlated and half are negative, that's REALLY different than one where everything is positive. We need to be able to account for that. I see no way around looking at both positive and negative correlations. Which, really, I think plotting a treatment against both positive and negative metrics is fine, as long as we do some standardization so that, for example, if it's mostly positive or mostly negative, one has a higher value than the other.

So, here's what I'm thinking.

For a distance of a vector x with j elements

$d = \sqrt{\sum{x_i^2}}$

So, let's say we take the lower triangular of a correlation matrix without the diagonal. This yields n correlation elements. If we split it into positive and negative correlations, we can define a metric, c+ or c- such that

$c{+ or -} = \frac{d{+ or -}}{\sqrt{n}}$

So, we've standardized by the largest possible distance. If everything is perfectly correlated, we get 1 for the positive and 0 for the negative. If it's 50:50, I believe we should get 0.5 for both the positive and negative, no?

Does this get us to where we want to go? For example, consider these random correlation matrices.

library("clusterGeneration")
#> Loading required package: MASS
library(ggplot2)
library(tidyr)

#The function to get distances
get_cor_dist <- function(mat){
  out <- list()

  values <- mat[lower.tri(mat)]

  out$positive <- sqrt(sum(values[values>0]^2))/sqrt(length(values))
  out$negative <- sqrt(sum(values[values<0]^2))/sqrt(length(values))

  as.data.frame(out)
}

alphad <- rep(c(1e-100, 1e-5, 1, 2, 5,10), 100)

cor_sims <- do.call(rbind,  lapply(alphad, function(x) get_cor_dist(rcorrmatrix(10, x))))
cor_sims$alphad <- alphad

ggplot(cor_sims %>% gather(sign, values, -alphad),
       aes(x = alphad, y = values, color = sign)) +
  geom_point() +
  facet_wrap(~sign)

^{Created on 2019-12-10 by the reprex package (v0.3.0)}

Right? So we could put anything on that x axis. Moreover, stacking positive and negative in the same graph works.

What do you guys think of this approach?

robertbagchi commented 4 years ago

Hmm. I'd like a metric that didn't require splitting up the data to get two numbers, but maybe that is just not a possibility.

One question I have is what if there is one negative correlation and the rest are positive - don't we need to somehow aggregate them eventually?

A minor point is that we should probably reverse the axis for the negative correlations (because a decrease and an increase mean opposite things) here.

I'd still like to find a metric that described the whole damn thing, but I'm not getting anywhere.

jebyrnes commented 4 years ago

Yeah, I hear you. I don't like splitting them, either. But right now, I don't see a way.

If one is negative, and the rest are positive, it's no problem. The negative scaled metric will be divided by the square root of the # of functions, so it will be very small.

We could definitely make the axes in opposition. Which almost makes you think you should sum.... but that way is pretty problematic, as you can have the lots of positive lots of negative = 0 sum, which is not right. For that reason - it might be better to keep it all on the same sign, and just split using color or something, to keep others from thinking about doing that.

robertbagchi commented 4 years ago

OK I get that.

Are the eigen options not worth it? I need to find a chunk of time to explore further (my grades were in today so maybe now!)

robertbagchi commented 4 years ago

One more thought though - do we need to solve this to move forward with the hill numbers approach? Isn't that a separate analysis?

jebyrnes commented 4 years ago

I kept running into odd issues due to the positive/negative issue, and numbers that just didn't totally make sense to me. Play with it a bit and let me know if you think I'm wrong. Meanwhile, I, uh, have to go back to writing this final!

jebyrnes commented 4 years ago

@FabianRoger seemed to want to include both. I think it's not a terrible idea, but might not be necessary. If it's a new metrics paper, meh, why not drop both? If we agree, that is, and if Fabian's simulations (ahem) show that both are robust and useful!

robertbagchi commented 4 years ago

So plan moving forward: Try both the positive/negative and hill numbers approaches in Fabian's simulations. If both work, present both in a paper. If one works, ,not the other, we have our answer. If neither works, hmm, back to the drawing board!

jebyrnes commented 4 years ago

BINGO!

@FabianRoger, not to press, but, do you have an ETA? Would be great to submit this soon!

FabianRoger commented 4 years ago

@jebyrnes I don't understand what you are doing, tbh. It seems like you suggest a new mf metric for correlated functions? And I am not sure I follow the logic of that metric, I'll need you to explain it to me. I am sure it's great but I don't grasp it.

I'm also afraid we misunderstood each other. What I want is a possibility to use the Hill metric and include correlations as Anne Chao has suggested in her 2014 paper. Basically Phylogenetic hill diversity but instead of using a tree / cladogramm, we use a distance metric. So if the average correlation is 0, it's the same as hill diversity, and the higher the correlation the lower the effective number of functions. The hillRpackage (as written above) has an implementation for this. But... we need to decide on a distance metric (see above).

Does you solution address this? Sorry if I am being stupid.

jebyrnes commented 4 years ago

OH! Huh. Did not totally get that. Yes, what I'm suggesting is a new metric to look at how correlated functions are sensu stricto.

I think I'm not sure on what the phylogenetic hill diversity approach would produce in terms of utility to an end users. For example, at high diversity, say, if you have all functions performing at their highest levels... but they are also highly correlated, how is that worse than, say, all functions at high levels but uncorrelated? To an end-user and from a theoretical perspective, I don't see how such a metric would be useful.

I don't see that at present - I'm fully willing to admit that I'm wrong, I'd just need to see how it could be made useful. Which, heck, I think you could do that with simulations! I'll comment in that thread about moving forward.

FabianRoger commented 4 years ago

Well, after the discussion we had at the multifunc workshop in Lund it seemed pretty clear that the field was yearning for a metric that can incorportae correlations. Meyer et als PCA metric has been specifically designed to get at the number of uncorelated functions, and Manning et al also suggest to cluster functions, bundle them, and weigh each bundle equally (which goes into the direction of the phylogenetic diversity but is very sensitive to how you cluster). Note that Chao doesn't go over the dendrogram but uses the distance matrix immediately which is the way to go. And if you have an ES with 5 functions that are perfectly correlated your very likely just measuring the same thing 5 times and your system only maximizes 1 function. So a system that maximizes 5 independent functions performs higher...

jebyrnes commented 4 years ago

Like I said, try it. I think it's a fine thing if the field is yearning for a method that incorporates correlations, but I think that method will be useless if it doesn't have real applicable meaning. (This is why I prefer CFA and SEM over PCA!) So, as long as you can explain what it MEANS and why one scenario is truly different from another, I'm willing to be convinced. Having a metric that incorporates correlations just to have a metric that incorporates correlations, but not really being able to stay what that metric means - or, worse, it's explanation ends up just leading to needing to look at the two simpler metrics in the first place that we've already created - doesn't seem like a useful way to go. Metrics are only as good as they have utility and meaningful interpretation. That's what I've always struggled with in the MF field. And why the additive average is seductive by wrong - it's MEANING is the estimated level of a randomly chosen function, not some true MF metric.

FabianRoger commented 4 years ago

check out this paper

Chao A, Chiu C-H, Villéger S, Sun I-F, Thorn S, Lin Y-C, et al. An attribute-diversity approach to functional diversity, functional beta diversity, and related (dis)similarity measures. Ecological Monographs. 2019;89(2):e01343.

FabianRoger commented 4 years ago

I uploaded a script called Function_correlation.Rmd

There I describe how we can use the FD metric suggested by chao et al (see above) to incorporate function correlations into MF. Note that I don't bother multiplying by the average MF here as it is a constant.

Please let me know what you think of that. If we agree that this could work I think we might have solved it - which means that from my side we are good to go. However, I need you to read the paper and look at the simulations and think this through so we are on the same page, I don't trust myself.

@jebyrnes @robertbagchi

jebyrnes commented 4 years ago

This is really good. And plugs in with the whole framework so well. I think we're close.

My question is negative correlations. I see how/why you dealt with them as you did, but consider the following

Let's say you have two functions. Both arise from the same underlying biological process. But, there is a complete tradeoff between them.

So, they can be f1, f2 0, 1 0.5, 0.5, 1, 0

In a plot where one is performing well and the other is 0, what should the value of the effective number of functions be? And, should it be higher or lower than a plot where both are at 0.5?

I think that's the core question. How does your approach resolve this, and what do you two think it should look like? @robertbagchi @FabianRoger?

robertbagchi commented 4 years ago

Nice work @FabianRoger !

I'm still trying to get my head around this - sorry for the slowness, been thinking about very different things.

In answer to @jebyrnes I think it depends right? I mean there isn't a one-size-fits all solution because in some cases the functions can all be reduced to the same currency and in others they can't, in some cases there are minimum thresholds we want to achieve and sometimes we could see diminishing returns after a point. Seems like we want (1) a framework that can accommodate all these scenarios, (2) sensible defaults.

I'll continue to think about this and hopefully have something more sensible and considered to say soon.

jebyrnes commented 4 years ago

Right. I guess this is the question - what are the 2-3 options for this scenario using Fabian's technique above? @FabianRoger, you've outlined one where if they are negatively correlated we assume they have 0 relationship to one another. And a second where we just add 1 to everything, no? But, do these do the same thing? And/or do we want an option where you take the absolute value of all correlations? ALthough that gives a slightly different answer. Hrm. @FabianRoger, as you've thought about this more deeply, can you list the options and their implications?

And I think if we are all in agreement, fold this into the MS, as well as the sims, and we're done! -ish.

robertbagchi commented 4 years ago

OK, I spent some more time cogitating this morning. I am not sure it makes sense to use the correlation matrix instead of trait distances. I see one pro and two cons to using the correlation matrix: pro - don't need to deal with extreme values, con 1: negative values, con 2: link to the actual biology is indirect.

Part of the rationale for using Chau's index is, if I'm right, being able to use tau to truncate distances - doesn't that solve the problem more seamlessly?

robertbagchi commented 4 years ago

Hey both - I added some theoretical considerations to the github site - check it out here @jebyrnes and @FabianRoger - does this shed any light? I feel like it should but have run out of bandwidth and time today.

FabianRoger commented 4 years ago

Hi both, and thanks for having a look at this!

here are my thoughts to your comments:

@jebyrnes your point is interesting but partly unrelated from the correlation issue. Without taking the correlation into account, both cases where only one function is present gives us an effective number of functions of 1 vegan::renyi( c(0,1), scales = 1, hill = TRUE) while when both functions are present effN = 2 vegan::renyi( c(0.5,0.5), scales = 1, hill = TRUE) (which makes sense at it is completely even). If we then multiply by 0.5 (average functioning) we get a multifunctionality of 1 for the even case and 0.5 for the uneven case. Is that what we want?

For the correlations:

in a correlation matrix, -1 the absolute maximum. So it doesn't matter how we transform the cor matrix to a dist matrix. In one case we say -1 --> 0 and calculate d as 1-0 = 1 (the largest possible distance). In the other case we shift the cor matrix by +1 (ranging from 0 to 2) and the say d = 2-cor in which case -1 --> 2 (the largest possible distance). In either case taking function correlation into account wouldn't change the effN value.

If we take absolute correlations, of cause, the story is different. Then -1 is the minimum possible distance and the effective number of functions should reduce the functions to 1 uncorrelated function. However, this scenario makes only really sense if we assume that a strong (perfect) negative correlation arises from measuring the same thing twice, just on inverse scales (say measuring bacterial biomass gain and nutrient depletion during the exponential phase). I concluded that it should be the researchers responsibility to acknowledge that and put the measures on the same scale. (biomass gain and nutrient uptake).

Also note (as you mentioned before I think?) that perfect negative correlations are only possible for two functions, more than two functions cannot be perfectly linearly negatively correlated.

The real question I think is: are negatively correlated functions more dissimilar than uncorrelated? If we scale the correlation matrix from 0 (-1) to 2 (1) and set tau to -0.5 we say that -0.5 is more distant than 0 (which is true if you look at the relationship between correlation matrix and euclidean distance above).

@robertbagchi to the use of eucledian distance vs correlation matrix. I think it makes sense to look at correlation matrix because it is unrelated from the number of functions.

For scaled function values, the relationship between the euclidean distance and the correlation matrix is

(sorry for that pic, I am bad at math notation)

see here

But while the euclidean distance is unbound, the correlation matrix is not. So in the example I give in the script for 10 functions with varying correlation strength) we get this relationship if we use the correlation matrix:

while we get this if use eucledian distance (setting tau = max(d) but I don't know how to choose an absolute tau otherwise)

FabianRoger commented 4 years ago

@robertbagchi re your theoretical consideration: I tried to follow your argument but it went way over my head. Would you have time to walk me through on zoom at some point?

The only comment I have for now is that what we have been trying to do so far is come up with a metric that only takes function values as input (independent of species) which might be warranted as we might not always relate it to species and/or don't know the species contributions.

So it might be that you are onto something much more meaningful but somewhat different?

FabianRoger commented 4 years ago

ps: one last thought. In the calculations above I set tau to 0 e.g. all non-zero correlations will reduce the effective number of function. Therefore with 0 average correlations we still get less than 10 (even if the maximum effective number of species is 10) but we could also define a significance threshold and set all non-significant functions to 0 or tau to say 0.2 or so - in order to only discount for 'real' correlations between functions

robertbagchi commented 4 years ago

Happy to have a chat over zoom at some point. Indeed, it might be good if we all touched base at some point.

I still need to think about the correlation as distances thing. Something isn't fitting together in my little brain.

jebyrnes commented 3 years ago

Damn, sorry I fell off on this. Stupid semester. OK. To recap the issues and my thoughts:

1) Are negatively correlated functions more dissimilar than uncorrelated? If we scale the correlation matrix from 0 (-1) to 2 (1) and set tau to -0.5 we say that -0.5 is more distant than 0 (which is true if you look at the relationship between correlation matrix and euclidean distance above).

2) Did we find a package with code that makes sense? The thread on https://github.com/daijiang/hillR/issues/14 seems to have run out.

(sidenote - I've been reading about multivariate coefficients of variation - see http://dx.doi.org/10.1016/j.jmva.2015.08.006 - does this provide any solution here?)

If you have an answer, @FabianRoger, let's hear it and go with it - otherwise, Zoom chat in two weeks? I can move scheduling to email.

FabianRoger commented 3 years ago

Hi Jarret,

Thanks for taking this up again and sorry it was the same for me. Lets schedule a zoom call and make a plan for how to finish this. This spring (until June or so) is sort of my last chance to get this done after all - afterwards we will move and I will need to focus on new tasks. Let's move to e-mail for scheduling. I am probably much more flexible than you guys, so I let you suggest dates.

jebyrnes commented 3 years ago

Based on the zoom call, @FabianRoger has written the function we need to address correlation based on Chao et al. and has implemented it with the correlation matrix for Tau. After reviewing his simulations, we agree that this works well, achieves our goals, and is good to go.

jebyrnes commented 2 years ago

Notes from 2022-01-20 meeting

Distance matrix - this is a paragraph!!
- mechanistic based matrix - if you can - hard
- function by function matrix - distance is correlation?
- flip and squish so -1 = 1 and 1 = 0 to make into a d mat
- euclidian distance?
- PCA matrix/manning - dendrogram?
- How would a manager make a distance matrix?
- Call for more research into this!!!
- Gower distance matrix?
Tau choice thing -
- function weighted mean?
- Rao's entropy! tau = Q = dmean - but, for a cor matrix?
- for cor, if tau = 0, we are saying 0 cor and neg cor are the same things
- if we truncate all neg numbers are 0, simplifies life and adds to power of stat?
- we know a lot of correlations are noise - don't use sig as it will be dependent on size of dataset
- small correlations shouldn't alter metric much, though
- maybe we don't need tau, as we want to allow for all correlations/distances
- ignore tau, as it were?
- see three critiques in eco mono 2019 paper of concentional FD
- we don't a priori care - we want to examine IF we care
- If you've added something to the mix, it means you care about it!
- This is a feature, not a bug, of working with functions instead of abundances
- set tau to max!!! as a default - but that folk can change it....for correlation
- beyond a threshold, we consider things TRULY different - so you can set Tau
- data driven? or natural history driven? tau profile

jebyrnes / new_multifunc_metric

How to account for correlations between functions #7