vegandevs / vegan

R package for community ecologists: popular ordination methods, ecological null models & diversity analysis
https://vegandevs.github.io/vegan/
GNU General Public License v2.0

betadisper calculate distance between samples and the centroid of a different group. #606

Open joshgsmith opened 8 months ago

joshgsmith commented 8 months ago

The documentation is clear that betadisper() computes the distance between samples and their respective centroid or median while ensuring positive-definite eigenvalues. betadisper() also returns the principal coordinates of centroids, and these can be used to calculate the distances among centroids. However, I do not see any functionality to calculate the distance between samples and a centroid belonging to another group. For example, let's say we have 100 samples (we will call them 'sites'), 50 sites belonging to period == "Before" and 50 sites to period == "After". How can we determine the distances between each site belonging to period == "Before" and the centroid of period == "After"?

Where m is a distance matrix, something like disper_mat <- betadisper(m, group = group_vars2$period, type = "centroid") returns the distances between sites and their own group's centroid (in this case, one centroid for Before and one for After).

If we want the distances among the centroids, computed from their principal coordinates, we could use:

shift_dist <- reshape2::melt(as.matrix(sqrt(dist(disper_mat$centroids[, disper_mat$eig > 0, drop = FALSE])^2 - dist(disper_mat$centroids[, disper_mat$eig < 0, drop = FALSE])^2))) %>% tibble::rownames_to_column("distance")

However, shift_dist only finds the distance between the two centroids, not the distance between each sample and the centroid of a different group.

In both chunks above, only within-group quantities are calculated (distances from sites to their own group's centroid, or distances among the centroids themselves). Is it possible to calculate the distances both within groups and across groups? Specifically, the across-group component is the distance from each sample belonging to group Before to the centroid of group After.

This would be a fantastic utility, particularly when dealing with time series and ecological data to examine multivariate 'shift distance' relative to a centroid defined by a certain time period. As an example, let's say we have ecological abundance data spanning 2000-2023. We could use the centroid of years 2000-2005 to describe the 'reference' period, then examine the annual shift distances for each year of the time series to estimate how much the community changes during the reference period vs. each year after that.

jarioksa commented 8 months ago

There is no such function. However, this is R and you can always write such a function!

Here is a function that calculates distances from each sampling unit to each centroid:

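`betadistances` <-
    function(x)
{
    ## x is the result of betadisper()
    cnt <- x$centroids          # group centroids in principal coordinates
    coord <- x$vectors          # sample scores in principal coordinates
    pos <- which(x$eig >= 0)    # real axes
    neg <- which(x$eig < 0)     # imaginary axes (negative eigenvalues)
    ## squared distances from every sample to every centroid on real axes;
    ## drop = FALSE keeps matrices when only one axis is selected
    d <- apply(cnt[, pos, drop = FALSE], 1,
               function(z) rowSums(sweep(coord[, pos, drop = FALSE], 2, z)^2))
    ## subtract the squared distances on imaginary axes, if there are any
    if (length(neg))
        d <- d - apply(cnt[, neg, drop = FALSE], 1,
                       function(z) rowSums(sweep(coord[, neg, drop = FALSE], 2, z)^2))
    ## one column per centroid, one row per sampling unit
    d <- as.data.frame(sqrt(d))
    cbind("group" = x$group, d)
}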

This is a proof-of-concept implementation and may not cover all corner cases.
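For readers following the arithmetic, the quantity the function above computes can be written out as below. The symbols are introduced here for illustration and are not vegan notation: $x_{ij}$ is the score of sampling unit $i$ on axis $j$ (`x$vectors`), $c_{kj}$ is the score of the centroid of group $k$ on the same axis (`x$centroids`), and $P$ and $N$ are the sets of axes with non-negative and negative eigenvalues respectively.

```latex
d_{ik} = \sqrt{\sum_{j \in P} \left(x_{ij} - c_{kj}\right)^2
             - \sum_{j \in N} \left(x_{ij} - c_{kj}\right)^2}
```

If the second sum exceeds the first for some pair, the square root is of a negative number and R returns NaN, which is one of the corner cases the function does not handle.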

Is this the function you asked for? What do you think we should do with this? Comments @gavinsimpson

Note: vegan has a related function meandist, but it calculates mean distances among points and not distances to centroids.
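For the Before/After case in the original question, the same arithmetic can be applied to a single target centroid. The following is only a sketch under assumptions: it uses vegan's `varespec` data with a made-up split into two periods (the first 12 sites as "Before"), and the object names `period`, `target`, `shift` and `before_to_after` are hypothetical.

```r
library(vegan)

data(varespec)
dis <- vegdist(varespec)                 # Bray-Curtis, a semimetric dissimilarity
# Hypothetical grouping, for illustration only
period <- factor(rep(c("Before", "After"), each = 12),
                 levels = c("Before", "After"))
mod <- betadisper(dis, period, type = "centroid")

pos <- mod$eig >= 0                      # TRUE for real axes, FALSE for imaginary
# Row of the centroid matrix that belongs to the target group
target <- mod$centroids[match("After", levels(mod$group)), ]
# Squared differences between every site and the target centroid, axis by axis
sq <- sweep(mod$vectors, 2, target)^2
# Real part minus imaginary part, as in betadistances() above
d2 <- rowSums(sq[, pos, drop = FALSE]) - rowSums(sq[, !pos, drop = FALSE])
shift <- sqrt(d2)                        # NaN if a squared distance is negative
# Distances from each "Before" site to the "After" centroid
before_to_after <- shift[period == "Before"]
```

For the sites that belong to the target group, `shift` should agree with `mod$distances`, which gives a quick sanity check of the calculation.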

joshgsmith commented 8 months ago

@jarioksa this is very nice! I'm not sure I fully understand how the distances are calculated without calling dist() in that function, but I will apply it to my actual data today to test its functionality.

I was toying with something like:

shift_dist <- sqrt(dist(x$vectors[,x$eig>0]^2, x$centroids[,x$eig>0]^2)- dist(x$vectors[,x$eig<0]^2, x$centroids[,x$eig<0]^2))

which doesn't seem to produce the same distances as the function you provided.

The betadistances function appears to work very well, and using usedist::dist_setNames() on the original distance matrix helps to keep track of the sample names through betadisper and betadistances.
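On the question of how the distances arise without a call to dist(): a Euclidean distance is the square root of a sum of squared coordinate differences, and `sweep()` followed by `rowSums()` computes exactly that sum for every point against one centroid. A small base R sketch with made-up coordinates (all names here are illustrative, not taken from vegan) compares the two routes:

```r
set.seed(1)
coord <- matrix(rnorm(12), nrow = 6)            # 6 points on 2 real axes
grp <- factor(rep(c("Before", "After"), each = 3))
# Group centroids: one row per group, one column per axis
cnt <- apply(coord, 2, function(col) tapply(col, grp, mean))

# Route 1: subtract a centroid from every point, square, sum, take the root
d <- apply(cnt, 1, function(z) sqrt(rowSums(sweep(coord, 2, z)^2)))

# Route 2: stack centroids on top of the points and let dist() do the work
ref <- as.matrix(dist(rbind(cnt, coord)))[-(1:2), 1:2]

same <- all.equal(unname(d), unname(ref))
```

Note that `dist()` takes a single matrix and returns distances among its rows; its second argument is the distance method, so it cannot be handed a second matrix of centroids.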

jarioksa commented 8 months ago

@joshgsmith your way will not work. I saw you cross-posted to StackOverflow; that is not a good way to collect answers. The usedist package suggested on StackOverflow won't work with semimetric dissimilarities (such as Jaccard, Bray-Curtis etc.). This is documented in usedist (but naturally, the developer may change that later). The method suggested above will also work with semimetric dissimilarities (non-semidefinite matrices).
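Why subtracting the negative-eigenvalue axes matters can be seen on a toy example. The sketch below is base R only and is not vegan code: it builds a hypothetical 3 x 3 dissimilarity matrix that violates the triangle inequality, runs a principal coordinates analysis by hand, and checks that the original dissimilarities are recovered only when the imaginary axis is subtracted.

```r
# Toy dissimilarities with d(1,3) > d(1,2) + d(2,3)
D <- matrix(c(0, 1, 3,
              1, 0, 1,
              3, 1, 0), nrow = 3, byrow = TRUE)
n <- nrow(D)
J <- diag(n) - matrix(1 / n, n, n)        # centring matrix
B <- -0.5 * J %*% D^2 %*% J               # Gower double-centring
e <- eigen(B, symmetric = TRUE)

keep <- abs(e$values) > 1e-8              # drop the (numerically) zero eigenvalue
lambda <- e$values[keep]
# Scale eigenvectors by sqrt(|lambda|) to get principal coordinates
vec <- e$vectors[, keep, drop = FALSE] %*% diag(sqrt(abs(lambda)), nrow = sum(keep))

pos <- lambda > 0
real <- as.matrix(dist(vec[, pos, drop = FALSE]))^2    # squared distances, real axes
imag <- as.matrix(dist(vec[, !pos, drop = FALSE]))^2   # squared distances, imaginary axes

has_negative <- any(lambda < 0)
recovered <- all.equal(sqrt(real - imag), D, check.attributes = FALSE)
```

Using the real axes alone would overestimate every dissimilarity in this example, because the contribution of the negative eigenvalue is never removed.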