hmsc-r / HMSC

GNU General Public License v3.0
102 stars 37 forks source link

Creating a spatial random level from a distance matrix #162

Open LukianAdams opened 1 year ago

LukianAdams commented 1 year ago

Hi all, I am currently learning Hmsc for a modelling study I am working on, and was looking for some assistance. The aim of my study is to model the impacts of dams and weirs on fish communities in rivers, using a large dataset of fish abundances sampled from Australian rivers. I have a nested design, with each river containing multiple sites, most of which have been sampled multiple times. The approach I am taking is to assign each sample site a number of barriers downstream, to see if there are discrete "jumps" in community composition at barriers as you move upstream, which would indicate a degree of fragmentation. I also want to see which species are most affected by these jumps.

At the moment I am using a simple model with "n_bars" (number of downstream barriers) as my only fixed effect, and a random effect at the sampling level to account for co-variation. However, I would also like to include a spatial random effect to my model, to control for the fact that distance along the river will naturally lead to changes in fish communities. While I have X and Y coordinates for my sites, simply including these would not be helpful, as sites are separated by their distance along the course of the river, not just their distance in space. Hence, I used the "riverdist" package to find the distance of each site from the river mouth, and then used that to create a matrix of distances between each site. Since distances between sites from different rivers wouldn't make sense, I am doing a separate model for each river.

According to the HmscRandomLevel documentation, it will accept a dist object with location labels, which is what I have put into it. This seems to work fine, and distMat in my HmscRandomLevel object looks like this, with header and row names refering to site IDs (I have 78 sites in total, this is just the first 8):

image

I have included these site IDs in my studyDesign, and in my ranLevels object the "spatial" list contains both the distance matrix, and a column of site IDs:

image

However, when I run sampleMcmc I get error:

Error in computeDataParameters(hM) : spatial distance matrix is non-metric or has duplicated points

I have been playing around with this for a couple days trying to format my distance matrix differently to fix this issue, but I can't figure out how to get it to work. I suspect I might need to get rid of the top triangle and the diagonal of the distance matrix, but these aren't actually in the dist object I input into HmscRandomLevel, and seem to be created in this function. I can't seem to get rid of them.

Here is my code in case this provides any clues:

#Creating a dataframe of distances from the river mouth for each SiteID in the Lachlan River

df_distance<- lach_env %>% group_by(SiteID) %>% summarise(Distance = first(Distance))

#Creating a matrix of differences in distance between each SiteID, to find linear distances between sites

dist_mat <- outer(df_distance$Distance, df_distance$Distance, FUN = function(x, y) abs(x - y)) colnames(dist_mat) <- df_distance$SiteID rownames(dist_mat) <- df_distance$SiteID

#Turning matrix into a distance object

dist_mat <- as.dist(dist_mat, diag = F, upper = F)

#Defining study design and structuring random factors

studyDesign <- as.data.frame(lach_env[,c("SiteID", "SampleID")]) %>% mutate_at(c("SiteID", "SampleID"), as.factor) %>% rename(spatial = SiteID, sample = SampleID) %>% droplevels()

rLsample <- HmscRandomLevel(units = studyDesign$sample) rLsite <- HmscRandomLevel(units = levels(studyDesign$site)) rLspatial = HmscRandomLevel(distMat = dist_mat, sMethod = "Full") ranLevels <- list(sample = rLsample, spatial = rLspatial)

#Creating and running Hmsc model

XFormula = ~ n_bars

fit_hmsc_lach <- Hmsc(Y=Y, XData = X, XFormula = XFormula, studyDesign = studyDesign, #Tr = Tr, #TrFormula = TrFormula, ranLevels = ranLevels, distr = "lognormal poisson")

nChains = 4 thin=1 nSamples=100 transient=10 verbose=TRUE

mod_hmsc_lach = sampleMcmc(fit_hmsc_lach, samples = nSamples, thin = thin, transient = transient, nChains = nChains, nParallel = nChains, verbose = verbose)

Apologies for the long post. Any help here would be greatly appreciated, cheers!

jarioksa commented 1 year ago

Hmsc really can accept input distances. However, those distances must be metric. One critical condition is that distance d() must satisfy d(A,B) ≤ d(A,C) + d(C,B), or that the shortest route between points A and B is a straight line, and you cannot shortcut this with going via point C. This is often known as the triangle condition. Your code is really obscure (and non-reproducible) but it seems that you are using Manhattan style distances abs(x-y) which do not satisfy this condition, and this is detected in the code and you get the informative error message that says that your distances are non-metric (like they are). It is not that you failed to format your data, but your distances are incompatible with the mathematics we use in Hmsc and they cannot be used.

jarioksa commented 1 year ago

I had a closer look at your code. It should work, or you should get valid distances with that code – provided your data (Distance) are one-dimensional. However, there is still another facet in the error message: do you have duplicated points? In distance matrices, duplicate points are the same as zero distances (off diagonal).

If you have one-dimensional data (locations on one dimension), you can define your spatial location with that location parameter (Distance) as sData instead of calculating distances, with similar conditions: duplicated locations cannot be handled.

LukianAdams commented 1 year ago

Thank you very much for your help Jari. Sorry for the lack of clarity with my code, this is my first time posting a question on Github, and looking back I should have included code to simulate my data to make it reproducible. Will keep this in mind for next time.

I was just working through your first reply, and found that the distances are indeed metric. Essentially what I have done is made both X and Y distance, and then plotted all my points along a diagonal line. Hence d(A,C) = d(A,B) + d(B,C) in all cases. This made me have another look at the data itself, and I discovered a duplicate point. I thought I had sorted this out by only including each SiteID once, but this data was not collected by me and I didn't realise that one of the locations was included under two different SiteIDs. Seems obvious looking back, but I got so caught up trying to figure out why my distance matrix was non-metric that I neglected to check that issue.

Good to know that I can simply use one-dimensional data with sData. That is much simpler than making a distance matrix, and I imagine that will save me a lot of headaches down the road.

Thanks again for your help, and for your contribution to this great package. It really is impressive what it can do. Cheers!

brendanf commented 1 year ago

Great that you've figured out how to get the single-river model to work! If you expect that the spatial effect should be the same (i.e., the same scale and magnitude) for every river, then there is also a way to do every river in the same model, without having any spatial dependence between sites on different rivers, which I learned at the Hmsc course last year.

The "trick" is to give a large spatial offset to points which are in different rivers; in your case, where the maximum true distances are on the order of 1000km, you could use 10000 as the offset between rivers. You could add this to your Distance coordinate, or supply 2-dimensional distances where x is your Distance, and y is 100000*RiverID. (As @jarioksa said, you don't need to calculate the distance matrix yourself in this case, just use Distance directly as a coordinate. The situation where you need that for rivers is if you want to calculate river-distance between points on different tributaries within the same river system.) Then, you have to set a custom prior for the scale parameter of the random effect, which constrains it to be within the range of your within-river distances. Because the between-river distances are at least 10× this much, this will result in a model which considers points on different rivers to be spatially uncorrelated.

# Here I assume you have 'env' as a data.frame containing `Distance` from river mouth, `SiteID` and `RiverID`;
# where `RiverID` is an integer
 df_distance <- env %>% 
  group_by(SiteID) %>%
  summarise(Distance = first(Distance), RiverOffset = max(Distance)*10*RiverID) %>%
  tibble::column_to_rownames("SiteID")

# set up the other parts of your model in the same way as before
...

# The spatial part

rLspatial <- HmscRandomLevel(sData = df_distance, sMethod = "Full")
rLspatial <- setPriors(
  rLspatial,
  alphapw = matrix(
    c(
      # first column: distances to consider
      (0:100)/100 * max(df_distance$Distance), # 0, plus 100 distances up to the max upstream
      # second column: prior probability
      0.5, # prior probability on 0 (i.e. no effect)
      rep(0.005, 100) # uniform prior on other distances
    ),
    ncol = 2
  )
)
LukianAdams commented 1 year ago

@brendanf Thanks for this! Running multiple rivers in one model would certainly be very useful for my study, as it allows me to compare the impacts of barriers on different rivers. I was wondering about this exact problem, and thinking I would have to find another way to control for the spatial effect in a multi-river model, e.g. using elevation as a covariate as a brute-force approach. I will definitely give what you have described a go :)

brendanf commented 1 year ago

Actually, I realize that my RiverOffset calculation is wrong because the max(Distance) was calculated within groups in the summarise(); depending on exactly how the max distances line up, this could result in the offsets of two different rivers being close to the same. This is better:

df_distance <- env %>%
  ungroup() %>% # just in case
  mutate(RiverOffset = max(Distance)*10*RiverID) %>%
  group_by(SiteID, RiverOffset) %>%
  summarise(Distance = first(Distance)) %>%
  tibble::column_to_rownames("SiteID")