lucarraro / eDITH-SamplingStrategy

0 stars 0 forks source link

Fitting the eDITH model #1

Open adrian-v-m opened 2 years ago

adrian-v-m commented 2 years ago

Hello,

I am exploring the eDITH model, and I followed the simulation exercise described in the 'How to design optimal eDNA sampling strategies for biomonitoring in river networks' paper where you simulate the OCN (with create_OCN function) which can then be used to get a simulated DNA production rate p_sim.

However, I'm trying to understand how to fit the eDITH model on observed data to predict the DNA concentration within the OCN. Just like in the case of the simulated OCN and p_sim, you can use the eDITH functions eval_conc and eDITH to get the DNA concentration, and my understanding is that you need to use the observed OCN (potentially derived from a DEM or using local information about the hydrology and geomorphology) and the observed DNA production rate p_obs as arguments. Is my understanding correct? Or how would you use the DNA sequence counts data with the eDITH model?

And one last question. I was looking at step 4 in Figure 2 from the paper I mentioned above which states that the eDITH model can be fitted on the observed DNA concentration C_obs to get the modelled DNA production rate p_mod. Is this just as simple as extracting the unknown DNA production rate variable from equation 1 in the paper?

lucarraro commented 2 years ago

Hello,

In the exercise of the manuscript, the idea was to generate possible maps of eDNA production rate p_sim, produce eDNA concentrations resulting from these maps (via function eval_conc), and then fit the eDITH model on a subsample of these concentrations to try and see to which extent the eDNA production rates predicted by the model p_mod resemble the solution p_sim (see Fig. 2 of the manuscript).

If you want to fit eDITH to observed eDNA data, you need to maximize a likelihood function (i.e., find the p_mod and tau that produce predicted eDNA concentrations C_mod that are closest to the observed data at the sampling sites). In the code, such likelihood is evaluated via functions eval_loglik and eval_loglik_taufixed, and the parameters leading to maximum likelihood are sought via function optimParallel (lines 114-126 of perform_simulations.R).

Note that in general you cannot extract p from Eq. (1) given some observed C, as the number of sampled sites is typically much lower than the number of reaches into which the river network is partitioned (although the latter is not fixed, and depends on how the river network is discretized).

Currently, functions eDITH and eval_conc work only on OCN objects, hence the model cannot be applied as is to a real catchment with real eDNA data. This is something I'm actually working on.

Let me know if you need further explanations.

adrian-v-m commented 2 years ago

Hello,

Thanks a lot for your reply and explanations. It's becoming clearer now. I still have some more points I would like to bring forward.

I think I understand how you get p_sim, but it’s still not clear to me how you get from C_obs to p_mod via eDITH fitting to C_obs (step 4 in Figure 2).

Overall, my understanding is that, currently, I would need observed DNA production rate (p_obs) that I could potentially get from the lab so that I can fit eDITH to it and get a continuous raster map of modelled DNA concentration (C_mod). Also, to interpolate between p_obs sample points (get the raster map of modelled DNA production rate, p_mod), I would need to fit a model with an exponential link (probably a GLM) that links p_obs with some environmental covariates (it's Eq. (2) from the 'Environmental DNA allows upscaling spatial patterns of biodiversity in freshwater ecosystems' paper). Then I can use p_mod as an argument in the eval_conc function to get C_mod. Is my understanding correct?

Regarding applying eDITH to a real catchment, I think one of the main challenges is deriving the catchment and river network from a remote sensing source (which can probably be based on a digital elevation model) and then converting that into an OCN.

Thanks again.

Best, Adrian

lucarraro commented 2 years ago

Hello,

I would need observed DNA production rate (p_obs) that I could potentially get from the lab

In theory you can estimate these from the lab (say, how many moles of eDNA a given mass of taxon sheds in a tank per unit time for a given environmental condition (e.g., temperature)). The thing is, you don't know what is the taxon biomass at a river reach, so just knowing these shedding rates from the lab doesn't help much. Actually, the goal of the eDITH model is to estimate maps of p, which can be assumed to be proportional to the taxon density at a reach. If one knows the shedding rates from the lab, one can then convert maps of p (i.e., relative density) into maps of taxon absolute density.

so that I can fit eDITH to it and get a continuous raster map of modelled DNA concentration (C_mod).

To pass from p to C you don't fit the eDITH model, you just apply it: just use Eq. (1) as it's written, all the known terms are on the right-hand side and you obtain the unknown C.

to interpolate between p_obs sample points (get the raster map of modelled DNA production rate, p_mod), I would need to fit a model with an exponential link (probably a GLM) that links p_obs with some environmental covariates (it's Eq. (2) from the 'Environmental DNA allows upscaling spatial patterns of biodiversity in freshwater ecosystems' paper).

In a general case, we don't observe p_obs. This would be the real spatial distribution of taxon density, which is what we actually want to find. The GLM (Eq. (2)) could be used because:

  1. It reduces the number of unknown parameters to estimate (from p_obs values at any river reach, to the effect sizes of a number of covariates plus a baseline production rate p_0)
  2. It allows drawing some ecological conclusions (if e.g. a given taxon's density is found to be significantly correlated with forest cover)

What we actually observe at the sample points is C_obs. Since we are interested in p (i.e., spatial distribution of taxon density), not C (spatial distribution of eDNA concentration), we try to find what is the map of p that better explains the observed C. This is what the model fitting procedure does.

Then I can use p_mod as an argument in the eval_conc function to get C_mod.

It is the other way around (unless you are interested in assessing patterns of C for some reason).

Cheers, Luca

adrian-v-m commented 2 years ago

Hello again Luca,

Thank you again for responses. I understand that the goal is to estimate maps of DNA production rate, p, and how to pass from p to C using the eval_conc function which uses the eDITH function that incorporates Eq. (1) from the paper.

Since we are interested in p (i.e., spatial distribution of taxon density), not C (spatial distribution of eDNA concentration), we try to find what is the map of p that better explains the observed C. This is what the model fitting procedure does.

What I still don't understand is how to estimate/model p given we have sample points of C_obs. This is because the eDITH functions rather use p as an argument, so p has to be known or estimated from the drainage area extracted from the OCN ( Eq. (3)). I'm surely missing something, but how do we get from C_obs to estimated/model p?

Thanks.

Best, Adrian

lucarraro commented 2 years ago

You have to consider p and tau (or p_0, beta and tau in the NatComms paper) as unknown parameters, whose values are estimated so that the respective C_mod (i.e. produced by applying eDITH given those parameter values) is as close as possible to C_obs at the sampling sites.

In the Env DNA paper, this is done via an optimization algorithm (function optimParallel) that maximizes a likelihood function (i.e., roughly speaking, a function that assesses how far C_mod are with respect to C_obs at the sampling sites--the likelihood reaches its maximum when C_mod=C_obs at the sampling sites).

In the NatComms paper, this is done via a Bayesian approach. Also in this case a likelihood function (Eq. (4)) is maximized via an Adaptive Metropolis algorithm. Here there is an additional complication in that the compared modelled and observed quantities are read numbers (N), not concentrations, but the principle is the same.

Hope this is clearer now.

Cheers, Luca

adrian-v-m commented 2 years ago

Sorry for my delayed response. I was away for a few days. I very much appreciate your response. It's now clear for me how the model fitting (step 4 in Figure 2) is done using the optimParallel function (lines 114-126 of perform_simulations.R). This allows one to fit the eDITH to observed DNA data, and eventually obtain the modelled DNA production rate, p_mod, which is par vector in the results_all list (line 129 of perform_simulations.R). Am I correct?

The observed DNA data that can be used to fit eDITH is the DNA concentration at the sampling point (C_obs) or the DNA sequence reads (that can come from OTU tables, for example), so if I want to use the latter, is this just as simple as replacing C_obs with DNA sequence reads (N) in the optimParallel function? Since N = k C_obs or p'_0 = k p_0, I'm not sure how k is treated in the optimParallel function.

Thanks.

Best, Adrian

lucarraro commented 2 years ago

This allows one to fit the eDITH to observed DNA data, and eventually obtain the modelled DNA production rate, p_mod, which is par vector in the results_all list (line 129 of perform_simulations.R). Am I correct?

Yes, you are. More specifically, in this implementation, when tau is not fixed, the first component of par is the predicted value of tau, and the other components correspond to p_mod.

so if I want to use the latter, is this just as simple as replacing C_obs with DNA sequence reads (N) in the optimParallel function? Since N = k C_obs or p'_0 = k p_0, I'm not sure how k is treated in the optimParallel function.

In the NatComms2020 paper, I dealt with read numbers, and I hypothesized the expected value of read numbers to be proportional to the eDNA concentration (hence N = k C). However, k is not an additional free parameter, as its effects on the other parameters is just multiplicative, as it is the case for p_0 in that application. Thus I used p'_0 = k p_0 as free parameter. In the optimParallel function, nothing would change: the resulting par vector would be constituted by p' instead of p.

If you need further assistance for a specific application, don't hesitate to contact me at luca(dot)carraro(at)eawag(dot)ch.

Cheers, Luca

adrian-v-m commented 2 years ago

Thanks again for your prompt response.

I am basically trying to run an experiment with eDITH using an OTU (Operational Taxonomic Unit) dataset where there are multiple OTU IDs for each sample point. The OTU IDs are the columns containing sequence reads and the samples are the rows, and each sample has been collected at a given location (lat/long) along the river stretch. Each OTU ID can be identified at the species level, so I am planning to use the sequence reads (N) from each OTU ID (or from several OTU IDs) as input in the optimParallel function. More specifically, the argument for this input is ConcAG in the optimParallel function, so I would need to replace ConcAG=C_obs with ConcAG=N. Is this correct?

And since the optimParallel function also uses the OCN as an argument, my challenge would be to create an OCN object for the river catchment area where the samples were collected. Do you think I can create an OCN for the river network that I can derive from a DEM for that particular catchment area (e.g., using QGIS, etc.)? Or are you aware of any other method to approximate the OCN object for a real river catchment area?

Best, Adrian

lucarraro commented 2 years ago

More specifically, the argument for this input is ConcAG in the optimParallel function, so I would need to replace ConcAG=C_obs with ConcAG=N. Is this correct?

Yes, it is.

Do you think I can create an OCN for the river network that I can derive from a DEM for that particular catchment area (e.g., using QGIS, etc.)? Or are you aware of any other method to approximate the OCN object for a real river catchment area?

I do have an R code (under development) that extracts a river network from a DEM and produces an object that is equivalent to an OCN, so that the eDITH model can be used therein. You can contact me in private for further information.