Open adrian-v-m opened 3 years ago
Hello,
In the exercise of the manuscript, the idea was to generate possible maps of eDNA production rate p_sim
, produce eDNA concentrations resulting from these maps (via function eval_conc
), and then fit the eDITH model on a subsample of these concentrations to try and see to which extent the eDNA production rates predicted by the model p_mod
resemble the solution p_sim
(see Fig. 2 of the manuscript).
If you want to fit eDITH to observed eDNA data, you need to maximize a likelihood function (i.e., find the p_mod
and tau
that produce predicted eDNA concentrations C_mod
that are closest to the observed data at the sampling sites). In the code, such likelihood is evaluated via functions eval_loglik
and eval_loglik_taufixed
, and the parameters leading to maximum likelihood are sought via function optimParallel
(lines 114-126 of perform_simulations.R
).
Note that in general you cannot extract p
from Eq. (1) given some observed C
, as the number of sampled sites is typically much lower than the number of reaches into which the river network is partitioned (although the latter is not fixed, and depends on how the river network is discretized).
Currently, functions eDITH
and eval_conc
work only on OCN objects, hence the model cannot be applied as is to a real catchment with real eDNA data. This is something I'm actually working on.
Let me know if you need further explanations.
Hello,
Thanks a lot for your reply and explanations. It's becoming clearer now. I still have some more points I would like to bring forward.
I think I understand how you get p_sim
, but it’s still not clear to me how you get from C_obs
to p_mod
via eDITH fitting to C_obs
(step 4 in Figure 2).
Overall, my understanding is that, currently, I would need observed DNA production rate (p_obs
) that I could potentially get from the lab so that I can fit eDITH to it and get a continuous raster map of modelled DNA concentration (C_mod
). Also, to interpolate between p_obs
sample points (get the raster map of modelled DNA production rate, p_mod
), I would need to fit a model with an exponential link (probably a GLM) that links p_obs
with some environmental covariates (it's Eq. (2) from the 'Environmental DNA allows upscaling spatial patterns of biodiversity in freshwater ecosystems' paper). Then I can use p_mod
as an argument in the eval_conc
function to get C_mod
. Is my understanding correct?
Regarding applying eDITH to a real catchment, I think one of the main challenges is deriving the catchment and river network from a remote sensing source (which can probably be based on a digital elevation model) and then converting that into an OCN.
Thanks again.
Best, Adrian
Hello,
I would need observed DNA production rate (p_obs) that I could potentially get from the lab
In theory you can estimate these from the lab (say, how many moles of eDNA a given mass of taxon sheds in a tank per unit time for a given environmental condition (e.g., temperature)). The thing is, you don't know what is the taxon biomass at a river reach, so just knowing these shedding rates from the lab doesn't help much. Actually, the goal of the eDITH model is to estimate maps of p
, which can be assumed to be proportional to the taxon density at a reach. If one knows the shedding rates from the lab, one can then convert maps of p
(i.e., relative density) into maps of taxon absolute density.
so that I can fit eDITH to it and get a continuous raster map of modelled DNA concentration (C_mod).
To pass from p
to C
you don't fit the eDITH model, you just apply it: just use Eq. (1) as it's written, all the known terms are on the right-hand side and you obtain the unknown C
.
to interpolate between p_obs sample points (get the raster map of modelled DNA production rate, p_mod), I would need to fit a model with an exponential link (probably a GLM) that links p_obs with some environmental covariates (it's Eq. (2) from the 'Environmental DNA allows upscaling spatial patterns of biodiversity in freshwater ecosystems' paper).
In a general case, we don't observe p_obs
. This would be the real spatial distribution of taxon density, which is what we actually want to find. The GLM (Eq. (2)) could be used because:
p_obs
values at any river reach, to the effect sizes of a number of covariates plus a baseline production rate p_0
)What we actually observe at the sample points is C_obs
. Since we are interested in p
(i.e., spatial distribution of taxon density), not C
(spatial distribution of eDNA concentration), we try to find what is the map of p
that better explains the observed C
. This is what the model fitting procedure does.
Then I can use p_mod as an argument in the eval_conc function to get C_mod.
It is the other way around (unless you are interested in assessing patterns of C
for some reason).
Cheers, Luca
Hello again Luca,
Thank you again for responses. I understand that the goal is to estimate maps of DNA production rate, p
, and how to pass from p
to C
using the eval_conc
function which uses the eDITH
function that incorporates Eq. (1) from the paper.
Since we are interested in p (i.e., spatial distribution of taxon density), not C (spatial distribution of eDNA concentration), we try to find what is the map of p that better explains the observed C. This is what the model fitting procedure does.
What I still don't understand is how to estimate/model p
given we have sample points of C_obs
. This is because the eDITH functions rather use p
as an argument, so p
has to be known or estimated from the drainage area extracted from the OCN ( Eq. (3)). I'm surely missing something, but how do we get from C_obs
to estimated/model p
?
Thanks.
Best, Adrian
You have to consider p
and tau
(or p_0
, beta
and tau
in the NatComms paper) as unknown parameters, whose values are estimated so that the respective C_mod
(i.e. produced by applying eDITH given those parameter values) is as close as possible to C_obs
at the sampling sites.
In the Env DNA paper, this is done via an optimization algorithm (function optimParallel
) that maximizes a likelihood function (i.e., roughly speaking, a function that assesses how far C_mod
are with respect to C_obs
at the sampling sites--the likelihood reaches its maximum when C_mod=C_obs
at the sampling sites).
In the NatComms paper, this is done via a Bayesian approach. Also in this case a likelihood function (Eq. (4)) is maximized via an Adaptive Metropolis algorithm. Here there is an additional complication in that the compared modelled and observed quantities are read numbers (N
), not concentrations, but the principle is the same.
Hope this is clearer now.
Cheers, Luca
Sorry for my delayed response. I was away for a few days. I very much appreciate your response. It's now clear for me how the model fitting (step 4 in Figure 2) is done using the optimParallel
function (lines 114-126 of perform_simulations.R
). This allows one to fit the eDITH to observed DNA data, and eventually obtain the modelled DNA production rate, p_mod
, which is par
vector in the results_all
list (line 129 of perform_simulations.R
). Am I correct?
The observed DNA data that can be used to fit eDITH is the DNA concentration at the sampling point (C_obs
) or the DNA sequence reads (that can come from OTU tables, for example), so if I want to use the latter, is this just as simple as replacing C_obs
with DNA sequence reads (N
) in the optimParallel
function? Since N = k C_obs
or p'_0 = k p_0
, I'm not sure how k
is treated in the optimParallel
function.
Thanks.
Best, Adrian
This allows one to fit the eDITH to observed DNA data, and eventually obtain the modelled DNA production rate,
p_mod
, which ispar
vector in theresults_all
list (line 129 ofperform_simulations.R
). Am I correct?
Yes, you are. More specifically, in this implementation, when tau
is not fixed, the first component of par
is the predicted value of tau
, and the other components correspond to p_mod
.
so if I want to use the latter, is this just as simple as replacing
C_obs
with DNA sequence reads (N
) in theoptimParallel
function? SinceN = k C_obs
orp'_0 = k p_0
, I'm not sure howk
is treated in theoptimParallel
function.
In the NatComms2020 paper, I dealt with read numbers, and I hypothesized the expected value of read numbers to be proportional to the eDNA concentration (hence N = k C
). However, k
is not an additional free parameter, as its effects on the other parameters is just multiplicative, as it is the case for p_0
in that application. Thus I used p'_0 = k p_0
as free parameter. In the optimParallel
function, nothing would change: the resulting par
vector would be constituted by p'
instead of p
.
If you need further assistance for a specific application, don't hesitate to contact me at luca(dot)carraro(at)eawag(dot)ch.
Cheers, Luca
Thanks again for your prompt response.
I am basically trying to run an experiment with eDITH using an OTU (Operational Taxonomic Unit) dataset where there are multiple OTU IDs for each sample point. The OTU IDs are the columns containing sequence reads and the samples are the rows, and each sample has been collected at a given location (lat/long) along the river stretch. Each OTU ID can be identified at the species level, so I am planning to use the sequence reads (N
) from each OTU ID (or from several OTU IDs) as input in the optimParallel
function. More specifically, the argument for this input is ConcAG
in the optimParallel
function, so I would need to replace ConcAG=C_obs
with ConcAG=N
. Is this correct?
And since the optimParallel
function also uses the OCN as an argument, my challenge would be to create an OCN object for the river catchment area where the samples were collected. Do you think I can create an OCN for the river network that I can derive from a DEM for that particular catchment area (e.g., using QGIS, etc.)? Or are you aware of any other method to approximate the OCN object for a real river catchment area?
Best, Adrian
More specifically, the argument for this input is
ConcAG
in theoptimParallel
function, so I would need to replaceConcAG=C_obs
withConcAG=N
. Is this correct?
Yes, it is.
Do you think I can create an OCN for the river network that I can derive from a DEM for that particular catchment area (e.g., using QGIS, etc.)? Or are you aware of any other method to approximate the OCN object for a real river catchment area?
I do have an R code (under development) that extracts a river network from a DEM and produces an object that is equivalent to an OCN, so that the eDITH model can be used therein. You can contact me in private for further information.
Hello,
I am exploring the eDITH model, and I followed the simulation exercise described in the 'How to design optimal eDNA sampling strategies for biomonitoring in river networks' paper where you simulate the OCN (with
create_OCN
function) which can then be used to get a simulated DNA production ratep_sim
.However, I'm trying to understand how to fit the eDITH model on observed data to predict the DNA concentration within the OCN. Just like in the case of the simulated OCN and
p_sim
, you can use the eDITH functionseval_conc
andeDITH
to get the DNA concentration, and my understanding is that you need to use the observed OCN (potentially derived from a DEM or using local information about the hydrology and geomorphology) and the observed DNA production ratep_obs
as arguments. Is my understanding correct? Or how would you use the DNA sequence counts data with the eDITH model?And one last question. I was looking at step 4 in Figure 2 from the paper I mentioned above which states that the eDITH model can be fitted on the observed DNA concentration
C_obs
to get the modelled DNA production ratep_mod
. Is this just as simple as extracting the unknown DNA production rate variable from equation 1 in the paper?