traitecoevo / mortality_bci

Tropical tree mortality model using Barro Colorado Island as a case study

Add term for growth measurement error? #46

Closed - dfalster closed 9 years ago

dfalster commented 9 years ago

The BCI team has estimates of how large the error in dbh is, based on repeat measurements.

jscamac commented 9 years ago

I've thought a little more about this. It should be fairly straightforward to implement, but it will assume the error does not vary with size or species.

@dfalster: Where is this repeated measure data?

dfalster commented 9 years ago

@jscamac See Ruger et al 2011 (DOI:10.1371/journal.pone.0025330) - they report estimates for the error distribution and have already implemented an observation model for dbh growth in a Bayesian framework.

@MarcoDVisser reports that a normal distribution doesn't fit the data well because there is a spike of measurements at exactly zero. For that reason he has proposed a zero-inflated Laplace (ZIL) distribution. However, there isn't a published description of the ZIL available yet, and a non-standard distribution may not play well with your statistical analysis.
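For intuition, here is a minimal R sketch of one generic reading of a zero-inflated Laplace: a point mass at zero mixed with a Laplace distribution. This is an assumption on my part, not Visser's actual specification.

# With probability p return exactly zero, otherwise draw from Laplace(mu, b)
rzil <- function(n, p, mu = 0, b = 1) {
  laplace <- mu - b * sign(runif(n) - 0.5) * log(runif(n))  # Laplace(mu, b) draws
  ifelse(runif(n) < p, 0, laplace)                          # inflate at zero
}

hist(rzil(1e4, p = 0.3), breaks = 100)  # spike at zero over heavy-tailed errors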


jscamac commented 9 years ago

I'm following the Ruger et al 2011 approach, in which the differences between measured dbh and true dbh are fitted as a sum of two normal distributions. The first normal distribution describes small errors that are proportional to the dbh of the tree. The second is independent of tree dbh and describes larger errors, e.g. due to misplaced decimal points or recording the dbh of the wrong tree.
I've had to make slight changes because, unlike Ruger et al, we are not modelling growth as a function of covariates. Instead I model predicted growth as a function of the grand mean and its associated standard deviation; see below.

log_pred ~ normal(mu, sigma)
  where mu gets Stan's default improper prior and sigma ~ cauchy(0, 2.5)T[0,] (half-Cauchy)

true_gr ~ lognormal(log_pred, sigma_sp)
  where sigma_sp varies by species and is sampled from sigma_sp[s] ~ cauchy(0, 2.5)T[0,]

The observed growth (which contains negative values) is modelled as a mixture of two normal distributions: the first describes the size-dependent error, the second the size-independent error.

obs_gr[i] ~ (1 - f) x normal(true_gr[i], (sda + sdb * dbh[i]) / census_length[i]) + f x normal(true_gr[i], SD2 / census_length[i])

f, sda, sdb and SD2 are taken directly from Ruger et al (rescaled to metres) and are based on data collected explicitly to examine measurement error in dbh/growth.

The sigmas of each normal distribution are adjusted by census_length, the time period elapsed between the two dbh measurements of the tree.
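As an illustration, here is a minimal R sketch of this observation model, simulating observed growth from true growth. The parameter values loosely follow those used in the simulation code later in this thread, and all variable names are illustrative.

n <- 1000
census_length <- rep(5, n)              # years between censuses
dbh <- rlnorm(n, 3.4, 0.9)              # mock dbh
true_gr <- rlnorm(n, 0, 0.5)            # mock true annual growth
f <- 0.028; sda <- 0.927; sdb <- 0.0038; SD2 <- 25.6

big_error <- runif(n) < f               # which observations get a large error
obs_sd <- ifelse(big_error, SD2, sda + sdb * dbh) / census_length
obs_gr <- rnorm(n, true_gr, obs_sd)     # observed growth; can be negative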

The model appears to be working fine, and obs vs pred plots look pretty good.

jscamac commented 9 years ago

Damn, spoke too soon. I've just realised that Ruger's model contains fixed parameters that are specific to dbh_dt growth. I've sent an email to see if I can get hold of the raw data (or parameter estimates for the other types of growth).

jscamac commented 9 years ago

OK, I have managed to get hold of the remeasurement data. It is actually available here:

https://repository.si.edu/bitstream/handle/10088/20925/bci_31August2012_mysql.zip?sequence=12&isAllowed=y

which includes the Remeasurement table, as described here:

https://repository.si.edu/bitstream/handle/10088/20863/CTFSDataModelDocumentation.html?sequence=1&isAllowed=y#Remeasurement

We should cite the archive (per Richard Condit: find it via his website, richardcondit.org/main).

The only issue is that it will require a fair bit of digging around in MySQL. In the meantime, Nadja Ruger will send me the actual data file used in her paper.
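For example, a minimal sketch of pulling the Remeasurement table into R, assuming the dump has been restored into a local MySQL database named bci (the connection details, and the DBI and RMySQL packages, are assumptions; only the table name comes from the CTFS documentation above):

library(DBI)
library(RMySQL)

# Connect to a local MySQL server holding the restored bci dump (details assumed)
con <- dbConnect(RMySQL::MySQL(), dbname = "bci",
                 user = "user", password = "password")

remeasure <- dbReadTable(con, "Remeasurement")  # table name from the CTFS docs
dbDisconnect(con)

str(remeasure)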

She has also sent her code, though I'm struggling to decipher it; most of it is written using MCMCpack and coda.

jscamac commented 9 years ago

I've spent the last couple of days playing around with mixture models - the models used by Ruger, Chave and Condit to model measurement error (see above for a description).

To understand these models better, I've simulated data and attempted to recover the sigmas and mixing proportions using the following code:


library(rstan)
set.seed(1)

n_obs <- 1000
dbh <- rlnorm(n_obs, 3.443309, 0.9253412)  # mock dbh data
sda <- 0.927    # size-independent part of the small-error sd
sdb <- 0.0038   # size-dependent part of the small-error sd
SD2 <- 25.6     # sd of the large-error component
theta <- 0.028  # proportion of large errors

# Most errors are small and size-dependent; a few are large
sim <- c(rnorm(n_obs * (1 - theta), 0, sda + sdb * dbh),
         rnorm(n_obs * theta, 0, SD2))

model <- '
data {
  int<lower=1> n_obs;
  vector[n_obs] sim;
  vector[n_obs] dbh;
}
parameters {
  simplex[2] theta;            // mixing proportions
  positive_ordered[3] sigma;   // ordering helps avoid label switching
}
model {
  theta ~ uniform(0, 1);       // redundant given the simplex constraint, but harmless
  sigma[1] ~ uniform(0, 1);
  sigma[2] ~ uniform(0, 10);
  sigma[3] ~ uniform(0, 100);

  // Mixture of a size-dependent and a size-independent error component
  for (i in 1:n_obs) {
    increment_log_prob(log_sum_exp(log(theta[2]) + normal_log(sim[i], 0, sigma[1] * dbh[i] + sigma[2]),
                                   log(theta[1]) + normal_log(sim[i], 0, sigma[3])));
  }
}'

mod <- stan(model_code = model, data = list(n_obs=n_obs, dbh=dbh, sim=sim), 
            pars = c('sigma','theta'), chains = 1, iter = 500, control=list(adapt_delta=0.9))

Let's just say these types of models are extremely frustrating!

The problem I've found is that because the component locations are identical (i.e. the true value is 0) and only the sigmas and mixing proportions differ, the parameters are poorly identified and label switching occurs. The parameter estimates also appear extremely sensitive to the choice of prior. So I've given up trying to replicate their error model.

jscamac commented 9 years ago

A mixture model that does seem to work fine is one that doesn't attempt to partition the error into size-dependent and size-independent components. Instead it more generally describes the summed distribution of small and large measurement errors, irrespective of size.

This is probably acceptable, as the size effect (sdb) reported in Ruger et al 2011 and Chave 2004 was only 0.0036 (mm scale) and 0.0062 (cm scale) respectively; for a 100 mm stem, the size-dependent term contributes only about 0.36 mm to the error sd.

A mixture model that simply partitions the error into small and large components:

n_obs <- 1000
dbh <- rlnorm(n_obs, 3.443309, 0.9253412)  # mock dbh data
SD1 <- 0.2      # sd of the small-error component
SD2 <- 25.6     # sd of the large-error component
theta <- 0.028  # proportion of large errors

sim <- c(rnorm(n_obs * (1 - theta), 0, SD1), rnorm(n_obs * theta, 0, SD2))

model <- '
data {
  int<lower=1> n_obs;
  vector[n_obs] sim;
  vector[n_obs] dbh;   // passed in but unused in this simplified model
}
parameters {
  simplex[2] theta;            // mixing proportions
  positive_ordered[2] sigma;   // ordering helps avoid label switching
}
model {
  // Mixture of small and large errors, both centred on zero
  for (i in 1:n_obs) {
    increment_log_prob(log_sum_exp(log(theta[2]) + normal_log(sim[i], 0, sigma[1]),
                                   log(theta[1]) + normal_log(sim[i], 0, sigma[2])));
  }
}'

mod <- stan(model_code = model, data = list(n_obs=n_obs, dbh=dbh, sim=sim), 
            pars = c('sigma','theta'), chains = 3, iter = 500, control=list(adapt_delta=0.9))

Output:

Inference for Stan model: model.
3 chains, each with iter=500; warmup=250; thin=1; 
post-warmup draws per chain=250, total post-warmup draws=750.

           mean se_mean   sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
sigma[1]   0.20    0.00 0.00   0.19   0.19   0.20   0.20   0.21   569 1.00
sigma[2]  25.92    0.17 3.51  19.82  23.44  25.61  28.13  32.84   416 1.00
theta[1]   0.03    0.00 0.00   0.02   0.03   0.03   0.03   0.04   436 1.00
theta[2]   0.97    0.00 0.00   0.96   0.97   0.97   0.97   0.98   436 1.00
lp__     -56.70    0.06 1.10 -59.70 -57.18 -56.44 -55.92 -55.48   325 1.01

Samples were drawn using NUTS(diag_e) at Tue Jun 23 16:15:47 2015.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

jscamac commented 9 years ago

Some extra thoughts on how to use this model for other growth measures.

Growth model:

t_dbh1 ~ lognormal(0, 100)
GR ~ lognormal(0, 100)
t_dbh2 <- t_dbh1 + GR
obs_dbh1 ~ (1 - f) * N(t_dbh1, SD1) + f * N(t_dbh1, SD2)  # f, SD1 and SD2 estimated from the remeasurement data
obs_dbh2 ~ (1 - f) * N(t_dbh2, SD1) + f * N(t_dbh2, SD2)

.....

basal_area1 <- 0.25 * pi * t_dbh1^2
basal_area2 <- 0.25 * pi * t_dbh2^2
dbh_gr <- GR / census_length            # (t_dbh2 - t_dbh1) / census_length
basal_gr <- (basal_area2 - basal_area1) / census_length
dbh_gr_rel <- (log(t_dbh2) - log(t_dbh1)) / census_length
basal_gr_rel <- (log(basal_area2) - log(basal_area1)) / census_length

This could be implemented within the mortality model, or potentially estimated outside it with the mean/median growth rate then used.

This should account for measurement error in dbh and force growth to be positive.
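A minimal Stan sketch of this idea, written in the same old-style syntax as the models above. f, SD1 and SD2 are passed in as data (e.g. from the remeasurement fit); all names are illustrative, and this is an assumption about one possible implementation, not the final model.

model <- '
data {
  int<lower=1> n_obs;
  vector[n_obs] obs_dbh1;
  vector[n_obs] obs_dbh2;
  vector[n_obs] census_length;
  real<lower=0, upper=1> f;  // proportion of large errors, from the remeasurement fit
  real<lower=0> SD1;         // sd of small errors
  real<lower=0> SD2;         // sd of large errors
}
parameters {
  vector<lower=0>[n_obs] t_dbh1;  // latent true dbh at census 1
  vector<lower=0>[n_obs] GR;      // latent (positive) dbh increment
}
model {
  t_dbh1 ~ lognormal(0, 100);
  GR ~ lognormal(0, 100);
  for (i in 1:n_obs) {
    real t_dbh2;
    t_dbh2 <- t_dbh1[i] + GR[i];
    // Each observed dbh is the latent true dbh plus a two-component error
    increment_log_prob(log_sum_exp(log1m(f) + normal_log(obs_dbh1[i], t_dbh1[i], SD1),
                                   log(f) + normal_log(obs_dbh1[i], t_dbh1[i], SD2)));
    increment_log_prob(log_sum_exp(log1m(f) + normal_log(obs_dbh2[i], t_dbh2, SD1),
                                   log(f) + normal_log(obs_dbh2[i], t_dbh2, SD2)));
  }
}
generated quantities {
  vector[n_obs] dbh_gr;    // growth on the dbh scale, forced positive
  for (i in 1:n_obs)
    dbh_gr[i] <- GR[i] / census_length[i];
}'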

jscamac commented 9 years ago

I managed to implement the full error model using the idea described above. The stand-alone model can be found here.

For the measurement error model incorporated into the mortality model, see here.