Checking coefficient uncertainty

aminorberg commented 4 years ago

Hi,

I've fitted a CRF and I'm trying to figure out my results regarding the significance of my associations between species. With 'bootstrap_MRF' and 'plotMRF_hm' I get the confidence intervals, but I do not understand how (some of) the means can be outside the interval? I'm sorry if this is very trivial, I'm new to this method.

BR,

Anna

nicholasjclark commented 4 years ago

Hi Anna, Thank you for your email and for taking an interest in the package. I'm sorry I haven't replied to that request, I thought that Github would alert me when an issue was raised but it apparently does not do that, so I haven't seen it until now. I would say that if there is a skew in the coefficient estimates then it could be possible that the mean is outside the 5% and 95% quantiles. The mean can be heavily influenced by outliers, whereas the quantiles are not (https://www.clinfo.eu/mean-median/). Can you give me some more information on your bootstrapping strategy? For example, it may be necessary to perform more iterations or to alter the proportion of the data that you sample.

I'm actually working on incorporating uncertainties into parameter estimates to overcome the issues of bootstrapping on small data, but this is just a work in progress at the moment. Unfortunately glmnet (the workhorse of MRFcov) is not a typical linear regression algorithm, but a penalization algorithm that induces some bias and makes it very challenging to estimate coefficient standard errors. I think bootstrapping is probably still the most intuitive way to get around this but in cases where there may be skew and it may be challenging to get robust estimates by downsampling the data (i.e. every bootstrap run can give fairly different results when the dataset is is not very large to begin with), you could instead shuffle the data but keep the number of observations the same, and then try to simulate uncertainty statistics by assuming coefficients are drawn from a multivariate normal distribution parameterised by the coefficient variance-covariance matrix (this is appropriate for simulating predictions from linear models, so using it here is not too much of a stretch; https://fromthebottomoftheheap.net/2014/06/16/simultaneous-confidence-intervals-for-derivatives/). I'm hoping to have this incorporated in the near future.

nicholasjclark commented 3 years ago

This can feasibly be done using the code below, though I am hesitant to include it into the package formally for the reasons specified in the comments at the end of the code block:

library(MRFcov) library(dplyr) data("Bird.parasites")

To estimate uncertainties in coefficients, first, we have to fit many models (at least 10?) to

generate estimates of the coefficient parameter covariance matrix

testing <- lapply(seq(1, 10), function(x){

Bootstrap resampling of the data in each iteration by simply reshuffling the observations

crf <- MRFcov(Bird.parasites[sample(1:nrow(Bird.parasites),

replace = F),], n_nodes = 4,

family = 'binomial', n_cores = 1, progress_bar = F)

Return a list of estimated parameters for each species in the data

apply(crf$direct_coefs, 1, list)

}) Leave-one-out cv used for the following low-occurrence (rare) nodes: Microfilaria ... Fitting MRF models in sequence using 1 core ... Leave-one-out cv used for the following low-occurrence (rare) nodes: Microfilaria ... Fitting MRF models in sequence using 1 core ... Leave-one-out cv used for the following low-occurrence (rare) nodes: Microfilaria ... Fitting MRF models in sequence using 1 core ... Leave-one-out cv used for the following low-occurrence (rare) nodes: Microfilaria ... Fitting MRF models in sequence using 1 core ... Leave-one-out cv used for the following low-occurrence (rare) nodes: Microfilaria ... Fitting MRF models in sequence using 1 core ... Leave-one-out cv used for the following low-occurrence (rare) nodes: Microfilaria ... Fitting MRF models in sequence using 1 core ... Leave-one-out cv used for the following low-occurrence (rare) nodes: Microfilaria ... Fitting MRF models in sequence using 1 core ... Leave-one-out cv used for the following low-occurrence (rare) nodes: Microfilaria ... Fitting MRF models in sequence using 1 core ... Leave-one-out cv used for the following low-occurrence (rare) nodes: Microfilaria ... Fitting MRF models in sequence using 1 core ... Leave-one-out cv used for the following low-occurrence (rare) nodes: Microfilaria ... Fitting MRF models in sequence using 1 core ...

Calculate the vcv matrix for each species in the data and generate 100 simulated estimates

of coefficients by drawing from a multivariate normal distribution (assumed to be appropriate

given that this is a generalised linear regression problem)

sim_paths <- lapply(seq_along(names(testing[[1]])), function(x){

sp_list <- testing %>%

purrr::map(x) %>%

purrr::flatten()

full_param_mat <- do.call(rbind, sp_list)

The vcv matrix

vcv_mat <- cov(full_param_mat)

The parameter means

mus <- apply(full_param_mat, 2, mean)

Simulate coefficients from a multivariate normal distribution

sim <- MASS::mvrnorm(100, mu = mus, Sigma = vcv_mat)

}) names(sim_paths) <- names(testing[[1]])

Check some of the simulated estimates for species 1 (Hzosteropis)

head(sim_paths[[1]]) Intercept Hzosteropis Hkillangoi Plas Microfilaria scale.prop.zos [1,] -1.0034903 -7.152580e-15 -2.841025 -0.3310277 0.9089574 -0.9394247 [2,] -1.0161602 4.641943e-14 -2.506783 -0.3221955 0.9146944 -0.8929493 [3,] -0.9989029 -1.197097e-13 -3.169487 -0.3415931 0.9053691 -0.9522361 [4,] -1.0108858 1.844570e-13 -2.405123 -0.3219035 0.9130751 -0.9151168 [5,] -1.0207495 -3.680488e-15 -2.200761 -0.3153974 0.9175187 -0.8789503 [6,] -1.0180800 1.133497e-13 -1.921294 -0.3277349 0.9173075 -0.8929733 scale.prop.zos_Hzosteropis scale.prop.zos_Hkillangoi scale.prop.zos_Plas [1,] 2.545580e-20 0.6990579 -0.1806504 [2,] -8.387667e-21 0.2118538 -0.1605244 [3,] 7.958944e-21 1.6623147 -0.2017405 [4,] 9.982446e-21 -0.1452001 -0.1612446 [5,] -4.945936e-21 -0.4933718 -0.1477781 [6,] -9.507226e-21 -1.2092291 -0.1650154 scale.prop.zos_Microfilaria [1,] -1.080913 [2,] -1.031686 [3,] -1.103129 [4,] -1.049996 [5,] -1.011738 [6,] -1.019309

Mean estimates for the coefficients for species 1 (Hzosteropis)

round(apply(sim_paths[[1]], 2, mean), 4) Intercept Hzosteropis Hkillangoi -1.0085 0.0000 -2.7238 Plas Microfilaria scale.prop.zos -0.3310 0.9108 -0.9200 scale.prop.zos_Hzosteropis scale.prop.zos_Hkillangoi scale.prop.zos_Plas 0.0000 0.6727 -0.1788 scale.prop.zos_Microfilaria -1.0625

Quantiles of estimates for the coefficients for species 1 (Hzosteropis)

round(apply(sim_paths[[1]], 2, quantile), 4) Intercept Hzosteropis Hkillangoi Plas Microfilaria scale.prop.zos 0% -1.0359 0 -3.6314 -0.3857 0.8963 -1.0388 25% -1.0154 0 -2.9895 -0.3424 0.9069 -0.9449 50% -1.0081 0 -2.7734 -0.3308 0.9106 -0.9214 75% -1.0012 0 -2.5063 -0.3204 0.9138 -0.8942 100% -0.9767 0 -1.3835 -0.2826 0.9258 -0.8237 scale.prop.zos_Hzosteropis scale.prop.zos_Hkillangoi scale.prop.zos_Plas 0% 0 -2.1970 -0.2754 25% 0 0.2327 -0.2004 50% 0 0.7457 -0.1791 75% 0 1.3081 -0.1608 100% 0 2.5286 -0.0948 scale.prop.zos_Microfilaria 0% -1.1868 25% -1.0929 50% -1.0639 75% -1.0365 100% -0.9508

Propagating these uncertaintes to prediction

In each prediction step, we need to take the same row from each

simulation matrix and use that as the 'direct_coefs' slot in the model

Function to generate a dataframe from each simulated set of CRF coefficients

generate_simcoefs = function(simulations, row_index){

do.call(rbind, lapply(simulations, function(x){

x[row_index, ]

}))

}

Convert the simulations into separate 'direct_coefs' dataframes for prediction intervals

sim_coefs <- lapply(seq_len(nrow(sim_paths[[1]])), function(j){

generate_simcoefs(simulations = sim_paths, row_index = j)

})

Check one of the simulated coefficient slots, which will be used for prediction below

sim_coefs[[1]] Intercept Hzosteropis Hkillangoi Plas Microfilaria scale.prop.zos Hzosteropis -1.003490 -7.152580e-15 -2.841025e+00 -3.310277e-01 9.089574e-01 -0.9394247 Hkillangoi -1.629084 -1.467436e+00 1.158869e-10 8.379938e-03 -9.877206e-01 -0.8150480 Plas -1.574132 -3.434152e-01 -1.224450e-02 -3.904241e-15 1.557861e+00 -0.9994548 Microfilaria -4.049569 9.158122e-01 -9.602533e-01 1.520180e+00 -3.762592e-10 -1.1184028 scale.prop.zos_Hzosteropis scale.prop.zos_Hkillangoi scale.prop.zos_Plas Hzosteropis 2.545580e-20 0.69905789 -0.180650433 Hkillangoi -2.121886e+00 0.00000000 0.008593823 Plas -1.969776e-01 -0.01178616 0.000000000 Microfilaria -1.028949e+00 0.00000000 0.413485778 scale.prop.zos_Microfilaria Hzosteropis -1.0809131 Hkillangoi 0.0000000 Plas 0.4348382 Microfilaria 0.0000000

Try generating some prediction intervals to see that it works

First we need a single fit of the crf so that all other slots (besides the direct_coefs slot)

are properly structured for the prediction function to work

crf <- MRFcov(Bird.parasites, n_nodes = 4,

family = 'binomial', n_cores = 1) Leave-one-out cv used for the following low-occurrence (rare) nodes: Microfilaria ... Fitting MRF models in sequence using 1 core ...

In each iteration, replace the direct_coefs slot with a simulated path and generate

probability predictions

sim_preds <- lapply(seq_len(nrow(sim_paths[[1]])), function(j){

new_crf <- crf

new_crf$direct_coefs <- sim_coefs[[j]]

preds <- predict_MRF(data = Bird.parasites, MRF_mod = new_crf)$Probability_predictions

rm(new_crf)

preds

})

Function for calculating prediction intervals

generate_pred_intervals = function(predictions, row_index){

do.call(rbind, lapply(predictions, function(x){

x[row_index, ]

}))

}

Summarise the predictions into a unique dataframe per observation to calculate quantiles

pred_quantiles <- lapply(seq_len(nrow(Bird.parasites)), function(j){

observation_pred <- generate_pred_intervals(predictions = sim_preds, row_index = j)

apply(observation_pred, 2, quantile)

})

Pred quantiles is a list, one slot for each observation in the data, with the uncertanties in

probability predictions stored. View one of them

pred_quantiles[[100]] Hzosteropis Hkillangoi Plas Microfilaria 0% 0.52870 0.000100 0.36920 0.425000 25% 0.55275 0.002000 0.40910 0.453200 50% 0.56190 0.007550 0.41990 0.463950 75% 0.56980 0.041325 0.43135 0.474525 100% 0.60100 0.715400 0.46550 0.503400

The truth at observation 100

Bird.parasites[100, 1:4]

A tibble: 1 x 4

Hzosteropis Hkillangoi Plas Microfilaria
1 1 0 0 0 # Note this is still only an approximation that ignores the bias resulting from the # LASSO penalization. In other words, we are no longer looking for the least squares coefficients # that minimize the squared error loss. Instead, we are finding the best predictive estimates # subject to the regularization of the model. In our case, this is further magnified by the # fact that these estimates are symmetrised versions that were originally drawn from different # LASSO models that likely were under different amounts of penalization. At best, these # confidence intervals should be interpreted as uncertainty surrounding a variable's predictive # information content, rather than a statistical property of the distribution describing the # true relationship between that variable and the outcome

nicholasjclark / MRFcov

Checking coefficient uncertainty #26

To estimate uncertainties in coefficients, first, we have to fit many models (at least 10?) to

generate estimates of the coefficient parameter covariance matrix

Bootstrap resampling of the data in each iteration by simply reshuffling the observations

Return a list of estimated parameters for each species in the data

Calculate the vcv matrix for each species in the data and generate 100 simulated estimates

of coefficients by drawing from a multivariate normal distribution (assumed to be appropriate

given that this is a generalised linear regression problem)

The vcv matrix

The parameter means

Simulate coefficients from a multivariate normal distribution

Check some of the simulated estimates for species 1 (Hzosteropis)

Mean estimates for the coefficients for species 1 (Hzosteropis)

Quantiles of estimates for the coefficients for species 1 (Hzosteropis)

Propagating these uncertaintes to prediction

In each prediction step, we need to take the same row from each

simulation matrix and use that as the 'direct_coefs' slot in the model

Function to generate a dataframe from each simulated set of CRF coefficients

Convert the simulations into separate 'direct_coefs' dataframes for prediction intervals

Check one of the simulated coefficient slots, which will be used for prediction below

Try generating some prediction intervals to see that it works

First we need a single fit of the crf so that all other slots (besides the direct_coefs slot)

are properly structured for the prediction function to work

In each iteration, replace the direct_coefs slot with a simulated path and generate

probability predictions

Function for calculating prediction intervals

Summarise the predictions into a unique dataframe per observation to calculate quantiles

Pred quantiles is a list, one slot for each observation in the data, with the uncertanties in

probability predictions stored. View one of them

The truth at observation 100

A tibble: 1 x 4