Closed aminorberg closed 3 years ago
Hi Anna, Thank you for your email and for taking an interest in the package. I'm sorry I haven't replied to that request, I thought that Github would alert me when an issue was raised but it apparently does not do that, so I haven't seen it until now. I would say that if there is a skew in the coefficient estimates then it could be possible that the mean is outside the 5% and 95% quantiles. The mean can be heavily influenced by outliers, whereas the quantiles are not (https://www.clinfo.eu/mean-median/). Can you give me some more information on your bootstrapping strategy? For example, it may be necessary to perform more iterations or to alter the proportion of the data that you sample.
I'm actually working on incorporating uncertainties into parameter estimates to overcome the issues of bootstrapping on small data, but this is just a work in progress at the moment. Unfortunately glmnet (the workhorse of MRFcov) is not a typical linear regression algorithm, but a penalization algorithm that induces some bias and makes it very challenging to estimate coefficient standard errors. I think bootstrapping is probably still the most intuitive way to get around this but in cases where there may be skew and it may be challenging to get robust estimates by downsampling the data (i.e. every bootstrap run can give fairly different results when the dataset is is not very large to begin with), you could instead shuffle the data but keep the number of observations the same, and then try to simulate uncertainty statistics by assuming coefficients are drawn from a multivariate normal distribution parameterised by the coefficient variance-covariance matrix (this is appropriate for simulating predictions from linear models, so using it here is not too much of a stretch; https://fromthebottomoftheheap.net/2014/06/16/simultaneous-confidence-intervals-for-derivatives/). I'm hoping to have this incorporated in the near future.
This can feasibly be done using the code below, though I am hesitant to include it into the package formally for the reasons specified in the comments at the end of the code block:
library(MRFcov) library(dplyr) data("Bird.parasites")
To estimate uncertainties in coefficients, first, we have to fit many models (at least 10?) to
generate estimates of the coefficient parameter covariance matrix
testing <- lapply(seq(1, 10), function(x){
Bootstrap resampling of the data in each iteration by simply reshuffling the observations
- crf <- MRFcov(Bird.parasites[sample(1:nrow(Bird.parasites),
- replace = F),], n_nodes = 4,
- family = 'binomial', n_cores = 1, progress_bar = F)
Return a list of estimated parameters for each species in the data
- apply(crf$direct_coefs, 1, list)
- }) Leave-one-out cv used for the following low-occurrence (rare) nodes: Microfilaria ... Fitting MRF models in sequence using 1 core ... Leave-one-out cv used for the following low-occurrence (rare) nodes: Microfilaria ... Fitting MRF models in sequence using 1 core ... Leave-one-out cv used for the following low-occurrence (rare) nodes: Microfilaria ... Fitting MRF models in sequence using 1 core ... Leave-one-out cv used for the following low-occurrence (rare) nodes: Microfilaria ... Fitting MRF models in sequence using 1 core ... Leave-one-out cv used for the following low-occurrence (rare) nodes: Microfilaria ... Fitting MRF models in sequence using 1 core ... Leave-one-out cv used for the following low-occurrence (rare) nodes: Microfilaria ... Fitting MRF models in sequence using 1 core ... Leave-one-out cv used for the following low-occurrence (rare) nodes: Microfilaria ... Fitting MRF models in sequence using 1 core ... Leave-one-out cv used for the following low-occurrence (rare) nodes: Microfilaria ... Fitting MRF models in sequence using 1 core ... Leave-one-out cv used for the following low-occurrence (rare) nodes: Microfilaria ... Fitting MRF models in sequence using 1 core ... Leave-one-out cv used for the following low-occurrence (rare) nodes: Microfilaria ... Fitting MRF models in sequence using 1 core ...
Calculate the vcv matrix for each species in the data and generate 100 simulated estimates
of coefficients by drawing from a multivariate normal distribution (assumed to be appropriate
given that this is a generalised linear regression problem)
sim_paths <- lapply(seq_along(names(testing[[1]])), function(x){
- sp_list <- testing %>%
- purrr::map(x) %>%
- purrr::flatten()
- full_param_mat <- do.call(rbind, sp_list)
The vcv matrix
- vcv_mat <- cov(full_param_mat)
The parameter means
- mus <- apply(full_param_mat, 2, mean)
Simulate coefficients from a multivariate normal distribution
- sim <- MASS::mvrnorm(100, mu = mus, Sigma = vcv_mat)
- }) names(sim_paths) <- names(testing[[1]])
Check some of the simulated estimates for species 1 (Hzosteropis)
head(sim_paths[[1]]) Intercept Hzosteropis Hkillangoi Plas Microfilaria scale.prop.zos [1,] -1.0034903 -7.152580e-15 -2.841025 -0.3310277 0.9089574 -0.9394247 [2,] -1.0161602 4.641943e-14 -2.506783 -0.3221955 0.9146944 -0.8929493 [3,] -0.9989029 -1.197097e-13 -3.169487 -0.3415931 0.9053691 -0.9522361 [4,] -1.0108858 1.844570e-13 -2.405123 -0.3219035 0.9130751 -0.9151168 [5,] -1.0207495 -3.680488e-15 -2.200761 -0.3153974 0.9175187 -0.8789503 [6,] -1.0180800 1.133497e-13 -1.921294 -0.3277349 0.9173075 -0.8929733 scale.prop.zos_Hzosteropis scale.prop.zos_Hkillangoi scale.prop.zos_Plas [1,] 2.545580e-20 0.6990579 -0.1806504 [2,] -8.387667e-21 0.2118538 -0.1605244 [3,] 7.958944e-21 1.6623147 -0.2017405 [4,] 9.982446e-21 -0.1452001 -0.1612446 [5,] -4.945936e-21 -0.4933718 -0.1477781 [6,] -9.507226e-21 -1.2092291 -0.1650154 scale.prop.zos_Microfilaria [1,] -1.080913 [2,] -1.031686 [3,] -1.103129 [4,] -1.049996 [5,] -1.011738 [6,] -1.019309
Mean estimates for the coefficients for species 1 (Hzosteropis)
round(apply(sim_paths[[1]], 2, mean), 4) Intercept Hzosteropis Hkillangoi -1.0085 0.0000 -2.7238 Plas Microfilaria scale.prop.zos -0.3310 0.9108 -0.9200 scale.prop.zos_Hzosteropis scale.prop.zos_Hkillangoi scale.prop.zos_Plas 0.0000 0.6727 -0.1788 scale.prop.zos_Microfilaria -1.0625
Quantiles of estimates for the coefficients for species 1 (Hzosteropis)
round(apply(sim_paths[[1]], 2, quantile), 4) Intercept Hzosteropis Hkillangoi Plas Microfilaria scale.prop.zos 0% -1.0359 0 -3.6314 -0.3857 0.8963 -1.0388 25% -1.0154 0 -2.9895 -0.3424 0.9069 -0.9449 50% -1.0081 0 -2.7734 -0.3308 0.9106 -0.9214 75% -1.0012 0 -2.5063 -0.3204 0.9138 -0.8942 100% -0.9767 0 -1.3835 -0.2826 0.9258 -0.8237 scale.prop.zos_Hzosteropis scale.prop.zos_Hkillangoi scale.prop.zos_Plas 0% 0 -2.1970 -0.2754 25% 0 0.2327 -0.2004 50% 0 0.7457 -0.1791 75% 0 1.3081 -0.1608 100% 0 2.5286 -0.0948 scale.prop.zos_Microfilaria 0% -1.1868 25% -1.0929 50% -1.0639 75% -1.0365 100% -0.9508
Propagating these uncertaintes to prediction
In each prediction step, we need to take the same row from each
simulation matrix and use that as the 'direct_coefs' slot in the model
Function to generate a dataframe from each simulated set of CRF coefficients
generate_simcoefs = function(simulations, row_index){
- do.call(rbind, lapply(simulations, function(x){
- x[row_index, ]
- }))
- }
Convert the simulations into separate 'direct_coefs' dataframes for prediction intervals
sim_coefs <- lapply(seq_len(nrow(sim_paths[[1]])), function(j){
- generate_simcoefs(simulations = sim_paths, row_index = j)
- })
Check one of the simulated coefficient slots, which will be used for prediction below
sim_coefs[[1]] Intercept Hzosteropis Hkillangoi Plas Microfilaria scale.prop.zos Hzosteropis -1.003490 -7.152580e-15 -2.841025e+00 -3.310277e-01 9.089574e-01 -0.9394247 Hkillangoi -1.629084 -1.467436e+00 1.158869e-10 8.379938e-03 -9.877206e-01 -0.8150480 Plas -1.574132 -3.434152e-01 -1.224450e-02 -3.904241e-15 1.557861e+00 -0.9994548 Microfilaria -4.049569 9.158122e-01 -9.602533e-01 1.520180e+00 -3.762592e-10 -1.1184028 scale.prop.zos_Hzosteropis scale.prop.zos_Hkillangoi scale.prop.zos_Plas Hzosteropis 2.545580e-20 0.69905789 -0.180650433 Hkillangoi -2.121886e+00 0.00000000 0.008593823 Plas -1.969776e-01 -0.01178616 0.000000000 Microfilaria -1.028949e+00 0.00000000 0.413485778 scale.prop.zos_Microfilaria Hzosteropis -1.0809131 Hkillangoi 0.0000000 Plas 0.4348382 Microfilaria 0.0000000
Try generating some prediction intervals to see that it works
First we need a single fit of the crf so that all other slots (besides the direct_coefs slot)
are properly structured for the prediction function to work
crf <- MRFcov(Bird.parasites, n_nodes = 4,
- family = 'binomial', n_cores = 1) Leave-one-out cv used for the following low-occurrence (rare) nodes: Microfilaria ... Fitting MRF models in sequence using 1 core ...
In each iteration, replace the direct_coefs slot with a simulated path and generate
probability predictions
sim_preds <- lapply(seq_len(nrow(sim_paths[[1]])), function(j){
- new_crf <- crf
- new_crf$direct_coefs <- sim_coefs[[j]]
- preds <- predict_MRF(data = Bird.parasites, MRF_mod = new_crf)$Probability_predictions
- rm(new_crf)
- preds
- })
Function for calculating prediction intervals
generate_pred_intervals = function(predictions, row_index){
- do.call(rbind, lapply(predictions, function(x){
- x[row_index, ]
- }))
- }
Summarise the predictions into a unique dataframe per observation to calculate quantiles
pred_quantiles <- lapply(seq_len(nrow(Bird.parasites)), function(j){
- observation_pred <- generate_pred_intervals(predictions = sim_preds, row_index = j)
- apply(observation_pred, 2, quantile)
- })
Pred quantiles is a list, one slot for each observation in the data, with the uncertanties in
probability predictions stored. View one of them
pred_quantiles[[100]] Hzosteropis Hkillangoi Plas Microfilaria 0% 0.52870 0.000100 0.36920 0.425000 25% 0.55275 0.002000 0.40910 0.453200 50% 0.56190 0.007550 0.41990 0.463950 75% 0.56980 0.041325 0.43135 0.474525 100% 0.60100 0.715400 0.46550 0.503400
The truth at observation 100
Bird.parasites[100, 1:4]
A tibble: 1 x 4
Hzosteropis Hkillangoi Plas Microfilaria
1 1 0 0 0 # Note this is still only an approximation that ignores the bias resulting from the # LASSO penalization. In other words, we are no longer looking for the least squares coefficients # that minimize the squared error loss. Instead, we are finding the best predictive estimates # subject to the regularization of the model. In our case, this is further magnified by the # fact that these estimates are symmetrised versions that were originally drawn from different # LASSO models that likely were under different amounts of penalization. At best, these # confidence intervals should be interpreted as uncertainty surrounding a variable's predictive # information content, rather than a statistical property of the distribution describing the # true relationship between that variable and the outcome
Hi,
I've fitted a CRF and I'm trying to figure out my results regarding the significance of my associations between species. With 'bootstrap_MRF' and 'plotMRF_hm' I get the confidence intervals, but I do not understand how (some of) the means can be outside the interval? I'm sorry if this is very trivial, I'm new to this method.
BR,
Anna