Extremely large predicted values when using 'computePredictedValues' for cross validation

smithja16 commented 1 year ago

Hi there,

I'm currently using a hurdle model to model biomass. When I use 'computePredictedValues' to run 5-fold cross validation, I get absurdly large values for the abundance-only component. As large as 1e+300, when they should be a maximum of ~100. Because I log transform the response, before I use exp() the range of the response is around -5000 to 5000.

This only happens when using spatial latent variables, and I get correctly-sized values when I use non-spatial latent variables. I also get correctly-sized values when I use 'computePredictedValues' to estimate the 'fitted values' (i.e. no partition).

What might be causing this? The only thing I can think of is some crazy extrapolation of the spatial grid (image below), although the problem isn't limited to some observations - all observations are predicted to have absurdly large biomass. Also, would it matter that I use 'latitude' as a fixed effect as well as a component of the spatial LVs?

UPDATE: when I fit a single model, with a normal distribution (i.e. not a hurdle model), the predicted biomasses are reasonable. The model is a poorer fit, but it avoids the absurdly high predicted values, and computePredictedValues for cross validation works. So the problem may be rare species (and few positive observations in that part of the hurdle model) not playing nicely with the spatial random effects? Would some poor fitting rare species inflate all species' out-of-sample predictions though? At the moment, I'm tempted to use a single normal model and truncate negative values.

Below is some code to get an idea of my design:

Yabund <- Y Yabund[Yabund==0] <- NA Y_abund_log <- log(Y_abund) XFormula_abund = ~Latitude + x2 + poly(x3, degree=2, raw=T) + poly(x4, degree=2, raw=T) + x5

xycoords <- X[,c("Longitude", "Latitude")] Knots <- constructKnots(xycoords, knotDist = 0.2, minKnotDist = 0.4) rL_spat <- HmscRandomLevel(sData = xycoords, sMethod = 'GPP', sKnot = Knots)

studyDesign <- data.frame(sample=as.factor(1:nrow(X)), vessel=droplevels(X$vessel)) rL_v <- HmscRandomLevel(units=studyDesign$vessel)

M_abund <- Hmsc(Y=Y_abund_log, XData=X, XFormula=XFormula_abund, studyDesign=studyDesign, ranLevels=list(sample=rL_spat, vessel=rL_v), distr="normal", YScale=TRUE)

M_abund <- sampleMcmc(M_abund, thin=20, samples=1000, transient=10000, nChains=3, nParallel=3)

partition = createPartition(hM = M_abund, nfolds = 5, column = "sample") predsA = computePredictedValues(M_abund, partition=partition, expected=TRUE) plot(colMeans(exp(predsSA[,,1]))) #absurdly high values ex <- evaluateModelFit(M_abund, predsA) ex$RMSE #absurdly high values

Spatial grid

gtikhonov commented 1 year ago

Hi! Thanks for the detailed description of the issue you have faced. Given that the problem manifests in the hurdle model, but not in the model without NAs in Y matrix, I would suspect that this may be related to one of modifications that we introduced after the package release to enhance the mixing for cases with larger proportions of NAs in Y matrix. Maybe we introduced some bug alongside the intended improvement.

However, before I jump to scrutinise the package code, I would like to ask you to report some more diagnostics. To begin with, could you tell, whether this misbehaviour with enormous predicted values happens for certain elements of partition once you run the computePredictedValues(...) method? Or whether you get them for all elements of partition?

smithja16 commented 1 year ago

Hi. Thanks for getting back to me.

I've looked at partition level predictions, and all partitions have very large values, although for one model partition 1 had much lower values (max 1e+14, compared to 1e+100 for other partitions).

But I've just noticed something odd:

partition = createPartition(hM = Mabund, nfolds = 5, column = "sample") predsA = computePredictedValues(M_abund, partition=partition, expected=TRUE) predsA_mean <- exp(apply(predsA, c(1,2), mean)) #this gives mostly normal sized values! predsA_mean2 <- apply(exp(predsA), c(1,2), mean) #this gives absurd sized values

So, when I'm back-calculating the log transform, if I do that after I take the mean of all the samples, I get mostly normal sized predictions, but if I do the back-transform on individual samples I get crazy high predictions. So an individual sample is not accurate, but their mean before back-transforming is? Could that arise when the chains are mixing correctly about the 'true' estimate, but their magnitude is way off?

ovaskain commented 1 year ago

Hi,

“Extremely large values in predictions” happens quite easily in models that involve the log-transformation that is counteracted by the exp-transformation, especially if those models involve a lot of parameter uncertainty / residual variation. This is because unusual values of linear predictor, generated by the tails of the distributions, can lead to extremely large values after the exp-transformation, which makes posterior mean a bad summary of the predictions. My recommendation would be to use in such cases posterior median instead of posterior mean, as that is not influenced by such outliers in the predictive distribution.

O2

From: James Smith @.> Sent: keskiviikko 8. marraskuuta 2023 0:56 To: hmsc-r/HMSC @.> Cc: Subscribed @.***> Subject: Re: [hmsc-r/HMSC] Extremely large predicted values when using 'computePredictedValues' for cross validation (Issue #169)

Hi. Thanks for getting back to me.

I've looked at partition level predictions, and all partitions have very large values, although partition 1 has much lower values (max 1e+14, compared to 1e+100 for other partitions).

But I've just noticed something odd:

partition = createPartition(hM = Mabund, nfolds = 5, column = "sample") predsA = computePredictedValues(M_abund, partition=partition, expected=TRUE) predsA_mean <- exp(apply(predsA, c(1,2), mean)) #this gives normal sized values! predsA_mean2 <- apply(exp(predsA), c(1,2), mean) #this gives absurd sized values

So, when I'm back-calculating the log transform, if I do that after I take the mean of all the samples, I get normal sized predictions, but if I do the back-transform on individual samples I get crazy high predictions. So an individual sample is not accurate, but their mean before back-transforming is?

— Reply to this email directly, view it on GitHubhttps://github.com/hmsc-r/HMSC/issues/169#issuecomment-1800330798, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AEIYMZXLG4DMFLRX5ORN3YDYDK37FAVCNFSM6AAAAAA66UOVT6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMBQGMZTANZZHA. You are receiving this because you are subscribed to this thread.Message ID: @.**@.>>

smithja16 commented 1 year ago

Thanks for the advice - using the median makes sense. For some of my models it largely solves the problem, for others - perhaps those with more complexity - the problem of very large values persists when using the median. Perhaps longer chains will help, but perhaps I'm also asking too much of the data?

On a related note, which of the following do you think makes the most sense when multiplying the two parts of the hurdle model together to estimate total biomass: 1) calculate the mean or median from the samples for both hurdle components, then multiply them; or 2) multiply them, sample by sample, then take the mean or median of those numerous totals?

Hi, “Extremely large values in predictions” happens quite easily in models that involve the log-transformation that is counteracted by the exp-transformation, especially if those models involve a lot of parameter uncertainty / residual variation. This is because unusual values of linear predictor, generated by the tails of the distributions, can lead to extremely large values after the exp-transformation, which makes posterior mean a bad summary of the predictions. My recommendation would be to use in such cases posterior median instead of posterior mean, as that is not influenced by such outliers in the predictive distribution. O2 From: James Smith @.> Sent: keskiviikko 8. marraskuuta 2023 0:56 To: hmsc-r/HMSC @.> Cc: Subscribed @.> Subject: Re: [hmsc-r/HMSC] Extremely large predicted values when using 'computePredictedValues' for cross validation (Issue #169) Hi. Thanks for getting back to me. I've looked at partition level predictions, and all partitions have very large values, although partition 1 has much lower values (max 1e+14, compared to 1e+100 for other partitions). But I've just noticed something odd: partition = createPartition(hM = Mabund, nfolds = 5, column = "sample") predsA = computePredictedValues(M_abund, partition=partition, expected=TRUE) predsA_mean <- exp(apply(predsA, c(1,2), mean)) #this gives normal sized values! predsA_mean2 <- apply(exp(predsA), c(1,2), mean) #this gives absurd sized values So, when I'm back-calculating the log transform, if I do that after I take the mean of all the samples, I get normal sized predictions, but if I do the back-transform on individual samples I get crazy high predictions. So an individual sample is not accurate, but their mean before back-transforming is? — Reply to this email directly, view it on GitHub<#169 (comment)>, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AEIYMZXLG4DMFLRX5ORN3YDYDK37FAVCNFSM6AAAAAA66UOVT6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMBQGMZTANZZHA. You are receiving this because you are subscribed to this thread.Message ID: @*.**@*.***>>

ovaskain commented 1 year ago

I think the following is most natural: Generate a realization of both models from a single posterior sample (0/1 for p-a and actual count for abundance conditional on presence), and multiply them to get a prediction of count (which may be zero). Repeat this over the posterior distribution (and realization stochasticity) to get a predictive distribution of counts, and then summarize that in terms of medians etc. Note that the two posteriors are not connected because the models are fitted separately but you can still e.g. use sample 1 from p/a and sample 1 from abuc when doing the above.

O2

From: James Smith @.> Sent: torstai 9. marraskuuta 2023 4:54 To: hmsc-r/HMSC @.> Cc: Ovaskainen, Otso T @.>; Comment @.> Subject: Re: [hmsc-r/HMSC] Extremely large predicted values when using 'computePredictedValues' for cross validation (Issue #169)

Thanks for the advice - using the median makes sense. For some of my models it largely solves the problem, for others - perhaps those with more complexity - the problem of very large values persists when using the median. Perhaps longer chains will help, but perhaps I'm also asking too much of the data?

On a related note, which of the following do you think makes the most sense when multiplying the two parts of the hurdle model together to estimate total biomass: 1) calculate the mean or median from the samples for both hurdle components, then multiply them; or 2) multiply them, sample by sample, then take the mean or median of those numerous totals?

Hi, “Extremely large values in predictions” happens quite easily in models that involve the log-transformation that is counteracted by the exp-transformation, especially if those models involve a lot of parameter uncertainty / residual variation. This is because unusual values of linear predictor, generated by the tails of the distributions, can lead to extremely large values after the exp-transformation, which makes posterior mean a bad summary of the predictions. My recommendation would be to use in such cases posterior median instead of posterior mean, as that is not influenced by such outliers in the predictive distribution. O2 From: James Smith @.> Sent: keskiviikko 8. marraskuuta 2023 0:56 To: hmsc-r/HMSC @.> Cc: Subscribed @.> Subject: Re: [hmsc-r/HMSC] Extremely large predicted values when using 'computePredictedValues' for cross validation (Issue #169https://github.com/hmsc-r/HMSC/issues/169) Hi. Thanks for getting back to me. I've looked at partition level predictions, and all partitions have very large values, although partition 1 has much lower values (max 1e+14, compared to 1e+100 for other partitions). But I've just noticed something odd: partition = createPartition(hM = Mabund, nfolds = 5, column = "sample") predsA = computePredictedValues(M_abund, partition=partition, expected=TRUE) predsA_mean <- exp(apply(predsA, c(1,2), mean)) #this gives normal sized values! predsA_mean2 <- apply(exp(predsA), c(1,2), mean) #this gives absurd sized values So, when I'm back-calculating the log transform, if I do that after I take the mean of all the samples, I get normal sized predictions, but if I do the back-transform on individual samples I get crazy high predictions. So an individual sample is not accurate, but their mean before back-transforming is? — Reply to this email directly, view it on GitHub<#169 (comment)https://github.com/hmsc-r/HMSC/issues/169#issuecomment-1800330798>, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AEIYMZXLG4DMFLRX5ORN3YDYDK37FAVCNFSM6AAAAAA66UOVT6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMBQGMZTANZZHA. You are receiving this because you are subscribed to this thread.Message ID: @.@.***>>

— Reply to this email directly, view it on GitHubhttps://github.com/hmsc-r/HMSC/issues/169#issuecomment-1803083552, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AEIYMZUH3QZSEJQBS7SD6B3YDRAVZAVCNFSM6AAAAAA66UOVT6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMBTGA4DGNJVGI. You are receiving this because you commented.Message ID: @.**@.>>

hmsc-r / HMSC

Extremely large predicted values when using 'computePredictedValues' for cross validation #169