Invalid output when using 'add_biodiversity_poipo'

Daviser95 commented 2 months ago

Hi, After making some tries with classic presence/pseudo-absences datasets, I am trying to fit some models using a presence-only dataset. However, when using this option I get always strange outputs from the trainfunction, which I attached below:

issue

This is my code:

#load variables
path_vars_100<-"G:/davide/Namibia/variables_100m/selected_current" ### qui metti il percorso alla cartella dove stanno le variabili
env100 <- list.files(path_vars_100,pattern=".tif$")
#setwd()
EnvVars<-rast(lapply(env100,function(x){
  var<-rast(paste0(path_vars_100,"/",x))
  return(var)}))
envars_res <- terra::aggregate(EnvVars, fact = 5)
envars_nrm <- ibis.iSDM::predictor_transform(envars_res, "norm")
envars_ready <- ibis.iSDM::predictor_homogenize_na(envars_nrm)

background <- rast("G:/davide/Namibia/variables_100m/all_final_variables/dem_namibia_100m_wgs.tif")
background_res <- terra::aggregate(background, fact= 5)
background_res[background_res<0] <- 1
background_res[background_res>0] <- 1

#load global ensemble
Africa_shape <- st_read("F:/Polyclada_Africa/afr_g2014_2013_0.shp")
ensemble_africa <- rast("G:/davide/Namibia/africa_modeling/predictions/ens_xgball_wmean.tif")
ensemble_africa_cropped <- ensemble_africa |>
  crop(Africa_shape) |>
  mask(Africa_shape)

#load dataset
data_namibia1 <- read.csv("G:/davide/Namibia/datasets/dat1.csv") %>% 
  st_as_sf(coords=c("Longitude","Latitude"), crs=4326)
data_namibia1_f <- thin_observations(data_namibia1, background_res, method = "random", remainpoints = 1)
data_namibia1_f$pres_abs <- 1

# modeling
#dat1
mod1_xgb <- distribution(background_res) |>
  add_predictor_range(layer = range_report, distance_max = 5e4) |>
  add_predictors(ensemble_africa_cropped) |>
  add_predictors(envars_ready) |>
  add_biodiversity_poipo(data_namibia1_f, field_occurrence = "pres_abs", docheck = F) |>
  engine_xgboost(booster = "gbtree", iter = 9000L, learning_rate = 0.0005, gamma = 3, subsample = 0.85)
gc()

fit_mod1_xgb <- ibis.iSDM::train(mod1_xgb, only_linear = FALSE, verbose = TRUE, optim_hyperparam = T, runname = "xgboost: dat1 poipo + global model + range")
gc()
plot(fit_mod1_xgb, "mean")

I get this issue also when trying to use add_biodiversity_polpo, selecting some random points inside the polygons. I controlled the tutorials and the examples provided to check if there were some errors in my code, but I didn't solve this problem.

Thank you in advance! Davide

Martin-Jung commented 2 months ago

Hi, in response to your title could you please describe what would be a 'valid' output for your case (expectation-wise)? Note that you are fitting a Poisson Process model assumingly with a response output link which ranges in values from >0 to Infinity. Getting very large estimates in a few grid cells usually points towards a poorly calibrated model or strong overfitting.

Check your coefficients looking for specific outliers (effect sizes being several magnitudes larger than others). Try removing them.
Note that the default plotting capacities in the package might also skew your prediction. How it for example look if you transform the output into percentiles or a threshold?
Predicting on the response scale for PPM can be quite hard, try changing the type="link" in the engine.
You could try the clamp=TRUE parameter in train() as well.

Daviser95 commented 2 months ago

Hi @Martin-Jung, thank you for your response. Sorry for the misunderstanding in the title, but I was expecting an output similar to the one reported in the tutorial (https://iiasa.github.io/ibis.iSDM/articles/02_train_simple_model.html). So I thought there was something wrong with my code. I tried with two different datasets, one very large (6.000 presence points) and one smaller (~300 presence points), changing the parameters following your suggestions, but the results are quite similar to the one reported, even when using a threshold. The difference is that now the scale is not from 0 to Inf, but I have three discrete values (e.g., 444066.6, 444332.2, 999752.2).

Martin-Jung commented 2 months ago

The only thing I can think of otherwise is to try and adjust the number of points per grid unit (for example by switching aggregate_observations in train() to FALSE. Otherwise you could consider investigating whether

the high-value grid cells fall in ares with lots of observations. Maybe consider thinning as an option too thin_observations
Check for overly large coefficients. For example from output coefficients(fit) or summary(fit)
Potentially try a covariate scaling (parameter scale) too or a preliminary removal of too colinear variables (See filter_predictors parameter in train()).
Check also partial_density() to see where your observation fall along environmental gradients

Medium to long-term we might also want to implement some options to alter the offset (e.g. area unit scale) specifically for PPMs. This can also have an impact on the response scale.

Daviser95 commented 2 months ago

Thank you Martin, currently I have already thin the observations before running models, and the variables were filtered to remove the collinear ones. Moreover, I found the issue #102 which seems similar to my case, and I tried the steps you proposed here, by removing the most important variable (according to the coefficients), and re-running the model, but nothing changed. I tried also with:

pred <- fit$get_data() |> 
  predictor_transform(option = "norm")

as in #48, but the results keep being similar.

I didn't try the option 'scale' to transform the covariates, but only the option 'norm', I could try scaling them and check the differences.

Finally, I am getting another strange output map when using 'add_predictor_range', which I attached below.

#load range maps
range_report <- sf::st_read("G:/davide/Namibia/range_pres.shp")
atlas <- range_report[range_report$presence == 1, ]

# modeling phase
mod1_xgb <- distribution(background_res) |>
  add_predictors(envars_ready) |>
  add_biodiversity_poipa(fulldat_gbif, field_occurrence = "pres_abs", docheck = F) |>
  add_predictor_range(layer = atlas, distance_max = 3e5) |>
  engine_xgboost(booster = "gbtree", iter = 9000L, learning_rate = 0.0005, gamma = 2, subsample = 0.85)
gc()

fit_mod1_xgb <- ibis.iSDM::train(mod1_xgb, only_linear = FALSE, verbose = TRUE, optim_hyperparam = T, runname = "xgboost: gbif poipa + range predictor + global model")
gc()
plot(fit_mod1_xgb, "mean")

issue

I don't want to annoy you, I just want to make the most of the package's potential. Have you encountered output like this before?

Thank you for your patience! Best, Davide

Martin-Jung commented 2 months ago

Finally, I am getting another strange output map when using 'add_predictor_range', which I attached below.

What is strange about it? SDM Predictions are fundamentally a function of input data and covariates. In your case it could be that all your occurrences fall within the atlas range, or that the maximum distance is too short (with highly non-linear engines like boosted regression trees - XGBoost, you will probably get a non-linear distance decline anyway from the range edge, so could set this to the default Inf. It could also simply be that regardless of other covariates the prior presence of a atlas observation is the best predictor for high suitability and everything is else is secondary/barely noticeable. With PPM in particular you could try and see if you can get an offset (add_offset_range) to work.

I don't want to annoy you, I just want to make the most of the package's potential. Have you encountered output like this before?

Sure, but please keep it to actual bugs that are for example unexpected error messages or outputs. Have I encountered this before, sure. Then I simply tried something else. There is no such thing as a 'correct' SDM, only best practices and a range of ways and choices and assumptions of how models can be set up, each of why have consequences on the outputs.

mhesselbarth commented 2 months ago

@Daviser95

Also, which version of ibis.iSDM are you using? I remember at one point there was some issue with the inlabru engine. When you run the exact same example as on the Train simple model vignette, do you get similar results?

Daviser95 commented 2 months ago

Hi @mhesselbarth, I am using the version 0.1.2. So far I have mainly used engine_xgboost and engine_glmnet, because they are faster, and with these engines and 'poipo' datasets, I get the results attached. I tried the other engines, but always with 'poipa' datasets. Therefore, I could try to check if different engines work differently with 'poipo' datasets.

mhesselbarth commented 2 months ago

Maybe try updating to the newest dev version using remotes::install_github("https://github.com/iiasa/ibis.iSDM", ref = "dev"), but I think your version is recent enough so it's rather about the data/model specifications as explained by Martin

iiasa / ibis.iSDM

Invalid output when using 'add_biodiversity_poipo' #122