Closed LauraMWarmuth closed 6 years ago
Hi Laura,
I think I see what's going wrong, and it is pretty silly, so I should change that for future users. In general, I recommend using the load_occ and load_var functions before modeling with SSDM in order to check occurrences and predictors automatically. Anyway, it should work the way you did it. The issue you have is related to the ensemble modeling step: to build an ensemble of SDMs you need at least two SDMs (seems logical). But you asked the model to produce 1 algorithm (MAXENT) * 1 repetition = 1 SDM per genus, so the package can't compute an ensemble from this single SDM per genus. What you wish to accomplish is to build an SSDM directly from several SDMs without going through the ESDM stage. That was stupid of me, but I did not allow that in the current architecture of SSDM. So you have two choices: either we cheat and force the package to make an ESDM from a single SDM (the average of the SDMs will be the SDM itself), or you increase the repetition value to a minimum of 2. Because you are using pseudo-absences, I would highly recommend the second choice: since your absences are stochastic, doing repetitions is important (see Barbet-Massin et al., 2012).
I'll let you decide the way you want to do it, and I can help you to force the package to build SSDM only with SDM and not ESDM, but I recommend you to just bring the repetition value to 10.
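The arithmetic behind that recommendation can be sketched in a few lines of base R. This is only an illustration of the reasoning above: `sdm_per_genus` is a hypothetical helper, not an SSDM function, and it assumes one SDM is fitted per algorithm and per repetition.

```r
# Hypothetical helper, not part of SSDM: number of SDMs fitted per genus,
# assuming one model per algorithm and per repetition.
sdm_per_genus <- function(algorithms, rep) {
  length(algorithms) * rep
}

n_single <- sdm_per_genus("MAXENT", rep = 1)   # setting from the original post
n_repeat <- sdm_per_genus("MAXENT", rep = 10)  # recommended setting

# An ensemble (ESDM) needs at least two SDMs to combine.
can_ensemble <- c(single = n_single >= 2, repeated = n_repeat >= 2)
print(can_ensemble)
```

With a single algorithm, any `rep` of 2 or more satisfies the two-SDM minimum described above.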
Best,
Sylvain
Hi Sylvain,
I just tried to switch the repetition to 10 and also tried another time with algorithms = 'all' but these also result in the same error. I also used load_occ and load_var.
env <- load_var(path = here(), files = NULL, format = c(".grd", ".gri"), categorical = NULL, Norm = TRUE, tmp = TRUE, verbose = TRUE, GUI = FALSE)
occ <- load_occ(path = here(), env, file = "coral.occurrence.IAA.csv", sep = ',', Xcol = 'decimalLongitude', Ycol = 'decimalLatitude', Spcol = 'genus', GeoRes = TRUE, verbose = TRUE, GUI = FALSE)
stack_modelling('all', occ, env, Xcol = "decimalLongitude", Ycol = "decimalLatitude", Pcol = NULL, Spcol = "genus", rep = 10, save = TRUE, path = "C:/Users/Laura/Desktop/MSc Dissertation/SSDM", PA = NULL, cv = "holdout", cv.param = c(0.6, 10), axes.metric = "Pearson", uncertainty = TRUE, tmp = FALSE,
ensemble.metric = c("prop.correct"), ensemble.thresh = 0, weight = TRUE, method = "pSSDM", metric = "SES", range = NULL, endemism = NULL, verbose = TRUE, GUI = FALSE, cores = 1)
Maybe there is something else going on? How would I include the Ensemble modelling stage into my code to lead to my genus richness maps?
Thanks, Laura
The ensemble modelling stage is included in the stack_modelling function.
Could you give me the error output?
That's what I get in my R window:
stack_modelling('all', occ, env, Xcol = "decimalLongitude", Ycol = "decimalLatitude", Pcol = NULL, Spcol = "genus", rep = 10, save = TRUE, path = "C:/Users/Laura/Desktop/MSc Dissertation/SSDM", PA = NULL, cv = "holdout", cv.param = c(0.6, 10), axes.metric = "Pearson", uncertainty = TRUE, tmp = FALSE, ensemble.metric = c("prop.correct"), ensemble.thresh = 0, weight = TRUE, method = "pSSDM", metric = "SES", range = NULL, endemism = NULL, verbose = TRUE, GUI = FALSE, cores = 1)
Ensemble models creation
Opening clusters, 1 cores
Exporting environment to clusters
Closing clusters
Error in stack_modelling("all", occ, env, Xcol = "decimalLongitude", Ycol = "decimalLatitude", : You have less than two remaining specie ensemble models, maybe you should try an easier thresholding ?
Thanks, Laura
OK, and could you give me a summary of your occurrences and environmental variables?
You mean a summary like this?
summary(occ)
 decimalLongitude decimalLatitude
 Min.   :-179.0   Min.   :-23.500
 1st Qu.: 123.0   1st Qu.:-20.412
 Median : 145.6   Median :-16.758
 Mean   : 123.1   Mean   :-13.352
 3rd Qu.: 152.1   3rd Qu.: -9.607
 Max.   : 180.0   Max.   : 23.000
summary(env)
        env.variables.IAA.1 env.variables.IAA.2
Min.           4.559512e-01        4.559512e-01
1st Qu.        7.849809e-01        7.849809e-01
Median         8.347409e-01        8.347409e-01
3rd Qu.        8.691825e-01        8.691825e-01
Max.           9.973268e-01        9.973268e-01
NA's           5.843940e+05        5.843940e+05
Warning message:
In .local(object, ...) : summary is an estimate based on a sample of 1e+05 cells (4.12% of all cells)
You don't have a "genus" column in the occ data frame?
So I think this is simply the issue. The error:
You have less than two remaining specie ensemble models, maybe you should try an easier thresholding ?
means that you have only one ESDM remaining to build the SSDM. So either the thresholding is too strict, but you lowered it to 0, or you have only one genus modeled, which would explain everything.
You did not use this code:
occ <- load_occ(path = here(), env, file = "coral.occurrence.IAA.csv",
sep = ',', Xcol = 'decimalLongitude', Ycol = 'decimalLatitude',
Spcol = 'genus', GeoRes = TRUE, verbose = TRUE, GUI = FALSE)
Because in that case you should have either a genus column or an error saying it doesn't exist. Or maybe you did something to the data frame afterwards.
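A quick way to confirm this from the R console is to look at the column names directly. The sketch below uses a toy data frame as a stand-in for the real occurrence table (its values are made up); with real data, `occ` would be the object returned by load_occ.

```r
# Toy stand-in for the occurrence table: coordinates only, no genus column.
occ <- data.frame(
  decimalLongitude = c(123.0, 145.6, 152.1),
  decimalLatitude  = c(-20.4, -16.8, -9.6)
)

# Is the column named in Spcol actually present?
has_genus <- "genus" %in% names(occ)
print(has_genus)

# Only count genera when the column exists; a stack needs at least two.
n_genera <- if (has_genus) length(unique(occ$genus)) else 0L
if (n_genera < 2) {
  message("Fewer than two genera available: a stack cannot be built.")
}
```

If `has_genus` is FALSE, or `n_genera` is below 2, the "less than two remaining" error is what one would expect regardless of the threshold.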
Dear Sylvain,
thanks a lot for your quick reply! Yes, that was stupid of me! Of course the data needs a genus column! I had used only the coordinate data frame that I used for dismo maxent modelling before. Interestingly, I didn't think of it because no error occurred when I ran the following load_occ code in RStudio:
occ <- load_occ(path = here(), env.var.coral.triangle.SSDM, file = "coral.occurrence.IAA.csv",
sep = ',', Xcol = 'decimalLongitude', Ycol = 'decimalLatitude',
Spcol = 'genus', GeoRes = TRUE, verbose = TRUE, GUI = FALSE)
Occurrences loading
Warning message:
In load_occ(path = here(), env.var.coral.triangle.SSDM, file = "coral.occurrence.IAA.csv", : You have occurrences that aren't in the extent of your environmental variables, they will be automatically removed !
The warning was expected: I have many data points on land, which I want removed because I am looking at corals.
Now I don't get the threshold error anymore, and the code is running with the following info in the command window in RStudio:
stack_modelling('MAXENT', occurrence.coral.triangle.SSDM, env.var.coral.triangle.SSDM, Xcol = "decimalLongitude", Ycol = "decimalLatitude", Pcol = NULL, Spcol = "genus", rep = 2, name = NULL, save = TRUE, path = "C:/Users/Laura/Desktop/MSc Dissertation/SSDM", PA = NULL, cv = "k-fold", cv.param = c(5, 2), axes.metric = "Pearson", uncertainty = TRUE, tmp = FALSE, ensemble.metric = c("AUC"), ensemble.thresh = c(0.8), weight = TRUE, method = "pSSDM", metric = "SES", range = NULL, endemism = NULL, verbose = TRUE, GUI = FALSE, cores = 2)
#### Ensemble models creation #####
Opening clusters, 2 cores
Exporting environment to clusters
I have 1580 observations of 146 genera. How long would you expect the code to run? I left it for about 3h and it was still working.
Thanks, Laura
OK. First, even though it was a user mistake, it's still good that you noticed it, because the package should have warned you that the genus column was missing in the load_occ function.
The load_occ function calls the internal function .checkargs to check user arguments and avoid this issue:
https://github.com/sylvainschmitt/SSDM/blob/8cc42286b7a8fc081c3450ac44c80f878a0d6661/R/load_occ.R#L45
And indeed, the only thing the .checkargs function checks is that the Spcol parameter is either NULL or a character; it does not actually check that Spcol is a column of the data frame:
https://github.com/sylvainschmitt/SSDM/blob/8cc42286b7a8fc081c3450ac44c80f878a0d6661/R/checkargs.R#L56-L57
So I'll add to my to-do list an argument check for column existence in the load_occ function.
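For illustration only, a column-existence check of the kind described could look like the sketch below. The function name `check_columns` and its error message are hypothetical and are not taken from the SSDM source; only the argument names Xcol, Ycol and Spcol come from the thread.

```r
# Hypothetical validator; name and message are illustrative, not SSDM code.
check_columns <- function(data, Xcol, Ycol, Spcol = NULL) {
  wanted <- c(Xcol, Ycol, Spcol)  # a NULL Spcol simply drops out of c()
  missing_cols <- setdiff(wanted, names(data))
  if (length(missing_cols) > 0) {
    stop("Column(s) not found in occurrences: ",
         paste(missing_cols, collapse = ", "), call. = FALSE)
  }
  invisible(TRUE)
}

# A coordinates-only table, like the one that slipped through in the thread.
coords_only <- data.frame(decimalLongitude = 123, decimalLatitude = -20)
result <- tryCatch(
  check_columns(coords_only, "decimalLongitude", "decimalLatitude", "genus"),
  error = function(e) conditionMessage(e)
)
print(result)
```

Failing early with the name of the missing column would point to the cause directly, instead of surfacing later as the thresholding error.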
Regarding the computation time, what matters is your computer's performance and the algorithm used, MAXENT being the longest. It's pretty much impossible for me to predict computation time with so little information, but we included time stamps between species and steps, so that the user can estimate how long the package takes to model one species. So you can roughly multiply that number by 146.
Another trick to check progress is to look at the temporary files, but you did not use them here (tmp = FALSE; I recommend setting it to TRUE).
Last thing: for large computations you should consider using a High Performance Computer (HPC) or cluster if your lab has access to one.
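The rough multiplication suggested above can be written out as follows. The per-genus duration is an assumed placeholder: read the real value off the time stamps in your own log. Only the count of 146 genera comes from the thread.

```r
# Assumed placeholder: minutes between two consecutive species time stamps.
minutes_per_genus <- 4
# Number of genera reported in the thread.
n_genera <- 146

# Sequential estimate; it ignores any speed-up from using several cores.
total_hours <- minutes_per_genus * n_genera / 60
cat(sprintf("Rough estimate: %.1f hours\n", total_hours))
```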
Yes, I thought that would be a helpful comment for the package. Thank you for taking the time to reply!
I switched the tmp argument to TRUE and got the original error back: " Error in stack_modelling("MAXENT", occurrence.coral.triangle.SSDM, env.var.coral.triangle.SSDM, : You have less than two remaining specie ensemble models, maybe you should try an easier thresholding ? "
My code looks like this:
env.var.coral.triangle.SSDM <- load_var(path = here(), files = NULL, format = c(".grd", ".gri"), categorical = NULL, Norm = TRUE, tmp = TRUE, verbose = TRUE, GUI = FALSE)
occurrence.coral.triangle.SSDM <- load_occ(path = here(), env.var.coral.triangle.SSDM, file = "coral.triangle.genera.coord.new.worms.csv", sep = ',', Xcol = 'decimalLongitude', Ycol = 'decimalLatitude', Spcol = 'genus', GeoRes = TRUE, verbose = TRUE, GUI = FALSE)
stack_modelling('MAXENT', occurrence.coral.triangle.SSDM, env.var.coral.triangle.SSDM, Xcol = "decimalLongitude", Ycol = "decimalLatitude", Pcol = NULL, Spcol = "genus", rep = 10, name = NULL, save = TRUE, path = "C:/Users/Laura/Desktop/MSc Dissertation/SSDM", PA = NULL, cv = "k-fold", cv.param = c(5, 10), axes.metric = "Pearson", uncertainty = TRUE, tmp = TRUE, ensemble.metric = c("AUC"), ensemble.thresh = c(0), weight = TRUE, method = "pSSDM", metric = "SES", range = NULL, endemism = NULL, verbose = TRUE, GUI = FALSE, cores = 2)
The error remains, even after I set the threshold to 0 and increased repetitions to 10. This is my new occurrence dataset: coral.triangle.genera.coord.new.worms.xlsx
I just had a quick test in the GUI, and it doesn't accept my raster, even though it is certainly a .grd file:
Raster in R studio:
class(env.variables.BO2.coral.triangle)
[1] "RasterBrick"
attr(,"package")
[1] "raster"
library(raster)
writeRaster(env.variables.BO2.coral.triangle, file = "env.variables.BO2.coral.triangle.new.grd", bylayer = FALSE)
gui()
Listening on http://127.0.0.1:5605
Error in .checkargs(path = path, files = files, format = format, categorical = categorical, : format parameter should be .grd, .tif, .asc, .sdat, .rst, .nc, .tif, .envi, .bil or .img
In the GUI I always get the message: " Environmental variables loading failed, please check your inputs and try again "
with either .grd or .tif files for my environmental parameters.
For my .csv file for the occurrence table I don't get an error but it does not load anything. I followed exactly your steps from the tutorial.
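One thing that can be checked without the package is whether every requested extension appears in the list printed by the .checkargs error message. The sketch below only compares the format vector from the load_var call earlier in the thread with that list. It is not a confirmed diagnosis of the GUI failure, and it assumes the error message lists every accepted format.

```r
# Extensions listed in the .checkargs error message quoted in the thread.
allowed <- c(".grd", ".tif", ".asc", ".sdat", ".rst", ".nc",
             ".envi", ".bil", ".img")

# The format vector passed to load_var() in the script earlier in the thread.
requested <- c(".grd", ".gri")

# Any extension that the error message does not list.
not_listed <- setdiff(requested, allowed)
print(not_listed)
```

If an extension shows up in `not_listed`, trying the call with only listed formats is a cheap experiment before digging further.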
Hi Sylvain,
as mentioned, I am using your package SSDM to predict coral species richness. For now, I am only using maxent, and I have an occurrence dataset from GBIF and an environmental RasterStack from Bio-ORACLE.
I tried different settings and even when I set the AUC scores or prop.correct to zero I get the following error:
"You have less than two remaining specie ensemble models, maybe you should try an easier thresholding ?"
My code is:
stack_modelling('MAXENT', coral_occurrence_IAA, env.variables.IAA, Xcol = "decimalLongitude", Ycol = "decimalLatitude", Pcol = NULL, Spcol = "genus", rep = 1, save = TRUE, path = "C:/Users/Laura/Desktop/MSc Dissertation/SSDM", PA = NULL, cv = "holdout", cv.param = c(0.6, 10), axes.metric = "Pearson", uncertainty = TRUE, tmp = FALSE, ensemble.metric = c("prop.correct"), ensemble.thresh = 0, weight = TRUE, method = "pSSDM", metric = "SES", range = NULL, endemism = NULL, verbose = TRUE, GUI = FALSE, cores = 1)
I attached my occurrence dataset. coral.occurrence.IAA.xlsx
I, unfortunately, can't upload my RasterStack file here. So here is the code how to retrieve the used RasterStack:
library(sdmpredictors)
library(raster)  # provides extent() and crop()
library(readxl)
# Download the Bio-ORACLE marine layers used as predictors
datasets <- list_datasets(terrestrial = FALSE, marine = TRUE)
layers <- list_layers(datasets)
env.variables <- load_layers(c("BO_sstmean", "BO_salinity", "BO_dissox", "BO_bathymean"), datadir = "/Bio-ORACLE")
# Read the occurrences and crop the predictors to their bounding box
coral_occurrence_IAA <- read_excel("coral.occurrence.IAA.xlsx")
View(coral_occurrence_IAA)
max.lat <- ceiling(max(coral_occurrence_IAA$decimalLatitude))
min.lat <- floor(min(coral_occurrence_IAA$decimalLatitude))
max.lon <- ceiling(max(coral_occurrence_IAA$decimalLongitude))
min.lon <- floor(min(coral_occurrence_IAA$decimalLongitude))
geographic.extent <- extent(x = c(min.lon, max.lon, min.lat, max.lat))
env.variables.IAA <- crop(x = env.variables, y = geographic.extent)
Thanks already in advance!
Best wishes, Laura