`load_occ` doesn't alert that `Spcol` does not exist in occurrences data frame

LauraMWarmuth commented 6 years ago

Hi Sylvain,

as mentioned, I am using your package SSDM to predict coral species richness. For now, I am only using maxent, and I have an occurrence dataset from GBIF and an environmental RasterStack from Bio-ORACLE.

I tried different settings, and even when I set the AUC or prop.correct threshold to zero I get the following error:

"You have less than two remaining specie ensemble models, maybe you should try an easier thresholding ?"

My code is:

stack_modelling('MAXENT', coral_occurrence_IAA, env.variables.IAA,
                Xcol = "decimalLongitude", Ycol = "decimalLatitude", Pcol = NULL, Spcol = "genus",
                rep = 1, save = TRUE, path = "C:/Users/Laura/Desktop/MSc Dissertation/SSDM",
                PA = NULL, cv = "holdout", cv.param = c(0.6, 10), axes.metric = "Pearson",
                uncertainty = TRUE, tmp = FALSE,
                ensemble.metric = c("prop.correct"), ensemble.thresh = 0, weight = TRUE,
                method = "pSSDM", metric = "SES", range = NULL, endemism = NULL,
                verbose = TRUE, GUI = FALSE, cores = 1)

I attached my occurrence dataset: coral.occurrence.IAA.xlsx

Unfortunately, I can't upload my RasterStack file here, so here is the code to retrieve the RasterStack I used:

library(sdmpredictors)
datasets <- list_datasets(terrestrial = FALSE, marine = TRUE)
layers <- list_layers(datasets)
env.variables <- load_layers(c("BO_sstmean", "BO_salinity", "BO_dissox", "BO_bathymean"), datadir = "/Bio-ORACLE")

library(readxl)
library(raster)  # extent() and crop() come from the raster package
coral_occurrence_IAA <- read_excel("coral.occurrence.IAA.xlsx")
View(coral_occurrence_IAA)
attach(coral_occurrence_IAA)
max.lat <- ceiling(max(decimalLatitude))
min.lat <- floor(min(decimalLatitude))
max.lon <- ceiling(max(decimalLongitude))
min.lon <- floor(min(decimalLongitude))
geographic.extent <- extent(x = c(min.lon, max.lon, min.lat, max.lat))
env.variables.IAA <- crop(x = env.variables, y = geographic.extent)

Thanks already in advance!

Best wishes, Laura

sylvainschmitt commented 6 years ago

Hi Laura,

I think I see what's going wrong, and it is pretty silly, so I should change that for future users. In general, I recommend using the load_occ and load_var functions before modelling with SSDM, so that occurrences and predictors are checked automatically. That said, it should work the way you did it.

The issue is related to the ensemble modelling step: to build an ensemble of SDMs you need at least two SDMs (which seems logical). But you asked for 1 algorithm (MAXENT) * 1 repetition = 1 SDM per genus, so the package can't compute an ensemble from this single SDM per genus. What you want to do is build an SSDM directly from several SDMs without going through the ESDM stage, and, stupidly, I did not allow that in the current architecture of SSDM. So you have two choices: either we cheat and force the package to make an ESDM from a single SDM (the average of the SDMs will just be that SDM), or you increase the repetition value to a minimum of 2. Because you are using pseudo-absences, I would highly recommend the second choice: since your absences are stochastic, doing repetitions is important (see Barbet-Massin et al., 2012).

I'll let you decide which way you want to go, and I can help you force the package to build the SSDM from SDMs only rather than ESDMs, but I recommend simply raising the repetition value to 10.
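
As a rough illustration of the arithmetic above (a sketch only: `sdm_per_taxon` and `enough_for_ensemble` are hypothetical helpers, not SSDM functions), the number of SDMs available for each genus's ensemble is the number of algorithms times the number of repetitions, and per the explanation above it needs to be at least two:

```r
# Hypothetical helper (not part of SSDM): number of SDMs built per taxon,
# assuming one SDM per algorithm per repetition as described above.
sdm_per_taxon <- function(algorithms, rep) {
  length(algorithms) * rep
}

# Per the explanation above, an ensemble needs at least two SDMs.
enough_for_ensemble <- function(algorithms, rep) {
  sdm_per_taxon(algorithms, rep) >= 2
}

enough_for_ensemble("MAXENT", rep = 1)   # FALSE: 1 algorithm * 1 repetition
enough_for_ensemble("MAXENT", rep = 10)  # TRUE: 10 SDMs per genus
```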

Best,

Sylvain

LauraMWarmuth commented 6 years ago

Hi Sylvain,

I just tried to switch the repetition to 10 and also tried another time with algorithms = 'all' but these also result in the same error. I also used load_occ and load_var.

env <- load_var(path = here(), files = NULL, format = c(".grd", ".gri"), categorical = NULL, Norm = TRUE, tmp = TRUE, verbose = TRUE, GUI = FALSE)
occ <- load_occ(path = here(), env, file = "coral.occurrence.IAA.csv", sep = ',', Xcol = 'decimalLongitude', Ycol = 'decimalLatitude', Spcol = 'genus', GeoRes = TRUE, verbose = TRUE, GUI = FALSE)

stack_modelling('all', occ, env, Xcol = "decimalLongitude", Ycol = "decimalLatitude", Pcol = NULL, Spcol = "genus", rep = 10, save = TRUE, path = "C:/Users/Laura/Desktop/MSc Dissertation/SSDM", PA = NULL, cv = "holdout", cv.param = c(0.6, 10), axes.metric = "Pearson", uncertainty = TRUE, tmp = FALSE,
                ensemble.metric = c("prop.correct"), ensemble.thresh = 0, weight = TRUE, method = "pSSDM", metric = "SES", range = NULL, endemism = NULL, verbose = TRUE, GUI = FALSE, cores = 1)

Maybe there is something else going on? How would I include the Ensemble modelling stage into my code to lead to my genus richness maps?

Thanks, Laura

sylvainschmitt commented 6 years ago

The ensemble modelling stage is included in the stack_modelling function.

sylvainschmitt commented 6 years ago

Could you give me the error output?

LauraMWarmuth commented 6 years ago

That's what I get in my R window:

stack_modelling('all', occ, env, Xcol = "decimalLongitude", Ycol = "decimalLatitude", Pcol = NULL, Spcol = "genus", rep = 10, save = TRUE, path = "C:/Users/Laura/Desktop/MSc Dissertation/SSDM", PA = NULL, cv = "holdout", cv.param = c(0.6, 10), axes.metric = "Pearson", uncertainty = TRUE, tmp = FALSE,
                ensemble.metric = c("prop.correct"), ensemble.thresh = 0, weight = TRUE, method = "pSSDM", metric = "SES", range = NULL, endemism = NULL, verbose = TRUE, GUI = FALSE, cores = 1)
Ensemble models creation

Opening clusters, 1 cores
Exporting environment to clusters
Closing clusters
Error in stack_modelling("all", occ, env, Xcol = "decimalLongitude", Ycol = "decimalLatitude", :
  You have less than two remaining specie ensemble models, maybe you should try an easier thresholding ?

Thanks, Laura

sylvainschmitt commented 6 years ago

OK, and could you give me a summary of your occurrences and environmental variables?

LauraMWarmuth commented 6 years ago

You mean a summary like this?

summary(occ)
 decimalLongitude  decimalLatitude
 Min.   :-179.0    Min.   :-23.500
 1st Qu.: 123.0    1st Qu.:-20.412
 Median : 145.6    Median :-16.758
 Mean   : 123.1    Mean   :-13.352
 3rd Qu.: 152.1    3rd Qu.: -9.607
 Max.   : 180.0    Max.   : 23.000

summary(env)
        env.variables.IAA.1 env.variables.IAA.2
Min.           4.559512e-01        4.559512e-01
1st Qu.        7.849809e-01        7.849809e-01
Median         8.347409e-01        8.347409e-01
3rd Qu.        8.691825e-01        8.691825e-01
Max.           9.973268e-01        9.973268e-01
NA's           5.843940e+05        5.843940e+05
Warning message:
In .local(object, ...) :
  summary is an estimate based on a sample of 1e+05 cells (4.12% of all cells)

sylvainschmitt commented 6 years ago

You don't have any "genus" column in the occ data frame?

sylvainschmitt commented 6 years ago

So I think this is simply the issue. The error:

You have less than two remaining specie ensemble models, maybe you should try an easier thresholding ?

means that you have only one ESDM remaining to build the SSDM. So either the thresholding is too strict (but you lowered it to 0), or you only have one genus modelled, which would explain everything.
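
A quick way to see how many genera would actually reach the stacking step is to count the distinct values in the species column before calling stack_modelling. The sketch below uses made-up data and a hypothetical helper, `count_taxa`; neither is part of SSDM.

```r
# Hypothetical helper (not part of SSDM): number of distinct taxa in `spcol`,
# or 0 when the column is missing from the data frame.
count_taxa <- function(occ, spcol) {
  if (!spcol %in% names(occ)) {
    return(0L)
  }
  length(unique(stats::na.omit(occ[[spcol]])))
}

# Made-up occurrences with coordinates only, as in the summary(occ) above
coords_only <- data.frame(
  decimalLongitude = c(123.0, 145.6, 152.1),
  decimalLatitude  = c(-20.4, -16.8, -9.6)
)
count_taxa(coords_only, "genus")  # 0: no genus column, so nothing to stack

# Same points with a made-up genus column
with_genus <- cbind(coords_only, genus = c("Acropora", "Porites", "Acropora"))
count_taxa(with_genus, "genus")   # 2
```

A result below two here would be consistent with the "less than two remaining" error, whatever threshold is used.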

sylvainschmitt commented 6 years ago

You did not use this code:

occ <- load_occ(path = here(), env, file = "coral.occurrence.IAA.csv", 
                           sep = ',', Xcol = 'decimalLongitude', Ycol = 'decimalLatitude', 
                           Spcol = 'genus', GeoRes = TRUE, verbose = TRUE, GUI = FALSE)

Because in that case you should either have a genus column or get an error saying it doesn't exist. Or maybe you did something to the data frame afterwards.

LauraMWarmuth commented 6 years ago

Dear Sylvain,

thanks a lot for your quick reply! Yes, that was stupid of me! The data of course needs a genus column! I only used the coordinate data frame which I had used for dismo maxent modelling before. Interestingly, I didn't think of it because no error occurred when I ran the following load_occ code in RStudio:

occ <- load_occ(path = here(), env.var.coral.triangle.SSDM, file = "coral.occurrence.IAA.csv",
                sep = ',', Xcol = 'decimalLongitude', Ycol = 'decimalLatitude',
                Spcol = 'genus', GeoRes = TRUE, verbose = TRUE, GUI = FALSE)
Occurrences loading
Warning message:
In load_occ(path = here(), env.var.coral.triangle.SSDM, file = "coral.occurrence.IAA.csv", :
  You have occurrences that aren't in the extent of your environmental variables, they will be automatically removed !

The warning was expected since I have many data points on land which I want to be removed since I am looking at corals.

Now I don't get the threshold error anymore, and the code is running with the following info in the command window in RStudio:

stack_modelling('MAXENT', occurrence.coral.triangle.SSDM, env.var.coral.triangle.SSDM,
                Xcol = "decimalLongitude", Ycol = "decimalLatitude", Pcol = NULL, Spcol = "genus",
                rep = 2, name = NULL, save = TRUE, path = "C:/Users/Laura/Desktop/MSc Dissertation/SSDM",
                PA = NULL, cv = "k-fold", cv.param = c(5, 2), axes.metric = "Pearson",
                uncertainty = TRUE, tmp = FALSE,
                ensemble.metric = c("AUC"), ensemble.thresh = c(0.8), weight = TRUE,
                method = "pSSDM", metric = "SES", range = NULL, endemism = NULL,
                verbose = TRUE, GUI = FALSE, cores = 2)

#### Ensemble models creation #####

Opening clusters, 2 cores
Exporting environment to clusters

I have 1580 observations of 146 genera. How long would you expect the code to run? I left it for about 3h and it was still working.

Thanks, Laura

sylvainschmitt commented 6 years ago

OK, first: even though it was a user mistake, it's still good that you noticed it, because the load_occ function should have warned you that the genus column was missing.

The load_occ function calls the internal function .checkargs to check user arguments and avoid this issue: https://github.com/sylvainschmitt/SSDM/blob/8cc42286b7a8fc081c3450ac44c80f878a0d6661/R/load_occ.R#L45

And indeed, the only thing the .checkargs function checks is that the Spcol parameter is either NULL or a character; it does not actually check that Spcol is in the data frame: https://github.com/sylvainschmitt/SSDM/blob/8cc42286b7a8fc081c3450ac44c80f878a0d6661/R/checkargs.R#L56-L57

So I'll add an argument check for column existence in the load_occ function to my to-do list.
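
A minimal sketch of what such a column-existence check could look like is shown below. It is only an illustration under assumptions: `check_columns` is a hypothetical helper, not the actual .checkargs code or the eventual fix in SSDM.

```r
# Hypothetical helper (not SSDM code): stop with a clear message when any
# requested column name is absent from the occurrence data frame.
check_columns <- function(occ, cols) {
  missing_cols <- setdiff(cols, names(occ))
  if (length(missing_cols) > 0) {
    stop("Column(s) not found in occurrences data frame: ",
         paste(missing_cols, collapse = ", "), call. = FALSE)
  }
  invisible(TRUE)
}

# Made-up occurrence table without a genus column
occ_toy <- data.frame(decimalLongitude = 145.6, decimalLatitude = -16.8)

# c() drops NULL entries, so an optional argument left as NULL is skipped
Pcol <- NULL
requested <- c("decimalLongitude", "decimalLatitude", Pcol, "genus")

result <- tryCatch(
  check_columns(occ_toy, requested),
  error = function(e) conditionMessage(e)
)
result  # "Column(s) not found in occurrences data frame: genus"
```

Failing early like this would surface the missing column at loading time rather than later as a thresholding error.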

sylvainschmitt commented 6 years ago

Regarding the computation time, what matters is your computer's performance and:

- the algorithm; here MAXENT, the slowest
- the number of genera and repetitions; here 146 genera and 2 repetitions, resulting in 292 SDMs, 146 ESDMs and 1 SSDM to compute
- the number of observations, which plays a minor role in computation time; here you have ca. 10 observations per genus (1580/146), which is pretty low and should not influence computation time (but will decrease model accuracy)
- the number of cores you use; here 2 (but that does not mean computation time is halved)

It's nearly impossible for me to predict the computation time with so little information, but we included time stamps between species and steps so that the user can estimate how long the package takes to model one species. You can then roughly multiply that number by 146.
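
The counts and the rough estimate above can be worked through as follows. This is only back-of-the-envelope arithmetic: `minutes_per_genus` is a made-up placeholder to be replaced by the value read from the time stamps, and the estimate ignores any speed-up from using several cores.

```r
n_genera <- 146   # genera in the occurrence data
n_obs    <- 1580  # occurrence records
n_algo   <- 1     # MAXENT only
n_rep    <- 2     # repetitions

n_sdm  <- n_genera * n_algo * n_rep  # 292 individual SDMs
n_esdm <- n_genera                   # 146 ensemble models, one per genus
n_ssdm <- 1                          # one stacked model

obs_per_genus <- n_obs / n_genera    # about 10.8 records per genus

# Placeholder value: time for one genus, as read from the console time stamps
minutes_per_genus <- 5
est_hours <- minutes_per_genus * n_genera / 60  # about 12.2 hours at 5 min per genus
```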

Another trick to check progress is to look at the temporary files, but you did not use them here (tmp = FALSE; I recommend setting it to TRUE).

Last thing: for large computations, you should consider using a high-performance computing (HPC) cluster if your lab has access to one.

LauraMWarmuth commented 6 years ago

Yes, I thought that would be a helpful comment for the package. Thank you for taking the time to reply!

I switched the tmp argument to TRUE and got back to the original error:

Error in stack_modelling("MAXENT", occurrence.coral.triangle.SSDM, env.var.coral.triangle.SSDM, :
  You have less than two remaining specie ensemble models, maybe you should try an easier thresholding ?

My code looks like this:

env.var.coral.triangle.SSDM <- load_var(path = here(), files = NULL, format = c(".grd", ".gri"),
                                        categorical = NULL, Norm = TRUE, tmp = TRUE,
                                        verbose = TRUE, GUI = FALSE)

occurrence.coral.triangle.SSDM <- load_occ(path = here(), env.var.coral.triangle.SSDM,
                                           file = "coral.triangle.genera.coord.new.worms.csv", sep = ',',
                                           Xcol = 'decimalLongitude', Ycol = 'decimalLatitude',
                                           Spcol = 'genus', GeoRes = TRUE, verbose = TRUE, GUI = FALSE)

stack_modelling('MAXENT', occurrence.coral.triangle.SSDM, env.var.coral.triangle.SSDM,
                Xcol = "decimalLongitude", Ycol = "decimalLatitude", Pcol = NULL, Spcol = "genus",
                rep = 10, name = NULL, save = TRUE, path = "C:/Users/Laura/Desktop/MSc Dissertation/SSDM",
                PA = NULL, cv = "k-fold", cv.param = c(5, 10), axes.metric = "Pearson",
                uncertainty = TRUE, tmp = TRUE,
                ensemble.metric = c("AUC"), ensemble.thresh = c(0), weight = TRUE,
                method = "pSSDM", metric = "SES", range = NULL, endemism = NULL,
                verbose = TRUE, GUI = FALSE, cores = 2)

The error remains, even after I set the threshold to 0 and increased repetitions to 10. This is my new occurrence dataset: coral.triangle.genera.coord.new.worms.xlsx

LauraMWarmuth commented 6 years ago

I just ran a quick test in the GUI, and it doesn't accept my raster, even though it is certainly a .grd file:

Raster in RStudio:

class(env.variables.BO2.coral.triangle)
[1] "RasterBrick"
attr(,"package")
[1] "raster"
library(raster)
writeRaster(env.variables.BO2.coral.triangle, file = "env.variables.BO2.coral.triangle.new.grd", bylayer = FALSE)
gui()

Listening on http://127.0.0.1:5605
Error in .checkargs(path = path, files = files, format = format, categorical = categorical, :
  format parameter should be .grd, .tif, .asc, .sdat, .rst, .nc, .tif, .envi, .bil or .img

LauraMWarmuth commented 6 years ago

In the GUI I always get the message: " Environmental variables loading failed, please check your inputs and try again "

with either .grd or .tif files for my environmental parameters.

For the .csv file with my occurrence table I don't get an error, but it does not load anything. I followed the steps from your tutorial exactly.

sylvainschmitt / SSDM

`load_occ` doesn't alert that `Spcol` does not exist in occurrences data frame #34