schwilklab / skyisland-climate

Climate data and code for Sky Island project

Final parameters for xgboost models and saving model information #41

Closed dschwilk closed 7 years ago

dschwilk commented 7 years ago

There are a few outstanding questions regarding the spatial modeling:

hpoulos commented 7 years ago

We don't necessarily need to save the model results. What I'd like to do is run the rf model, run the xgboost model, and then select whichever fits best and use it with the predict function to predict tmin and tmax. Does that make sense? Caret optimizes the fits of xgboost and rf and selects the optimal model within each algorithm, but there may be cases where the rf model does a better job than the boosted regression tree. We should select the best-fitting model in each case and then use it with the predict function to generate the grids. We also need some way to get an average RMSE/R-squared value for each tmin and tmax grid for each mountain range so that we can report it in the publication as a measure of how good the map-product fits are.

dschwilk commented 7 years ago

OK, so we need to add back in code to run a regular random forest model on each as well. And what values from the xgboost model object should we save? The object contains results from a set of resampled models. See #34. Also, I'm not sure we can really make claims about "better" and "best" based on an RMSE/r^2 comparison like that, can we? At that point we would need to have hold-out test data, right?

dschwilk commented 7 years ago

So that object, m, has m$resample$Rsquared:

$ resample    :'data.frame':    5 obs. of  3 variables:
  ..$ RMSE    : num [1:5] 0.00249 0.00309 0.00156 0.00236 0.0012
  ..$ Rsquared: num [1:5] 0.923 0.957 0.941 0.981 0.982
  ..$ Resample: chr [1:5] "Fold4" "Fold2" "Fold3" "Fold5" ...

Or m$finalModel (but I don't see an R-squared):

 $ finalModel  :List of 10
  ..$ handle     :Class 'xgb.Booster.handle' <externalptr> 
  ..$ raw        : raw [1:33816] 00 00 00 3f ...
  ..$ niter      : num 150
  ..$ call       : language xgb.train(params = list(eta = param$eta, max_depth = param$max_depth, gamma = param$gamma,      colsample_bytree = param$colsample_bytree, min_child_weight = param$min_child_weight,  ...
  ..$ params     :List of 8
  .. ..$ eta             : num 0.4
  .. ..$ max_depth       : int 1
  .. ..$ gamma           : num 0
  .. ..$ colsample_bytree: num 0.8
  .. ..$ min_child_weight: num 1
  .. ..$ subsample       : num 0.722
  .. ..$ objective       : chr "reg:linear"
  .. ..$ silent          : num 1
  ..$ callbacks  :List of 1
  .. ..$ cb.print.evaluation:function (env = parent.frame())  
  .. .. ..- attr(*, "call")= language cb.print.evaluation(period = print_every_n)
  .. .. ..- attr(*, "name")= chr "cb.print.evaluation"
  ..$ xNames     : chr [1:7] "elev" "ldist_ridge" "ldist_valley" "msd" ...
  ..$ problemType: chr "Regression"
  ..$ tuneValue  :'data.frame': 1 obs. of  7 variables:
  .. ..$ nrounds         : num 150
  .. ..$ max_depth       : int 1
  .. ..$ eta             : num 0.4
  .. ..$ gamma           : num 0
  .. ..$ colsample_bytree: num 0.8
  .. ..$ min_child_weight: num 1
  .. ..$ subsample       : num 0.722
  ..$ obsLevels  : logi NA
  ..- attr(*, "class")= chr "xgb.Booster"

hpoulos commented 7 years ago

This would be the thing to save, if we save anything: m$resample$Rsquared

As for assessing model fits, the current caret model uses 10-fold cross-validation, so we can use RMSE or R-squared to assess model fits.
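
A minimal sketch of pulling those cross-validation statistics out of a fitted caret object (assuming m is the train object shown above):

library(caret)
# average the per-fold statistics stored in the resample slot
mean(m$resample$RMSE)       # mean RMSE across CV folds
mean(m$resample$Rsquared)   # mean R-squared across CV folds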

hpoulos commented 7 years ago

The idea is that sometimes, random forest may provide a better fit than the boosted regression tree. In that case, we would want to use an RF model rather than a boosted tree model. Caret does model selection, but only for one algorithm at a time.

dschwilk commented 7 years ago

OK, I can save the m$resample$Rsquared object, which is a list. I can str() it to a file for future reference. I am not implementing this final model comparison, but you are welcome to: 1) I don't understand enough of the model structure to know what to compare (how do I compare a single R-squared to a list?), and 2) I do not see the logic of using R-squared (really pseudo R-squareds, right?) because they behave very poorly in model selection, but I think you understand something here I'm missing. Can you go ahead and implement this? I will add writing the str() version of the m$resample$Rsquared object to a file and then close this.
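
A sketch of dumping that object to a text file for future reference (the file name is just an example):

capture.output(str(m$resample), file = "xgb_resample_str.txt")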

dschwilk commented 7 years ago

Code currently prints the whole xgboost object's $resample slot (see https://github.com/schwilklab/skyisland-climate/blob/master/scripts/predict-spatial.R#L129). So @hpoulos's suggestion three comments above is already handled. But I don't see how to use these values to do the comparisons as suggested above.

@hpoulos : please check this code and confirm it is the correct way to fit these models. A lot of time will be spent on the predictions based on these, so I'd like to make sure it is correct. Also confirm that these model-fitting functions use parameters appropriate for our final production code (number of reps, etc.).

dschwilk commented 7 years ago

And a link to some discussion of these parameters: https://stats.stackexchange.com/questions/171043/how-to-tune-hyperparameters-of-xgboost-trees
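
For reference, an explicit tuning grid for caret's xgbTree method covers these same parameters; a sketch with illustrative values only (the code in this thread uses tuneLength instead):

library(caret)
xgb_grid <- expand.grid(nrounds = c(100, 150, 200),
                        max_depth = c(1, 2, 4),
                        eta = c(0.1, 0.3, 0.4),
                        gamma = 0,
                        colsample_bytree = 0.8,
                        min_child_weight = 1,
                        subsample = c(0.7, 1.0))
# passed to train(..., tuneGrid = xgb_grid) in place of tuneLength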

hpoulos commented 7 years ago

I set the trainControl parameters based on the fact that we are not subsetting our data into training and validation datasets. This means that we are limited to fitting models using cross-validation. I have set the CV to 5 for xgboost for now because it takes a long time to run. The final model should be run with 10-fold CV. I have also centered and rescaled the data based on documentation I found online and on a conversation I had with a data miner here at Wesleyan. So, those training parameters are correct for specifying the xgboost models for each mountain range.
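
A sketch of what those settings look like in caret, assuming a model formula and data frame df like the ones used in the fitting functions later in this thread (10-fold CV is the final target; centering/rescaling goes through the preProcess argument):

library(caret)
ctrl <- trainControl(method = "cv", number = 10, verboseIter = TRUE)
m <- train(formula, data = df, method = "xgbTree", metric = "RMSE",
           tuneLength = 10, preProcess = c("center", "scale"),
           trControl = ctrl)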

dschwilk commented 7 years ago

Great. Does the centering and rescaling affect our predictions, though? Just confirm that predict() as we use it gives us what we expect.
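
One relevant detail: when preProcess is set inside train(), caret stores those transformations in the fitted object and applies them to newdata inside predict(), so the prediction grids should be supplied in raw (untransformed) units. A sketch, assuming m is a train object fit with preProcess and grid_df is a hypothetical data frame of the topographic predictors:

preds <- predict(m, newdata = grid_df)  # preprocessing applied automatically by caret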

hpoulos commented 7 years ago

This is something we talked about a while back. It gives better model fits for sure, and is often used in these types of modeling procedures.

hpoulos commented 7 years ago

As for picking the best model, my concept is derived from something like what you'll find here: https://github.com/enanomapper/RRegrs

One thing I've been wishing we could do from the beginning of the project is to fit a bunch of models and then pick the best fit for each variable (tmin, tmax)/mtn range. I just came across this package, and I've gone through the tutorial, but that's it. We could, theoretically, fit a bunch of different algorithms and then just pick the best-fitting model from a large suite of models.

That said, xgboost is the hottest classification algorithm out there right now. RF also works well, so we could just run those two and then pick the best model for each range based on the following...

From what I can tell, the predict-spatial.R code does not save the RMSE statistics for the PCs for each range for both models. Right now it just makes one set of predictions based on xgboost. Is that correct? If we can get RMSE for each PC for both RF and xgboost, then we can compare the two and select the better model. So, as per our email earlier today, we need to add a couple of lines of code to fit both the rf and the xgboost models and then save both with their RMSE fit statistics. Then we can, either in code or manually, go through and figure out which is the better-fitting model and select it for the subsequent tmin/tmax modeling steps. Make sense?
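
A sketch of that comparison for one PC/range combination, assuming mod and modrf are the fitted xgboost and rf train objects (names are illustrative):

best_rmse <- data.frame(model = c("xgboost", "rf"),
                        RMSE  = c(min(mod$results$RMSE),
                                  min(modrf$results$RMSE)))
best_rmse[which.min(best_rmse$RMSE), ]   # the better-fitting model for this PC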

hpoulos commented 7 years ago

I see where that needs to happen, but the way you've wrapped up the code makes it hard for me to put it in myself.

Helen

dschwilk commented 7 years ago

OK, let me help you understand the code then, since I know the code but do not know how to correctly select a "best" model when fit with different methods unless we test on the same hold-out data set. Or do both methods run comparable resampling for testing, so we can compare some average, e.g. the average of the m$resample$Rsquared vector? But I think the random forest code we have only gives a single value (is that already the mean over the testing subsamples?).

In any case, do you want to call and I can explain the code or do you want me to add more comments?

dschwilk commented 7 years ago

"One thing I've been wishing we could do from the beginning of the project is to fit a bunch of models and then pick the best fit for each variable (tmin, tmax)/mtn range"

Right now we do try to pick the best model for each variable, mtn range, and PC axis, but we rely on the automatic model selection inherent in xgboost. This is exactly what the code is set up to do: cycle through all combos of mtn, variable, and PC axis and fit a model or models. If you have a model selection criterion, then it is easy enough to add more model fits to the fitModelRunDiagnostics() function and just predict with the best one. Right now the function predicts as its last line.
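
A sketch of one possible selection rule inside such a function, assuming the fitboost()/fitRandomForest() helpers posted later in this thread and a hypothetical grid_df of predictors:

pickBetter <- function(m1, m2) {
  # keep whichever model has the lower mean resampled RMSE
  if (mean(m1$resample$RMSE) <= mean(m2$resample$RMSE)) m1 else m2
}
# best  <- pickBetter(fitboost(df, dep.var), fitRandomForest(df, dep.var))
# preds <- predict(best, newdata = grid_df)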

hpoulos commented 7 years ago

I have figured out how to get rf models to generate RMSE statistics. It was just a specification in the train command.

Here are the two revised functions for fitting 1) an xgboost model and 2) an rf model:

library(caret)  # train() and trainControl()
library(doMC)   # optional parallel backend for caret

# Fit a boosted regression tree (xgboost) model
fitboost <- function(df, dep.var) {
  ind.vars <- paste(IND_VAR_NAMES, collapse = " + ")
  formula <- as.formula(paste(dep.var, " ~ ", ind.vars))
  xgmodel <- train(formula, data = df, tuneLength = 10,
                   method = "xgbTree", metric = "RMSE",
                   trControl = trainControl(method = "cv", number = 3,
                                            verboseIter = TRUE))
  return(xgmodel)
}

# Fit the random forest model
fitRandomForest <- function(df, dep.var) {
  ind.vars <- paste(IND_VAR_NAMES, collapse = " + ")
  formula <- as.formula(paste(dep.var, " ~ ", ind.vars))
  model <- train(formula, data = df, tuneLength = 10,
                 method = "rf", metric = "RMSE",
                 trControl = trainControl(method = "cv", number = 3,
                                          verboseIter = TRUE))
  return(model)
}

Here is what we need to compare the two models:

# Compare fits between the rf output and the boosted regression tree.
# This is the code that will compare two different caret model objects
# (mod and modrf are the fitted xgboost and rf train objects).
resamps <- resamples(list(xgboost = mod, RF = modrf))
summary(resamps)
modelDifferences <- diff(resamps)
summary(modelDifferences)
bwplot(modelDifferences, layout = c(2, 1),
       scales = list(x = list(relation = "free")))
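
For reference, mod and modrf above would come from the two fitting functions, e.g. for one hypothetical PC:

mod   <- fitboost(df, "PC1")
modrf <- fitRandomForest(df, "PC1")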

hpoulos commented 7 years ago

I have also determined that we should only center and rescale the data if we are also centering and rescaling all of the grids in the predict function, as we discussed on the phone, so I have taken the pre-processing out of the function.

hpoulos commented 7 years ago

Both of these functions will now run in the script. However, I cannot figure out how to get the code to generate two sets of outputs: one for the rf models and another for the xgboost models, plus a summary file that has all of the RMSE values for both the rf and boosted models for all sites and all PCs.

Then I need the code to compare the two using the resamples() command and print that out somewhere for each of the models that are generated (CM_tmax_PC1_rf, CM_tmax_PC1_boost, etc.) for each mountain range, so we can do some model selection at the end after generating both models for each PC/temp/mtn combination. I tried running the rf model followed by a boost model; the code runs, but the second boost model overwrites all of the output from the rf models as the code is written now.
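
A sketch of the kind of summary table that could be written out for one mountain range / variable / PC combination (identifiers and file name are illustrative; rows from all combinations would be rbind()-ed together before writing):

one_combo <- data.frame(mtn = "CM", var = "tmax", pc = "PC1",
                        model = c("rf", "xgboost"),
                        RMSE  = c(min(modrf$results$RMSE),
                                  min(mod$results$RMSE)))
write.csv(one_combo, "model_fit_summary.csv", row.names = FALSE)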

hpoulos commented 7 years ago

Also, right now in what I pasted above, the CV runs for three folds. The final model should use 10. We might also consider using the doMC package to run this with parallel processing. That is often suggested online because caret takes a long time to run.
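
A sketch of registering the doMC backend so caret's resampling runs in parallel (the core count is an example; allowParallel is TRUE by default in trainControl):

library(doMC)
registerDoMC(cores = 4)
ctrl <- trainControl(method = "cv", number = 10,
                     verboseIter = TRUE, allowParallel = TRUE)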

dschwilk commented 7 years ago

Closing. I assume we are happy with the current models since we have committed to the reconstructions based on them.