schwilklab / skyisland-climate

Climate data and code for Sky Island project

Notes on spatial and temporal temperature modeling #34

Closed dschwilk closed 7 years ago

dschwilk commented 8 years ago

Overall notes:

I am using the R terminology of "scores" and "loadings".

Right now the code does not deal well with poor models. If no model is a good fit, then we should use the null model for prediction (e.g., reconstruct scores for all dates as the mean axis score and loadings for all locations as the mean axis loading). We need to change this.
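Something along these lines is what I mean by the null fallback (a rough sketch only; null_predict, scores, and loadings are placeholder names, not objects in the current code):

# Rough sketch of a null-model fallback; not the actual project code.
# 'scores' (dates x axes) and 'loadings' (locations x axes) are placeholder
# data frames; 'axis' is an axis name such as "PC1".
null_predict <- function(scores, loadings, axis) {
  list(scores   = rep(mean(scores[[axis]], na.rm = TRUE), nrow(scores)),
       loadings = rep(mean(loadings[[axis]], na.rm = TRUE), nrow(loadings)))
}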

I have somewhat arbitrarily chosen the following topo variables:

c("elev","ldist_ridge" , "ldist_valley",  "msd", "radiation","relev_l", "slope")

zdist_ridge and zdist_valley were too tightly correlated with elevation to be useful.
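A quick way to check that kind of collinearity when choosing the predictor set (a sketch; topo is a placeholder data frame holding the candidate topographic variables):

# Sketch: pairwise correlations among candidate topo predictors.
# 'topo' is a placeholder data frame with the variables listed above plus
# zdist_ridge and zdist_valley.
vars <- c("elev", "zdist_ridge", "zdist_valley", "ldist_ridge", "ldist_valley",
          "msd", "radiation", "relev_l", "slope")
round(cor(topo[, vars], use = "pairwise.complete.obs"), 2)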

Notes on spatial models (random forest)

hpoulos commented 7 years ago

I just ran all of the scripts on skyisland-climate and there are a few errors.

In predict-spatial.R and a couple of other scripts (microclimate-topo-PCA.R), I get the following error message in RStudio:

Error in if (time > data.time) { : missing value where TRUE/FALSE needed

It happens especially here in microclimate-topo-PCA.R: PCAs <- loadPCAData()

Which means that I then don't have the PCAs object to run the two predict scripts.
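For what it's worth, R throws that exact message whenever the condition inside if() evaluates to NA; a minimal illustration (not the project code), with a guard that would avoid it:

# Illustration only: if() needs a single TRUE/FALSE, so an NA comparison fails.
time <- NA_real_            # e.g. a missing file-modification time
data.time <- 1500000000
if (time > data.time) {     # NA > 1500000000 is NA -> the error above
  message("data are newer than the cached PCA")
}
# A guard such as if (!is.na(time) && time > data.time) avoids the error.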

hpoulos commented 7 years ago

Right now the RF model uses the defaults. A couple of things can influence model fit, and varying the tuning parameters can improve it. For example, we can vary the number of trees generated and the mtry value (the number of splitter variables tried at each split), and we can also use only the top-performing predictor variables.
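As a rough illustration of that kind of tuning with the randomForest package (a sketch only; topo_df is a placeholder for whatever data frame the script actually fits, and the parameter values are arbitrary):

library(randomForest)
# Sketch: override the regression defaults (ntree = 500, mtry = p/3).
rf_fit <- randomForest(PC1 ~ elev + ldist_ridge + ldist_valley + msd +
                         radiation + relev_l + slope,
                       data = topo_df, ntree = 2000, mtry = 4,
                       importance = TRUE)
importance(rf_fit)  # could guide keeping only the top-performing predictors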

Another element in the scripts is whether or not to use training and test data to evaluate model performance. Originally, we decided not to split the data because of the small number of iButtons in each mountain range. That means we cannot generate ROC or AUC statistics to estimate model fit, which may be important for evaluating individual model performance.

Another thing I am wondering (but can't look at myself until the scripts run for me) is whether the poorly fitting models are far from the envelope of factor loadings. Your suggestion to subset the climate data may be labor-intensive, but using all measurements to generate a single PCA may lead to poor model fits. Perhaps we should just try generating a PCA for January tmin for one of the mountain ranges and see how that influences the RF model fit and the subsequent temporal backcasting.
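A sketch of that subsetting idea, assuming a hypothetical long-format data frame tmin_long with columns sensor, date, and tmin (the real object names in the repo differ):

# Sketch: PCA on January tmin only, for one mountain range.
jan <- subset(tmin_long, format(date, "%m") == "01")
jan_wide <- with(jan, tapply(tmin, list(date, sensor), mean))  # dates x sensors
jan_pca <- prcomp(na.omit(jan_wide), center = TRUE, scale. = TRUE)
summary(jan_pca)  # compare variance explained against the all-months PCA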

dschwilk commented 7 years ago

@hpoulos: any progress on the RF models? Continuing on the road we are already on, I can modify the codebase to use our current PCA decomposition -> modeling setup to reconstruct one year at a time and then save only the bioclim annual summaries. I will do that, but meanwhile you were going to check the spatial modeling and deal with the poor model fits.
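In outline, the reconstruction step I have in mind is just the usual PCA inverse (a sketch; pred_scores, pred_loadings, and center are placeholder matrices/vectors, not objects in the current code):

# Sketch: reconstruct daily values for one year from predicted PCA axes.
# pred_scores:   dates x axes matrix (temporal reconstruction for that year)
# pred_loadings: locations x axes matrix (spatial predictions from topo models)
one_year <- pred_scores %*% t(pred_loadings)   # dates x locations
one_year <- sweep(one_year, 2, center, "+")    # add back the per-location means
# ...then summarize to bioclim-style annual variables and discard the dailies.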

hpoulos commented 7 years ago

I'm in Tucson this week at the AFE meeting and was working on the presentation all week after JFSP (and 3 other submissions that week). I am back on the horse next week. Thanks for bringing this back into my frame of reference.

dschwilk commented 7 years ago

The current commit (edecb8e) fits topo models using boosted regression trees. The model outputs, however, are not very useful (e.g., results/topo_mod_results/CM_tmin.txt) and lack summaries. Can we turn off the replication-level detail and get summaries instead? I also see that model diagnostic plots are no longer produced, so I'm having trouble trusting these new models when I have no evidence they work better than the random forest models. The model output streams now have over 40k lines, which is more than I can process or want to save.

dschwilk commented 7 years ago

OK, so do we really want the train(verboseIter=TRUE) flag? See https://github.com/schwilklab/skyisland-climate/commit/778113d2d9a1bc96cc51b81711e7c1cb391b6342#diff-f29a543d0e09e7572db8cff0c0811429R91

I think we should drop that and then add better model-summary diagnostics. But I'm not yet familiar enough with the model objects returned by this method. Ideas? It looks like print(res), where res is the model returned by train(), does not get us much. What do we need to save to the text files in topo_mod_results? Then we need to save some more complete model summaries. This code goes in fitModelRunDiagnostics().

The current code runs and produces predictions of PC1, PC2 and PC3 for each mtn and variable (tmin and tmax). We just need some better recording of the model summaries.
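Something like the following is what I mean by better recording (a sketch; res is the object returned by caret's train(), and the function name is just a placeholder for what would go in fitModelRunDiagnostics()):

# Sketch: compact diagnostics instead of the verbose iteration log.
printModelDiagnostics <- function(res) {
  cat("Best tuning parameters:\n"); print(res$bestTune)
  cat("\nCross-validated performance per resample:\n"); print(res$resample)
  cat("\nVariable importance:\n"); print(caret::varImp(res))
}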

dschwilk commented 7 years ago

@hpoulos, if you point me to the documentation for the model objects you are creating, I will figure out how to extract the summaries we want and print them to the screen (which is also redirected to the text files in results/topo_mod_results/; see the sink() command). The easy part is running a model; the work is all the organizational coding, keeping track of results, etc. It can't be done by trial and error; we need the documentation of the objects. Is this covered in the caret docs?

hpoulos commented 7 years ago

The caret documentation is at http://topepo.github.io/caret/index.html and Max Kuhn's repo is on GitHub at https://github.com/topepo/caret. His two emails are mxkuhn@gmail.com (personal) and max@rstudio.com (RStudio).

dschwilk commented 7 years ago

@hpoulos: But the models themselves are xgboost (eXtreme Gradient Boosting) models. Since these are complicated model objects, I am saving the whole model in RDS files (one per mtn by variable by axis combination). But we probably want to print out some summary statistics. Can you tell me which are most appropriate for such boosted regression tree models? I'll send the DM tmin PC1 model by email as an example for us to discuss. Looking at the model object, here are all the parts (it's long!):

> str(mod1)
List of 23
 $ method      : chr "xgbTree"
 $ modelInfo   :List of 14
  ..$ label     : chr "eXtreme Gradient Boosting"
  ..$ library   : chr [1:2] "xgboost" "plyr"
  ..$ type      : chr [1:2] "Regression" "Classification"
  ..$ parameters:'data.frame':  7 obs. of  3 variables:
  .. ..$ parameter: Factor w/ 7 levels "colsample_bytree",..: 6 4 2 3 1 5 7
  .. ..$ class    : Factor w/ 1 level "numeric": 1 1 1 1 1 1 1
  .. ..$ label    : Factor w/ 7 levels "# Boosting Iterations",..: 1 2 5 3 7 4 6
  ..$ grid      :function (x, y, len = NULL, search = "grid")  
  .. ..- attr(*, "srcref")=Class 'srcref'  atomic [1:8] 13 26 34 19 26 19 13 34
  .. .. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x1b489b78> 
  ..$ loop      :function (grid)  
  .. ..- attr(*, "srcref")=Class 'srcref'  atomic [1:8] 35 26 52 19 26 19 35 52
  .. .. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x1b489b78> 
  ..$ fit       :function (x, y, wts, param, lev, last, classProbs, ...)  
  .. ..- attr(*, "srcref")=Class 'srcref'  atomic [1:8] 53 25 101 19 25 19 53 101
  .. .. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x1b489b78> 
  ..$ predict   :function (modelFit, newdata, submodels = NULL)  
  .. ..- attr(*, "srcref")=Class 'srcref'  atomic [1:8] 102 29 137 19 29 19 102 137
  .. .. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x1b489b78> 
  ..$ prob      :function (modelFit, newdata, submodels = NULL)  
  .. ..- attr(*, "srcref")=Class 'srcref'  atomic [1:8] 138 26 169 19 26 19 138 169
  .. .. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x1b489b78> 
  ..$ predictors:function (x, ...)  
  .. ..- attr(*, "srcref")=Class 'srcref'  atomic [1:8] 170 32 173 19 32 19 170 173
  .. .. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x1b489b78> 
  ..$ varImp    :function (object, numTrees = NULL, ...)  
  .. ..- attr(*, "srcref")=Class 'srcref'  atomic [1:8] 174 28 181 19 28 19 174 181
  .. .. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x1b489b78> 
  ..$ levels    :function (x)  
  .. ..- attr(*, "srcref")=Class 'srcref'  atomic [1:8] 182 28 182 50 28 50 182 182
  .. .. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x1b489b78> 
  ..$ tags      : chr [1:4] "Tree-Based Model" "Boosting" "Ensemble Model" "Implicit Feature Selection"
  ..$ sort      :function (x)  
  .. ..- attr(*, "srcref")=Class 'srcref'  atomic [1:8] 184 26 188 19 26 19 184 188
  .. .. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x1b489b78> 
 $ modelType   : chr "Regression"
 $ results     :'data.frame':   4000 obs. of  11 variables:
  ..$ eta             : num [1:4000] 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 ...
  ..$ max_depth       : int [1:4000] 1 1 1 1 1 1 1 1 1 1 ...
  ..$ gamma           : num [1:4000] 0 0 0 0 0 0 0 0 0 0 ...
  ..$ colsample_bytree: num [1:4000] 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 ...
  ..$ min_child_weight: num [1:4000] 1 1 1 1 1 1 1 1 1 1 ...
  ..$ subsample       : num [1:4000] 0.5 0.556 0.611 0.667 0.722 ...
  ..$ nrounds         : num [1:4000] 50 50 50 50 50 50 50 50 50 50 ...
  ..$ RMSE            : num [1:4000] 0.00301 0.00282 0.00255 0.00289 0.00336 ...
  ..$ Rsquared        : num [1:4000] 0.907 0.908 0.925 0.926 0.902 ...
  ..$ RMSESD          : num [1:4000] 0.000909 0.001819 0.000877 0.001586 0.001459 ...
  ..$ RsquaredSD      : num [1:4000] 0.0746 0.0886 0.0172 0.0125 0.04 ...
 $ pred        : NULL
 $ bestTune    :'data.frame':   1 obs. of  7 variables:
  ..$ nrounds         : num 150
  ..$ max_depth       : int 1
  ..$ eta             : num 0.4
  ..$ gamma           : num 0
  ..$ colsample_bytree: num 0.8
  ..$ min_child_weight: num 1
  ..$ subsample       : num 0.722
 $ call        : language train.formula(form = formula, data = df, tuneLength = 10, method = "xgbTree",      trControl = trainControl(method = "cv", number = 5, preProc = c("center",  ...
 $ dots        : list()
 $ metric      : chr "RMSE"
 $ control     :List of 28
  ..$ method           : chr "cv"
  ..$ number           : num 5
  ..$ repeats          : num 1
  ..$ search           : chr "grid"
  ..$ p                : num 0.75
  ..$ initialWindow    : NULL
  ..$ horizon          : num 1
  ..$ fixedWindow      : logi TRUE
  ..$ skip             : num 0
  ..$ verboseIter      : logi FALSE
  ..$ returnData       : logi TRUE
  ..$ returnResamp     : chr "final"
  ..$ savePredictions  : chr "none"
  ..$ classProbs       : logi FALSE
  ..$ summaryFunction  :function (data, lev = NULL, model = NULL)  
  ..$ selectionFunction: chr "best"
  ..$ preProcOptions   : chr [1:2] "center" "scale"
  ..$ sampling         : NULL
  ..$ index            :List of 5
  .. ..$ Fold1: int [1:30] 1 2 3 4 5 6 7 8 9 10 ...
  .. ..$ Fold2: int [1:29] 1 2 3 5 6 7 8 10 11 12 ...
  .. ..$ Fold3: int [1:28] 1 4 5 6 7 8 9 11 12 14 ...
  .. ..$ Fold4: int [1:29] 2 3 4 5 6 7 9 10 11 13 ...
  .. ..$ Fold5: int [1:28] 1 2 3 4 8 9 10 12 13 14 ...
  ..$ indexOut         :List of 5
  .. ..$ Resample1: int [1:6] 14 22 23 27 31 34
  .. ..$ Resample2: int [1:7] 4 9 19 21 26 28 30
  .. ..$ Resample3: int [1:8] 2 3 10 13 15 17 18 25
  .. ..$ Resample4: int [1:7] 1 8 12 20 24 33 35
  .. ..$ Resample5: int [1:8] 5 6 7 11 16 29 32 36
  ..$ indexFinal       : NULL
  ..$ timingSamps      : num 0
  ..$ predictionBounds : logi [1:2] FALSE FALSE
  ..$ seeds            :List of 6
  .. ..$ : int [1:400] 435024 595426 505745 856404 119689 825034 695500 439204 86251 294063 ...
  .. ..$ : int [1:400] 405534 424076 886955 376914 601969 99435 846053 304044 998967 389839 ...
  .. ..$ : int [1:400] 817448 790880 408757 766674 825105 341238 75133 614491 819769 449193 ...
  .. ..$ : int [1:400] 217706 824252 197438 169512 798066 12012 155612 325956 181950 180416 ...
  .. ..$ : int [1:400] 751366 310671 855214 512607 319229 853664 485708 985434 988507 58236 ...
  .. ..$ : int 634337
  ..$ adaptive         :List of 4
  .. ..$ min     : num 5
  .. ..$ alpha   : num 0.05
  .. ..$ method  : chr "gls"
  .. ..$ complete: logi TRUE
  ..$ trim             : logi FALSE
  ..$ allowParallel    : logi TRUE
  ..$ yLimits          : num [1:2] -0.181 -0.143
 $ finalModel  :List of 10
  ..$ handle     :Class 'xgb.Booster.handle' <externalptr> 
  ..$ raw        : raw [1:33816] 00 00 00 3f ...
  ..$ niter      : num 150
  ..$ call       : language xgb.train(params = list(eta = param$eta, max_depth = param$max_depth, gamma = param$gamma,      colsample_bytree = param$colsample_bytree, min_child_weight = param$min_child_weight,  ...
  ..$ params     :List of 8
  .. ..$ eta             : num 0.4
  .. ..$ max_depth       : int 1
  .. ..$ gamma           : num 0
  .. ..$ colsample_bytree: num 0.8
  .. ..$ min_child_weight: num 1
  .. ..$ subsample       : num 0.722
  .. ..$ objective       : chr "reg:linear"
  .. ..$ silent          : num 1
  ..$ callbacks  :List of 1
  .. ..$ cb.print.evaluation:function (env = parent.frame())  
  .. .. ..- attr(*, "call")= language cb.print.evaluation(period = print_every_n)
  .. .. ..- attr(*, "name")= chr "cb.print.evaluation"
  ..$ xNames     : chr [1:7] "elev" "ldist_ridge" "ldist_valley" "msd" ...
  ..$ problemType: chr "Regression"
  ..$ tuneValue  :'data.frame': 1 obs. of  7 variables:
  .. ..$ nrounds         : num 150
  .. ..$ max_depth       : int 1
  .. ..$ eta             : num 0.4
  .. ..$ gamma           : num 0
  .. ..$ colsample_bytree: num 0.8
  .. ..$ min_child_weight: num 1
  .. ..$ subsample       : num 0.722
  ..$ obsLevels  : logi NA
  ..- attr(*, "class")= chr "xgb.Booster"
 $ preProcess  : NULL
 $ trainingData:'data.frame':   36 obs. of  8 variables:
  ..$ .outcome    : num [1:36] -0.16 -0.16 -0.155 -0.156 -0.175 ...
  ..$ elev        : num [1:36] 2242 2229 2245 2227 1816 ...
  ..$ ldist_ridge : num [1:36] 28.9 119.2 155.7 163.6 40.9 ...
  ..$ ldist_valley: num [1:36] 302 202 209 185 145 ...
  ..$ msd         : num [1:36] 7 2 0 0 0 0 0 1 0 0 ...
  ..$ radiation   : num [1:36] 1702569 1803335 1353465 1558216 1848745 ...
  ..$ relev_l     : num [1:36] -273 -83.2 -52.8 -21.6 -103.7 ...
  ..$ slope       : num [1:36] 18.9 34 28.4 16.6 10.1 ...
 $ resample    :'data.frame':   5 obs. of  3 variables:
  ..$ RMSE    : num [1:5] 0.00249 0.00309 0.00156 0.00236 0.0012
  ..$ Rsquared: num [1:5] 0.923 0.957 0.941 0.981 0.982
  ..$ Resample: chr [1:5] "Fold4" "Fold2" "Fold3" "Fold5" ...
 $ resampledCM : NULL
 $ perfNames   : chr [1:2] "RMSE" "Rsquared"
 $ maximize    : logi FALSE
 $ yLimits     : num [1:2] -0.181 -0.143
 $ times       :List of 3
  ..$ everything:Class 'proc_time'  Named num [1:5] 1351.23 2.19 192.3 0 0
  .. .. ..- attr(*, "names")= chr [1:5] "user.self" "sys.self" "elapsed" "user.child" ...
  ..$ final     :Class 'proc_time'  Named num [1:5] 0.148 0.008 0.02 0 0
  .. .. ..- attr(*, "names")= chr [1:5] "user.self" "sys.self" "elapsed" "user.child" ...
  ..$ prediction: logi [1:3] NA NA NA
 $ levels      : logi NA
 $ terms       :Classes 'terms', 'formula'  language PC1 ~ elev + ldist_ridge + ldist_valley + msd + radiation + relev_l + slope
  .. ..- attr(*, "variables")= language list(PC1, elev, ldist_ridge, ldist_valley, msd, radiation, relev_l, slope)
  .. ..- attr(*, "factors")= int [1:8, 1:7] 0 1 0 0 0 0 0 0 0 0 ...
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:8] "PC1" "elev" "ldist_ridge" "ldist_valley" ...
  .. .. .. ..$ : chr [1:7] "elev" "ldist_ridge" "ldist_valley" "msd" ...
  .. ..- attr(*, "term.labels")= chr [1:7] "elev" "ldist_ridge" "ldist_valley" "msd" ...
  .. ..- attr(*, "order")= int [1:7] 1 1 1 1 1 1 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: 0x9e28088> 
  .. ..- attr(*, "predvars")= language list(PC1, elev, ldist_ridge, ldist_valley, msd, radiation, relev_l, slope)
  .. ..- attr(*, "dataClasses")= Named chr [1:8] "numeric" "numeric" "numeric" "numeric" ...
  .. .. ..- attr(*, "names")= chr [1:8] "PC1" "elev" "ldist_ridge" "ldist_valley" ...
 $ coefnames   : chr [1:7] "elev" "ldist_ridge" "ldist_valley" "msd" ...
 $ xlevels     : Named list()
 - attr(*, "class")= chr [1:2] "train" "train.formula"

So is mod$results$Rsquared an appropriate summary? And which part? That is a vector, so the mean? Sorry, I am only very vaguely familiar with these machine-learning approaches and I am a bit lost. You mention in the comments in predict-spatial.R that you had trouble "printing" the Rsquared. What exactly did you try? I see a lot of slots in that object and we can print out whatever we want. I'm hoping you can tell me what is most useful.

For the model I will send by email, here is the mean:

> mean(mod1$results$Rsquared)
[1] 0.8931741
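One thing I notice is that mod1$results seems to hold one row per tuning-parameter combination, so the mean over all 4000 rows mixes good and bad settings. If that is right, pulling out only the row matching bestTune might be closer to what we want (a sketch):

# Sketch: resampled performance for the selected tuning combination only.
merge(mod1$bestTune, mod1$results)  # the row of results matching bestTune
caret::getTrainPerf(mod1)           # caret's built-in summary of the same idea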
dschwilk commented 7 years ago

Perhaps it is the resample slot we want?

 $ resample    :'data.frame':   5 obs. of  3 variables:
  ..$ RMSE    : num [1:5] 0.00249 0.00309 0.00156 0.00236 0.0012
  ..$ Rsquared: num [1:5] 0.923 0.957 0.941 0.981 0.982
  ..$ Resample: chr [1:5] "Fold4" "Fold2" "Fold3" "Fold5" ...
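If so, something like this could go into the diagnostics output (a sketch, using mod1 as above):

# Sketch: per-fold cross-validation performance plus its mean and SD.
print(mod1$resample)
colMeans(mod1$resample[, c("RMSE", "Rsquared")])
apply(mod1$resample[, c("RMSE", "Rsquared")], 2, sd)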
dschwilk commented 7 years ago

I moved our record of current practice to the project README (see 693a6c18c378b5ba1357b1ff71ef99b9fdc28191). I will open a new issue that focuses on the problem I brought up in the last two comments.