schwilklab / skyisland-climate

Climate data and code for Sky Island project

Notes on spatial and temporal temperature modeling #34

Closed dschwilk closed 7 years ago

dschwilk commented 8 years ago

Overall notes:

I am using the R terminology of "scores" and "loadings".

Right now the code does not deal well with poor models. If no model is a good fit, then we should use the null model for prediction (e.g., reconstruct scores for all dates as the mean axis score and loadings for all locations as the mean axis loading). We need to change this.
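Something along these lines is what I mean by the null fallback (a rough sketch only; null_predict, scores, and loadings are placeholder names, not objects in the current code):

# Rough sketch of a null-model fallback; not the actual project code.
# 'scores' (dates x axes) and 'loadings' (locations x axes) are placeholder
# data frames; 'axis' is an axis name such as "PC1".
null_predict <- function(scores, loadings, axis) {
  list(scores   = rep(mean(scores[[axis]], na.rm = TRUE), nrow(scores)),
       loadings = rep(mean(loadings[[axis]], na.rm = TRUE), nrow(loadings)))
}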

I have somewhat arbitrarily chosen the following topo variables:

c("elev","ldist_ridge" , "ldist_valley",  "msd", "radiation","relev_l", "slope")

zdist_ridge and zdist_valley were too tightly correlated with elevation to be useful.
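A quick way to check that kind of collinearity when choosing the predictor set (a sketch; topo is a placeholder data frame holding the candidate topographic variables):

# Sketch: pairwise correlations among candidate topo predictors.
# 'topo' is a placeholder data frame with the variables listed above plus
# zdist_ridge and zdist_valley.
vars <- c("elev", "zdist_ridge", "zdist_valley", "ldist_ridge", "ldist_valley",
          "msd", "radiation", "relev_l", "slope")
round(cor(topo[, vars], use = "pairwise.complete.obs"), 2)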

Notes on spatial models (random forest)

hpoulos commented 7 years ago

I just ran all of the scripts on skyisland-climate and there are a few errors.

In predict-spatial.R and a couple of other scripts (microclimate-topo-PCA.R), I get the following error message in RStudio:

Error in if (time > data.time) { : missing value where TRUE/FALSE needed

It happens especially here in microclimate-topo-PCA.R: PCAs <- loadPCAData()

Which means that I then don't have the PCAs object to run the two predict scripts.
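For what it's worth, R throws that exact message whenever the condition inside if() evaluates to NA; a minimal illustration (not the project code), with a guard that would avoid it:

# Illustration only: if() needs a single TRUE/FALSE, so an NA comparison fails.
time <- NA_real_            # e.g. a missing file-modification time
data.time <- 1500000000
if (time > data.time) {     # NA > 1500000000 is NA -> the error above
  message("data are newer than the cached PCA")
}
# A guard such as if (!is.na(time) && time > data.time) avoids the error.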

hpoulos commented 7 years ago

Right now the RF model uses the defaults. A couple of things can influence model fit, and varying the tuning parameters can improve it. For example, we can vary the number of trees generated and the mtry value (the number of splitter variables tried at each split), and we can also use only the top-performing predictor variables.
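As a rough illustration of that kind of tuning with the randomForest package (a sketch only; topo_df is a placeholder for whatever data frame the script actually fits, and the parameter values are arbitrary):

library(randomForest)
# Sketch: override the regression defaults (ntree = 500, mtry = p/3).
rf_fit <- randomForest(PC1 ~ elev + ldist_ridge + ldist_valley + msd +
                         radiation + relev_l + slope,
                       data = topo_df, ntree = 2000, mtry = 4,
                       importance = TRUE)
importance(rf_fit)  # could guide keeping only the top-performing predictors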

Another element in the scripts is whether or not to use training and test data to evaluate model performance. Originally, we decided not to split the data because of the small number of iButtons in each mountain range. That means we cannot generate ROC or AUC statistics to estimate model fit, which may be important for evaluating individual model performance.

Another thing I am wondering (but can't look at myself until the scripts run for me) is whether the poorly fitting models are far from the envelope of factor loadings. Your suggestion to subset the climate data may be labor-intensive, but using all measurements to generate a single PCA may lead to poor model fits. Perhaps we should just try generating a PCA for January tmin for one of the mountain ranges and see how that influences the RF model fit and the subsequent temporal backcasting.
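A sketch of that subsetting idea, assuming a hypothetical long-format data frame tmin_long with columns sensor, date, and tmin (the real object names in the repo differ):

# Sketch: PCA on January tmin only, for one mountain range.
jan <- subset(tmin_long, format(date, "%m") == "01")
jan_wide <- with(jan, tapply(tmin, list(date, sensor), mean))  # dates x sensors
jan_pca <- prcomp(na.omit(jan_wide), center = TRUE, scale. = TRUE)
summary(jan_pca)  # compare variance explained against the all-months PCA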

dschwilk commented 7 years ago

@hpoulos: any progress on the RF models? Continuing on the road we are already on, I can modify the codebase to use our current PCA decomposition -> modeling setup to reconstruct one year at a time and then save only the bioclim annual summaries. I will do that, but meanwhile you were going to check the spatial modeling and deal with the poor model fits.
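In outline, the reconstruction step I have in mind is just the usual PCA inverse (a sketch; pred_scores, pred_loadings, and center are placeholder matrices/vectors, not objects in the current code):

# Sketch: reconstruct daily values for one year from predicted PCA axes.
# pred_scores:   dates x axes matrix (temporal reconstruction for that year)
# pred_loadings: locations x axes matrix (spatial predictions from topo models)
one_year <- pred_scores %*% t(pred_loadings)   # dates x locations
one_year <- sweep(one_year, 2, center, "+")    # add back the per-location means
# ...then summarize to bioclim-style annual variables and discard the dailies.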

hpoulos commented 7 years ago

I'm in Tucson this week at the AFE meeting and was working on the presentation all week after JFSP (and 3 other submissions that week). I am back on the horse next week. Thanks for bringing this back into my frame of reference.

dschwilk commented 7 years ago

The current commit (edecb8e) fits topo models using boosted regression trees. The model outputs, however, are not very useful (e.g., results/topo_mod_results/CM_tmin.txt) and lack summaries. Can we turn off the replication-level detail and get summaries instead? I also see that model diagnostic plots are no longer produced, so I'm having trouble trusting these new models when I have no evidence they work better than the random forest models. The model output streams now have over 40k lines, which is more than I can process or want to save.

dschwilk commented 7 years ago

OK, so do we really want the train(verboseIter=TRUE) flag? See https://github.com/schwilklab/skyisland-climate/commit/778113d2d9a1bc96cc51b81711e7c1cb391b6342#diff-f29a543d0e09e7572db8cff0c0811429R91

I think we should drop that and then add better model-summary diagnostics. But I'm not yet familiar enough with the model objects returned by this method. Ideas? It looks like print(res), where res is the model returned by train(), does not get us much. What do we need to save to the text files in topo_mod_results? Then we need to save some more complete model summaries. This code goes in fitModelRunDiagnostics().

The current code runs and produces predictions of PC1, PC2 and PC3 for each mtn and variable (tmin and tmax). We just need some better recording of the model summaries.
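Something like the following is what I mean by better recording (a sketch; res is the object returned by caret's train(), and the function name is just a placeholder for what would go in fitModelRunDiagnostics()):

# Sketch: compact diagnostics instead of the verbose iteration log.
printModelDiagnostics <- function(res) {
  cat("Best tuning parameters:\n"); print(res$bestTune)
  cat("\nCross-validated performance per resample:\n"); print(res$resample)
  cat("\nVariable importance:\n"); print(caret::varImp(res))
}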

dschwilk commented 7 years ago

@hpoulos, if you point me to the documentation for the model objects you are creating, I will figure out how to extract the summaries we want and print them to the screen (which is also redirected to the text files in results/topo_mod_results/; see the sink() command). The easy part is running a model; the work is all the organizational coding, keeping track of results, etc. It can't be done by trial and error; we need the documentation of the objects. Is this covered in the caret docs?

hpoulos commented 7 years ago

The caret documentation is at http://topepo.github.io/caret/index.html and Max Kuhn's repo is on GitHub at https://github.com/topepo/caret. His two emails are mxkuhn@gmail.com (personal) and max@rstudio.com (RStudio).

dschwilk commented 7 years ago

@hpoulos: But the models themselves are xgboost (eXtreme Gradient Boosting) models. Since these are complicated model objects, I am saving the whole model in RDS files (one per mtn by variable by axis combination). But we probably want to print out some summary statistics. Can you tell me which are most appropriate for such boosted regression tree models? I'll send the DM tmin PC1 model by email as an example for us to discuss. Looking at the model object, here are all the parts (it's long!):

> str(mod1)
List of 23
 $ method      : chr "xgbTree"
 $ modelInfo   :List of 14
  ..$ label     : chr "eXtreme Gradient Boosting"
  ..$ library   : chr [1:2] "xgboost" "plyr"
  ..$ type      : chr [1:2] "Regression" "Classification"
  ..$ parameters:'data.frame':  7 obs. of  3 variables:
  .. ..$ parameter: Factor w/ 7 levels "colsample_bytree",..: 6 4 2 3 1 5 7
  .. ..$ class    : Factor w/ 1 level "numeric": 1 1 1 1 1 1 1
  .. ..$ label    : Factor w/ 7 levels "# Boosting Iterations",..: 1 2 5 3 7 4 6
  ..$ grid      :function (x, y, len = NULL, search = "grid")  
  .. ..- attr(*, "srcref")=Class 'srcref'  atomic [1:8] 13 26 34 19 26 19 13 34
  .. .. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x1b489b78> 
  ..$ loop      :function (grid)  
  .. ..- attr(*, "srcref")=Class 'srcref'  atomic [1:8] 35 26 52 19 26 19 35 52
  .. .. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x1b489b78> 
  ..$ fit       :function (x, y, wts, param, lev, last, classProbs, ...)  
  .. ..- attr(*, "srcref")=Class 'srcref'  atomic [1:8] 53 25 101 19 25 19 53 101
  .. .. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x1b489b78> 
  ..$ predict   :function (modelFit, newdata, submodels = NULL)  
  .. ..- attr(*, "srcref")=Class 'srcref'  atomic [1:8] 102 29 137 19 29 19 102 137
  .. .. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x1b489b78> 
  ..$ prob      :function (modelFit, newdata, submodels = NULL)  
  .. ..- attr(*, "srcref")=Class 'srcref'  atomic [1:8] 138 26 169 19 26 19 138 169
  .. .. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x1b489b78> 
  ..$ predictors:function (x, ...)  
  .. ..- attr(*, "srcref")=Class 'srcref'  atomic [1:8] 170 32 173 19 32 19 170 173
  .. .. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x1b489b78> 
  ..$ varImp    :function (object, numTrees = NULL, ...)  
  .. ..- attr(*, "srcref")=Class 'srcref'  atomic [1:8] 174 28 181 19 28 19 174 181
  .. .. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x1b489b78> 
  ..$ levels    :function (x)  
  .. ..- attr(*, "srcref")=Class 'srcref'  atomic [1:8] 182 28 182 50 28 50 182 182
  .. .. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x1b489b78> 
  ..$ tags      : chr [1:4] "Tree-Based Model" "Boosting" "Ensemble Model" "Implicit Feature Selection"
  ..$ sort      :function (x)  
  .. ..- attr(*, "srcref")=Class 'srcref'  atomic [1:8] 184 26 188 19 26 19 184 188
  .. .. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x1b489b78> 
 $ modelType   : chr "Regression"
 $ results     :'data.frame':   4000 obs. of  11 variables:
  ..$ eta             : num [1:4000] 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 ...
  ..$ max_depth       : int [1:4000] 1 1 1 1 1 1 1 1 1 1 ...
  ..$ gamma           : num [1:4000] 0 0 0 0 0 0 0 0 0 0 ...
  ..$ colsample_bytree: num [1:4000] 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 ...
  ..$ min_child_weight: num [1:4000] 1 1 1 1 1 1 1 1 1 1 ...
  ..$ subsample       : num [1:4000] 0.5 0.556 0.611 0.667 0.722 ...
  ..$ nrounds         : num [1:4000] 50 50 50 50 50 50 50 50 50 50 ...
  ..$ RMSE            : num [1:4000] 0.00301 0.00282 0.00255 0.00289 0.00336 ...
  ..$ Rsquared        : num [1:4000] 0.907 0.908 0.925 0.926 0.902 ...
  ..$ RMSESD          : num [1:4000] 0.000909 0.001819 0.000877 0.001586 0.001459 ...
  ..$ RsquaredSD      : num [1:4000] 0.0746 0.0886 0.0172 0.0125 0.04 ...
 $ pred        : NULL
 $ bestTune    :'data.frame':   1 obs. of  7 variables:
  ..$ nrounds         : num 150
  ..$ max_depth       : int 1
  ..$ eta             : num 0.4
  ..$ gamma           : num 0
  ..$ colsample_bytree: num 0.8
  ..$ min_child_weight: num 1
  ..$ subsample       : num 0.722
 $ call        : language train.formula(form = formula, data = df, tuneLength = 10, method = "xgbTree",      trControl = trainControl(method = "cv", number = 5, preProc = c("center",  ...
 $ dots        : list()
 $ metric      : chr "RMSE"
 $ control     :List of 28
  ..$ method           : chr "cv"
  ..$ number           : num 5
  ..$ repeats          : num 1
  ..$ search           : chr "grid"
  ..$ p                : num 0.75
  ..$ initialWindow    : NULL
  ..$ horizon          : num 1
  ..$ fixedWindow      : logi TRUE
  ..$ skip             : num 0
  ..$ verboseIter      : logi FALSE
  ..$ returnData       : logi TRUE
  ..$ returnResamp     : chr "final"
  ..$ savePredictions  : chr "none"
  ..$ classProbs       : logi FALSE
  ..$ summaryFunction  :function (data, lev = NULL, model = NULL)  
  ..$ selectionFunction: chr "best"
  ..$ preProcOptions   : chr [1:2] "center" "scale"
  ..$ sampling         : NULL
  ..$ index            :List of 5
  .. ..$ Fold1: int [1:30] 1 2 3 4 5 6 7 8 9 10 ...
  .. ..$ Fold2: int [1:29] 1 2 3 5 6 7 8 10 11 12 ...
  .. ..$ Fold3: int [1:28] 1 4 5 6 7 8 9 11 12 14 ...
  .. ..$ Fold4: int [1:29] 2 3 4 5 6 7 9 10 11 13 ...
  .. ..$ Fold5: int [1:28] 1 2 3 4 8 9 10 12 13 14 ...
  ..$ indexOut         :List of 5
  .. ..$ Resample1: int [1:6] 14 22 23 27 31 34
  .. ..$ Resample2: int [1:7] 4 9 19 21 26 28 30
  .. ..$ Resample3: int [1:8] 2 3 10 13 15 17 18 25
  .. ..$ Resample4: int [1:7] 1 8 12 20 24 33 35
  .. ..$ Resample5: int [1:8] 5 6 7 11 16 29 32 36
  ..$ indexFinal       : NULL
  ..$ timingSamps      : num 0
  ..$ predictionBounds : logi [1:2] FALSE FALSE
  ..$ seeds            :List of 6
  .. ..$ : int [1:400] 435024 595426 505745 856404 119689 825034 695500 439204 86251 294063 ...
  .. ..$ : int [1:400] 405534 424076 886955 376914 601969 99435 846053 304044 998967 389839 ...
  .. ..$ : int [1:400] 817448 790880 408757 766674 825105 341238 75133 614491 819769 449193 ...
  .. ..$ : int [1:400] 217706 824252 197438 169512 798066 12012 155612 325956 181950 180416 ...
  .. ..$ : int [1:400] 751366 310671 855214 512607 319229 853664 485708 985434 988507 58236 ...
  .. ..$ : int 634337
  ..$ adaptive         :List of 4
  .. ..$ min     : num 5
  .. ..$ alpha   : num 0.05
  .. ..$ method  : chr "gls"
  .. ..$ complete: logi TRUE
  ..$ trim             : logi FALSE
  ..$ allowParallel    : logi TRUE
  ..$ yLimits          : num [1:2] -0.181 -0.143
 $ finalModel  :List of 10
  ..$ handle     :Class 'xgb.Booster.handle' <externalptr> 
  ..$ raw        : raw [1:33816] 00 00 00 3f ...
  ..$ niter      : num 150
  ..$ call       : language xgb.train(params = list(eta = param$eta, max_depth = param$max_depth, gamma = param$gamma,      colsample_bytree = param$colsample_bytree, min_child_weight = param$min_child_weight,  ...
  ..$ params     :List of 8
  .. ..$ eta             : num 0.4
  .. ..$ max_depth       : int 1
  .. ..$ gamma           : num 0
  .. ..$ colsample_bytree: num 0.8
  .. ..$ min_child_weight: num 1
  .. ..$ subsample       : num 0.722
  .. ..$ objective       : chr "reg:linear"
  .. ..$ silent          : num 1
  ..$ callbacks  :List of 1
  .. ..$ cb.print.evaluation:function (env = parent.frame())  
  .. .. ..- attr(*, "call")= language cb.print.evaluation(period = print_every_n)
  .. .. ..- attr(*, "name")= chr "cb.print.evaluation"
  ..$ xNames     : chr [1:7] "elev" "ldist_ridge" "ldist_valley" "msd" ...
  ..$ problemType: chr "Regression"
  ..$ tuneValue  :'data.frame': 1 obs. of  7 variables:
  .. ..$ nrounds         : num 150
  .. ..$ max_depth       : int 1
  .. ..$ eta             : num 0.4
  .. ..$ gamma           : num 0
  .. ..$ colsample_bytree: num 0.8
  .. ..$ min_child_weight: num 1
  .. ..$ subsample       : num 0.722
  ..$ obsLevels  : logi NA
  ..- attr(*, "class")= chr "xgb.Booster"
 $ preProcess  : NULL
 $ trainingData:'data.frame':   36 obs. of  8 variables:
  ..$ .outcome    : num [1:36] -0.16 -0.16 -0.155 -0.156 -0.175 ...
  ..$ elev        : num [1:36] 2242 2229 2245 2227 1816 ...
  ..$ ldist_ridge : num [1:36] 28.9 119.2 155.7 163.6 40.9 ...
  ..$ ldist_valley: num [1:36] 302 202 209 185 145 ...
  ..$ msd         : num [1:36] 7 2 0 0 0 0 0 1 0 0 ...
  ..$ radiation   : num [1:36] 1702569 1803335 1353465 1558216 1848745 ...
  ..$ relev_l     : num [1:36] -273 -83.2 -52.8 -21.6 -103.7 ...
  ..$ slope       : num [1:36] 18.9 34 28.4 16.6 10.1 ...
 $ resample    :'data.frame':   5 obs. of  3 variables:
  ..$ RMSE    : num [1:5] 0.00249 0.00309 0.00156 0.00236 0.0012
  ..$ Rsquared: num [1:5] 0.923 0.957 0.941 0.981 0.982
  ..$ Resample: chr [1:5] "Fold4" "Fold2" "Fold3" "Fold5" ...
 $ resampledCM : NULL
 $ perfNames   : chr [1:2] "RMSE" "Rsquared"
 $ maximize    : logi FALSE
 $ yLimits     : num [1:2] -0.181 -0.143
 $ times       :List of 3
  ..$ everything:Class 'proc_time'  Named num [1:5] 1351.23 2.19 192.3 0 0
  .. .. ..- attr(*, "names")= chr [1:5] "user.self" "sys.self" "elapsed" "user.child" ...
  ..$ final     :Class 'proc_time'  Named num [1:5] 0.148 0.008 0.02 0 0
  .. .. ..- attr(*, "names")= chr [1:5] "user.self" "sys.self" "elapsed" "user.child" ...
  ..$ prediction: logi [1:3] NA NA NA
 $ levels      : logi NA
 $ terms       :Classes 'terms', 'formula'  language PC1 ~ elev + ldist_ridge + ldist_valley + msd + radiation + relev_l + slope
  .. ..- attr(*, "variables")= language list(PC1, elev, ldist_ridge, ldist_valley, msd, radiation, relev_l, slope)
  .. ..- attr(*, "factors")= int [1:8, 1:7] 0 1 0 0 0 0 0 0 0 0 ...
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:8] "PC1" "elev" "ldist_ridge" "ldist_valley" ...
  .. .. .. ..$ : chr [1:7] "elev" "ldist_ridge" "ldist_valley" "msd" ...
  .. ..- attr(*, "term.labels")= chr [1:7] "elev" "ldist_ridge" "ldist_valley" "msd" ...
  .. ..- attr(*, "order")= int [1:7] 1 1 1 1 1 1 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: 0x9e28088> 
  .. ..- attr(*, "predvars")= language list(PC1, elev, ldist_ridge, ldist_valley, msd, radiation, relev_l, slope)
  .. ..- attr(*, "dataClasses")= Named chr [1:8] "numeric" "numeric" "numeric" "numeric" ...
  .. .. ..- attr(*, "names")= chr [1:8] "PC1" "elev" "ldist_ridge" "ldist_valley" ...
 $ coefnames   : chr [1:7] "elev" "ldist_ridge" "ldist_valley" "msd" ...
 $ xlevels     : Named list()
 - attr(*, "class")= chr [1:2] "train" "train.formula"

So is mod$results$Rsquared an appropriate summary? And which part? That is a vector, so the mean? Sorry, I am only very vaguely familiar with these machine-learning approaches and I am a bit lost. You mention in the comments in predict-spatial.R that you had trouble "printing" the Rsquared. What exactly did you try? I see a lot of slots in that object and we can print out whatever we want. I'm hoping you can tell me what is most useful.

For the model I will send by email, here is the mean:

> mean(mod1$results$Rsquared)
[1] 0.8931741
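One thing I notice is that mod1$results seems to hold one row per tuning-parameter combination, so the mean over all 4000 rows mixes good and bad settings. If that is right, pulling out only the row matching bestTune might be closer to what we want (a sketch):

# Sketch: resampled performance for the selected tuning combination only.
merge(mod1$bestTune, mod1$results)  # the row of results matching bestTune
caret::getTrainPerf(mod1)           # caret's built-in summary of the same idea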
dschwilk commented 7 years ago

Perhaps it is the resample slot we want?

 $ resample    :'data.frame':   5 obs. of  3 variables:
  ..$ RMSE    : num [1:5] 0.00249 0.00309 0.00156 0.00236 0.0012
  ..$ Rsquared: num [1:5] 0.923 0.957 0.941 0.981 0.982
  ..$ Resample: chr [1:5] "Fold4" "Fold2" "Fold3" "Fold5" ...
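If so, something like this could go into the diagnostics output (a sketch, using mod1 as above):

# Sketch: per-fold cross-validation performance plus its mean and SD.
print(mod1$resample)
colMeans(mod1$resample[, c("RMSE", "Rsquared")])
apply(mod1$resample[, c("RMSE", "Rsquared")], 2, sd)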
dschwilk commented 7 years ago

I moved our record of current practice to the project README (see 693a6c18c378b5ba1357b1ff71ef99b9fdc28191). I will open a new issue that focuses on the problem I brought up in the last two comments.