serbinsh closed this issue 4 years ago
I will say that the built-in pls package method seems to work much better with the https://ecosis.org/package/leaf-reflectance-plant-functional-gradient-ifgg-kit dataset. Perhaps this is a case for when to use one or the other: for very large training datasets I think our method may be better.
In the current code, you are looking for the first component n for which the decrease in PRESS is not significant compared to component n-1, so it is looking for the first plateau. Sometimes the first plateau does not correspond to the best model. I wrote another version where you compare the PRESS of the best model (i.e. the number of components with the lowest PRESS) against each smaller model and find the smallest number of components with no statistical difference. It will probably find a higher number. I guess the optimal number of components will always be subjective.

```r
mean_PRESS_comp <- apply(X = pressDF, MARGIN = 2, FUN = mean)
best_model <- which.min(mean_PRESS_comp)
results <- as.vector(array(data = "NA", dim = c(best_model - 1, 1)))
for (i in seq_along(1:(best_model - 1))) {
  comp1 <- i
  comp2 <- best_model
  ttest <- t.test(pressDFres$value[which(pressDFres$variable == comp1)],
                  pressDFres$value[which(pressDFres$variable == comp2)])
  results[i] <- round(unlist(ttest$p.value), 8)
}
results <- data.frame(seq(1, best_model - 1, 1), results)
names(results) <- c("Component", "P.value")
results

first <- min(which(as.numeric(as.character(results$P.value)) > 0.05))
nComps <- results$Component[first]
print(paste0("*** Optimal number of components based on t.test: ", nComps))
```
@JulienLamour thanks. I just coded this up. A comparison using the "expanded_spectra-trait_reseco_lma_plsr_example.R" example.
pls: 11
custom, seg=50, iterations=50: 12
lowestPRESS (new), seg=50, iterations=50: 12
I need to find a different example to test these lol
Ok, I was thinking about it: to account for the number of components, which adds complexity to the model, we could do an F test instead of a t test so the model complexity is taken into account. I am trying to find a simple way to do that.
copy that
Overall I think these options are fine as-is, as options. We might just want to explore the F test as one more option. I think the biggest issue is with very large datasets, where PLS overestimates the number of components. I suppose I need to test that one (NEON).
Using "expanded_spectra-trait_kit_lma_plsr_example.R". A little more interesting
pls: 10
custom: 11
lowestPRESS: 13
Ok, it is not possible to compare the models using an F test as I thought. There is a lot of discussion on how to compare models with different numbers of components; see for example https://arxiv.org/pdf/1810.08104.pdf. The plsdof package allows comparing different models more rigorously. For now we can probably keep the current tests.
expanded_spectra-trait_neon_lma_plsr_example.R
pls: 18
custom: 14
lowestPRESS: 17
@JulienLamour Yeah. I think as you say, and as I have noticed in the past, there may not be a "silver bullet" for component selection and it's somewhat case by case. See the NEON example above: with that large dataset the best approach is the basic permutation approach, because both PLS and the lowest-PRESS comparison overfit. I think we can call it good at this point; not sure we are going to come up with something better right now.
One more change: we should probably call the three approaches "pls", "firstMin", and "lowestPRESS" instead of pls, custom, lowestPRESS. Thoughts @neo0351?
I think the names should be 'pls', 'firstPlateau','firstMin'
@JulienLamour OK perfect. One more set of updates to make these changes and then I think the scripts are done. I will need to re-run the vignettes, but otherwise good to go.
You went very quickly for me these last days, so I haven't checked the code in detail. I ran the code and saw that there are no bugs, but I didn't check, for example, the outputs, coefficients, things like that. But I guess you already did?
@JulienLamour my review hasn't raised any flags, as the outputs look correct to me. Take a look at the README page and the URLs that take you to pre-baked output. Or if you get a chance, please run some of the examples. But I think we are good.
@JulienLamour Could you give me a quick summary of the three methods for the readme? Maybe a sentence for each?
The method from the pls package consists of choosing the model with the fewest components that is still less than one standard error away from the overall best model. The method 'first min' consists of choosing the first component that gives statistically (t-test) the same result as the following component. This method finds the first 'plateau' in the decrease of PRESS. The last method 'lowestPRESS' finds the first component that gives statistically (t-test) the same result as the overall best model.
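For reference, the one-standard-error rule described above is available directly in the pls package as selectNcomp(). A minimal sketch using the package's built-in yarn data (the dataset and ncomp choice here are illustrative, not from this repository):

```r
library(pls)

# built-in example data: NIR spectra predicting yarn density
data(yarn)

# fit a PLSR model with cross-validation (ncomp chosen for illustration)
fit <- plsr(density ~ NIR, ncomp = 15, data = yarn, validation = "CV")

# fewest components within one standard error of the overall best model
ncomp_onesigma <- selectNcomp(fit, method = "onesigma")
ncomp_onesigma
```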
To answer Shawn's comment, the method 'firstMin' (i.e. the method previously called 'custom') should maybe be renamed 'firstPlateau' since it doesn't look for the minimum. The method 'lowestPRESS' could actually be renamed 'firstMin'. So it would be:
The method from the pls package consists of choosing the model with the fewest components that is still less than one standard error away from the overall best model. The method 'firstPlateau' consists of choosing the first component that gives statistically (t-test) the same result as the following component. This method finds the first 'plateau' in the decrease of PRESS. The last method 'firstMin' finds the first component that gives statistically (t-test) the same result as the overall best model.
@serbinsh How shall we proceed? Are we renaming 'custom'?
@neo0351 I am working on revisions that rename the options to pls, firstMin, and firstPlateau
@JulienLamour @serbinsh is this now correct? The 'pls' option chooses the model with the fewest components that is still less than one standard error away from the overall best model. The 'firstPlateau' option chooses the first component that gives statistically (t-test) the same result as the following component; this method finds the first 'plateau' in the decrease of PRESS. 'firstMin' finds the first component that gives statistically (t-test) the same result as the overall best model.
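A minimal sketch of the two t-test rules as summarized above, assuming a PRESS matrix with one row per resampling iteration and one column per component. The names (`pressDF`, `select_components`) and the alpha value are illustrative, not the package's actual code:

```r
select_components <- function(pressDF, alpha = 0.05) {
  mean_press <- colMeans(pressDF)
  best <- which.min(mean_press)  # component with overall lowest mean PRESS

  # 'firstPlateau': first component whose PRESS is statistically the same
  # as that of the *following* component (first plateau in the curve)
  firstPlateau <- best
  for (i in seq_len(ncol(pressDF) - 1)) {
    if (t.test(pressDF[, i], pressDF[, i + 1])$p.value > alpha) {
      firstPlateau <- i
      break
    }
  }

  # 'firstMin': smallest number of components whose PRESS is statistically
  # the same as that of the overall best model
  firstMin <- best
  for (i in seq_len(best)) {
    if (t.test(pressDF[, i], pressDF[, best])$p.value > alpha) {
      firstMin <- i
      break
    }
  }

  c(firstPlateau = firstPlateau, firstMin = firstMin, best = best)
}
```

Because firstMin compares against the overall minimum rather than just the next component, it can select more components than firstPlateau when the PRESS curve has a local flat spot before its true minimum.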
Confirmed that the updated component-selection code is much faster with NEON!
6 first plateau, 15 first min, 18 pls package
Now the default pls version is much slower with NEON than our methods... interesting.
Also, if we wanted to, we could use map or another parallel loop with our approaches, since each iteration is independent: e.g. a foreach() over each random subset of the data.
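A hedged sketch of that idea using base R's parallel package; find_press() here is a hypothetical stand-in for one PRESS-resampling iteration (draw a random subset, fit the PLSR, return PRESS per component), not the actual function in the scripts:

```r
library(parallel)

# Hypothetical stand-in for one resampling iteration; the real version
# would fit a PLSR on a random data split. Faked here with sorted noise.
find_press <- function(i) {
  set.seed(i)
  sort(runif(5, min = 1, max = 10), decreasing = TRUE)
}

# Iterations are independent, so they can run in parallel.
# (mc.cores must be 1 on Windows; mclapply forks on Unix-alikes.)
press_list <- mclapply(seq_len(50), find_press, mc.cores = 2)
pressDF <- do.call(rbind, press_list)  # 50 x 5 matrix of PRESS values
```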
It's weird, firstPlateau and firstMin should take the same amount of time.
@JulienLamour why? I thought firstPlateau was the first time the next component's PRESS is not significantly less than the preceding one's, but that can be a local minimum, and firstMin is when we hit the first component that is similar to the overall minimum? There have been lots of cooks and changes, so why don't you actually run the code, look at what it's generating, and make sure we are all on the same page.
Basically, the computation time is spent performing all the PLS fits and predictions, and that work is the same for both approaches since they do the same thing. The calculation of the number of components afterwards should be nearly instantaneous.
...and if they should be the same, then why do we have two different options? My understanding was that plateau was based on my original t.test over the vector of components, where it finds when the next-highest component's PRESS isn't significantly better than the previous one's, but we were concerned that was too conservative.
Wait, now I don't know what you mean @JulienLamour. I am talking about the number of components selected, not computation time.
Oh I see!! My bad
Yeah, now the slowest method is the default method in the pls package, whereas yes, our two approaches are the same in terms of computation time!
Closing this since we have achieved the goal laid out in this issue
Both approaches are less than desirable for different reasons
1) pls jackknife selection often results in more components than necessary
2) The t.test permutation requires a large enough number of iterations, otherwise the auto-select chooses far too few. Or, if there is a lot of variance in the data, it can result in the selection of a small number of components. For example, with the https://ecosis.org/package/leaf-reflectance-plant-functional-gradient-ifgg-kit dataset my tests keep selecting 2 components, which isn't correct. I wonder if we can refine how we find the minimum?