serbinsh closed this issue 4 years ago
I will say that the built-in pls package method seems to work much better with the https://ecosis.org/package/leaf-reflectance-plant-functional-gradient-ifgg-kit dataset. Perhaps this is a case for when to use one or the other: for very large training datasets I think our method may be better.
In the current code, you are looking for the first component n for which the decrease in PRESS is not significant compared to component n-1, so it is looking for the first plateau. Sometimes the first plateau does not correspond to the best model. I wrote another version where you compare the PRESS of the best model (i.e. the number of components with the lowest PRESS) against each smaller model and find the smallest number of components with no statistical difference. It will probably find a higher number. I guess the optimal number of components will always be subjective.

```r
mean_PRESS_comp <- apply(X = pressDF, MARGIN = 2, FUN = mean)
best_model <- which.min(mean_PRESS_comp)
results <- as.vector(array(data = "NA", dim = c(best_model - 1, 1)))
for (i in seq_along(1:(best_model - 1))) {
  comp1 <- i
  comp2 <- best_model
  ttest <- t.test(pressDFres$value[which(pressDFres$variable == comp1)],
                  pressDFres$value[which(pressDFres$variable == comp2)])
  results[i] <- round(unlist(ttest$p.value), 8)
}
results <- data.frame(seq(1, best_model - 1, 1), results)
names(results) <- c("Component", "P.value")
results

first <- min(which(as.numeric(as.character(results$P.value)) > 0.05))
nComps <- results$Component[first]
print(paste0("*** Optimal number of components based on t.test: ", nComps))
```
@JulienLamour thanks. I just coded this up. A comparison using the "expanded_spectra-trait_reseco_lma_plsr_example.R" example.
pls: 11
custom, seg=50, iterations=50: 12
lowestPRESS (new), seg=50, iterations=50: 12
I need to find a different example to test these lol
Ok, I was thinking about it: to account for the number of components, which adds complexity to the model, we could do an F test instead of a t test so the model complexity is taken into account. I am trying to find a simple way to do that.
copy that
Overall I think these options are fine as-is, as options. We might just want to explore the F test as one more option. I think the biggest issue is with very large datasets, where PLS overestimates the number of components. I suppose I need to test that one (NEON).
Using "expanded_spectra-trait_kit_lma_plsr_example.R". A little more interesting
pls: 10
custom: 11
lowestPRESS: 13
Ok, it is not possible to compare the models using an F test as I thought. There is a lot of discussion on how to compare models with different numbers of components; see for example https://arxiv.org/pdf/1810.08104.pdf. The plsdof package allows comparing different models more rigorously. For now we can probably keep the current tests.
expanded_spectra-trait_neon_lma_plsr_example.R
pls: 18
custom: 14
lowestPRESS: 17
@JulienLamour Yeah. I think as you say, and as I have noticed in the past, there may not be a "silver bullet" for component selection and it's somewhat case by case. See the NEON example above: with that large dataset the best approach is the basic permutation approach, because both PLS and the lowest-PRESS comparison overfit. I think we can call it good at this point; not sure we are going to come up with something better right now.
One more change: we should probably call the three approaches "pls", "firstMin", and "lowestPRESS" instead of pls, custom, lowestPRESS. Thoughts @neo0351?
I think the names should be 'pls', 'firstPlateau','firstMin'
@JulienLamour OK perfect. One more set of updates to make these changes and then I think the scripts are done. I will need to re-run the vignettes, but otherwise good to go.
You went very quickly for me these last days, so I haven't checked the code in detail. I ran the code and saw that there are no bugs, but I didn't check, for example, the outputs, coefficients, things like that. But I guess you already did?
@JulienLamour my review hasn't raised any flags, as the outputs look correct to me. Take a look at the README page and the URLs that take you to pre-baked output. Or if you get a chance, please run some of the examples. But I think we are good.
@JulienLamour Could you give me a quick summary of the three methods for the readme? Maybe a sentence for each?
The method from the pls package consists of choosing the model with the fewest components that is still less than one standard error away from the overall best model. The method 'first min' consists of choosing the first component that gives statistically (t-test) the same result as the following component. This method finds the first 'plateau' in the decrease of PRESS. The last method 'lowestPRESS' finds the first component that gives statistically (t-test) the same result as the overall best model.
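For reference, the one-standard-error rule described above is available directly in the pls package as selectNcomp(). A minimal sketch using the package's built-in yarn data (the dataset and ncomp choice here are illustrative, not from this repository):

```r
library(pls)

# built-in example data: NIR spectra predicting yarn density
data(yarn)

# fit a PLSR model with cross-validation (ncomp chosen for illustration)
fit <- plsr(density ~ NIR, ncomp = 15, data = yarn, validation = "CV")

# fewest components within one standard error of the overall best model
ncomp_onesigma <- selectNcomp(fit, method = "onesigma")
ncomp_onesigma
```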
To answer Shawn's comment, the method 'firstMin' (i.e. the method previously called 'custom') should maybe be renamed 'firstPlateau' since it doesn't look for the minimum. The method 'lowestPRESS' could actually be renamed 'firstMin'. So it would be:
The method from the pls package consists of choosing the model with the fewest components that is still less than one standard error away from the overall best model. The method 'firstPlateau' consists of choosing the first component that gives statistically (t-test) the same result as the following component. This method finds the first 'plateau' in the decrease of PRESS. The last method 'firstMin' finds the first component that gives statistically (t-test) the same result as the overall best model.
@serbinsh How shall we proceed? Are we renaming 'custom'?
@neo0351 I am working on revisions that rename the options to pls, firstMin, and firstPlateau
@JulienLamour @serbinsh is this now correct? The 'pls' option chooses the model with the fewest components that is still less than one standard error away from the overall best model. The 'firstPlateau' option chooses the first component that gives statistically (t-test) the same result as the following component; this method finds the first 'plateau' in the decrease of PRESS. 'firstMin' finds the first component that gives statistically (t-test) the same result as the overall best model.
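A minimal sketch of the two t-test rules as summarized above, assuming a PRESS matrix with one row per resampling iteration and one column per component. The names (`pressDF`, `select_components`) and the alpha value are illustrative, not the package's actual code:

```r
select_components <- function(pressDF, alpha = 0.05) {
  mean_press <- colMeans(pressDF)
  best <- which.min(mean_press)  # component with overall lowest mean PRESS

  # 'firstPlateau': first component whose PRESS is statistically the same
  # as that of the *following* component (first plateau in the curve)
  firstPlateau <- best
  for (i in seq_len(ncol(pressDF) - 1)) {
    if (t.test(pressDF[, i], pressDF[, i + 1])$p.value > alpha) {
      firstPlateau <- i
      break
    }
  }

  # 'firstMin': smallest number of components whose PRESS is statistically
  # the same as that of the overall best model
  firstMin <- best
  for (i in seq_len(best)) {
    if (t.test(pressDF[, i], pressDF[, best])$p.value > alpha) {
      firstMin <- i
      break
    }
  }

  c(firstPlateau = firstPlateau, firstMin = firstMin, best = best)
}
```

Because firstMin compares against the overall minimum rather than just the next component, it can select more components than firstPlateau when the PRESS curve has a local flat spot before its true minimum.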
Confirmed that the updated component-selection code is much faster with NEON!
6 first plateau, 15 first min, 18 pls package
Now the default pls version is much slower with NEON than our methods... interesting.
Also, if we wanted to, we could use map or another parallel loop with our approaches, since each iteration is independent: e.g. a foreach() over each random subset of the data.
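A hedged sketch of that idea using base R's parallel package; find_press() here is a hypothetical stand-in for one PRESS-resampling iteration (draw a random subset, fit the PLSR, return PRESS per component), not the actual function in the scripts:

```r
library(parallel)

# Hypothetical stand-in for one resampling iteration; the real version
# would fit a PLSR on a random data split. Faked here with sorted noise.
find_press <- function(i) {
  set.seed(i)
  sort(runif(5, min = 1, max = 10), decreasing = TRUE)
}

# Iterations are independent, so they can run in parallel.
# (mc.cores must be 1 on Windows; mclapply forks on Unix-alikes.)
press_list <- mclapply(seq_len(50), find_press, mc.cores = 2)
pressDF <- do.call(rbind, press_list)  # 50 x 5 matrix of PRESS values
```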
It's weird, firstPlateau and firstMin should take the same amount of time.
@JulienLamour why? I thought firstPlateau was the first time the next component's PRESS is not significantly less than the preceding one's, but that can be a local minimum, and firstMin is when we hit the first component that is similar to the overall minimum? There have been lots of cooks and changes, so why don't you actually run the code, look at what it's generating, and make sure we are all on the same page.
Basically, the computation time is spent performing all the PLS fits and predictions, and that work is the same for both approaches since they do the same thing. The calculation of the number of components afterwards should be nearly instantaneous.
...and if they should be the same, then why do we have two different options? My understanding was that plateau was based on my original t.test over the vector of components, where it finds when the next-highest component's PRESS isn't significantly better than the previous one's, but we were concerned that was too conservative.
Wait, now I don't know what you mean @JulienLamour. I am talking about the number of components selected, not computation time.
Oh I see!! My bad
Yeah, now the slowest method is the default method in the pls package, whereas yes, our two approaches are the same in terms of computation time!
Closing this since we have achieved the goal laid out in this issue
Both approaches are less than desirable for different reasons
1) pls jackknife selection often results in more components than necessary
2) The t.test permutation requires a large enough number of iterations, otherwise the auto-select chooses far too few. Or, if there is a lot of variance in the data, it can result in the selection of a small number of components. For example, with the https://ecosis.org/package/leaf-reflectance-plant-functional-gradient-ifgg-kit dataset my tests keep selecting 2 components, which isn't correct. I wonder if we can refine how we find the minimum?