plantphys / spectratrait

A tutorial R package illustrating how to fit, evaluate, and report spectra-trait PLSR models. The package provides functions that enhance the base functionality of the R pls package, identify an optimal number of PLSR components, and standardize model validation, along with vignette examples that utilize datasets sourced from EcoSIS (ecosis.org).
GNU General Public License v3.0

nComp selection seems not to work properly #86

Closed · asierrl closed this issue 2 years ago

asierrl commented 2 years ago

I am trying to fit a PLSR model of hyperspectral (VNIR) data on SLA (mean of canopy values per tree) using pls and spectratrait, following the tutorial provided with the package. But I find a lack of congruence between the results of spectratrait::find_optimal_components and what I see in the training, cross-validation, and validation results for different numbers of components. find_optimal_components always selects 1 to 3 components (depending on the split proportion), but all three data sets seem to improve substantially at least up to 8 components.

[four attached images showing the nComp selection plot and the training, cross-validation, and validation results]
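For reference, the workflow being followed is roughly this (a sketch based on the spectratrait vignettes; argument names and signatures may differ between package versions, and "plot", "disease", "season", SLA, and Spectra are placeholders for the actual column names):

```r
# Sketch of the tutorial-style workflow (not verbatim from the vignettes)
library(pls)
library(spectratrait)

# 1. Stratified calibration/validation split on the experimental factors
split_data <- spectratrait::create_data_split(dataset = plsr_data,
                                              approach = "dplyr",
                                              split_seed = 1234567,
                                              prop = 0.8,
                                              group_variables = c("plot", "disease", "season"))
cal.plsr.data <- split_data$cal_data
val.plsr.data <- split_data$val_data

# 2. Permutation-based selection of the number of components,
#    using the calibration data only (no stratification at this step)
nComps <- spectratrait::find_optimal_components(dataset = cal.plsr.data,
                                                method = "firstMin",
                                                maxComps = 20,
                                                iterations = 50,
                                                prop = 0.70)

# 3. Final model with the selected number of components and LOO cross-validation
#    (assumes the spectra are stored as a matrix column named "Spectra")
plsr.out <- pls::plsr(SLA ~ Spectra, ncomp = nComps, validation = "LOO",
                      data = cal.plsr.data)

# 4. Independent evaluation on the held-out validation set
pls::RMSEP(plsr.out, newdata = val.plsr.data)
pls::R2(plsr.out, newdata = val.plsr.data)
```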

There is surely something I am missing, but I cannot figure out what. There seems to be general agreement between the training and validation results, while the cross-validation results follow a very different pattern. My data come from a complex experimental design with the factors plot, disease, and season, and I use these factors to create the validation data set; however, it seems that find_optimal_components does not account for this stratification. Could this be the reason for the differing behaviour of the cross-validation? Could I use spectratrait::create_data_split and then run the model iteratively over different splits to select the optimal number of components? Even disregarding this difference, it seems that the optimal number of components should be around 8 (if we use cross-validation as a reference), but the procedure always gives values between 1 and 3 (I tried the different methods). Might the lack of stratification in the cross-validation be behind this odd behaviour?

Best regards, Asier

asierrl commented 2 years ago

I wrote a script where, instead of using pls_permutate, I run a loop in which I split the data with spectratrait::create_data_split (with a pseudorandom seed in each iteration), fit the model to cal_data, and then validate it on val_data. It always selects a higher number of LVs, and the resulting models make much more sense. Still, I sometimes noticed issues that might be related to what I think may be a bug in the create_data_split function (I report it in a different issue).
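Schematically, the loop is of this form (a sketch only, not the exact script; object and column names such as SLA, Spectra, and the grouping variables are placeholders):

```r
# Sketch of the iterative split/fit/validate loop for choosing the number of LVs
library(pls)

max_comps <- 15
n_iter <- 100
rmsep_mat <- matrix(NA_real_, nrow = n_iter, ncol = max_comps)

for (i in seq_len(n_iter)) {
  # Stratified split of the calibration data; new pseudorandom seed each iteration
  split_i <- spectratrait::create_data_split(dataset = cal.plsr.data,
                                             approach = "base",
                                             split_seed = 1000 + i,
                                             prop = 0.8,
                                             group_variables = c("plot", "disease", "season"))
  fit_i <- pls::plsr(SLA ~ Spectra, ncomp = max_comps, validation = "none",
                     data = split_i$cal_data)

  # Predictions on the held-out portion for 1..max_comps components
  pred_i <- drop(predict(fit_i, newdata = split_i$val_data, ncomp = 1:max_comps))
  obs_i <- split_i$val_data$SLA
  rmsep_mat[i, ] <- sqrt(colMeans((pred_i - obs_i)^2))
}

# Choose the number of LVs at the (first) minimum of the mean held-out RMSEP
nComps <- which.min(colMeans(rmsep_mat))
```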

serbinsh commented 2 years ago

I am not exactly following your comment. Yes, you can write your own code to permute the data, like the loop you mention, but the point of this test/function is to use statistics to determine the optimal number of components. You can sometimes see the behaviour you describe if the data are "noisy". In the plots you present above, I can see that the components do indeed plateau at 3 because, as you can see, the distribution is not significantly better at 4. This is really just one way to try to objectively isolate the optimal number of components without overfitting. One issue we have seen in the literature is significant overfitting caused by looking for where the model improves the calibration fit statistics, which often leads to inflated accuracy.

You could also manually override the optimal number of components in a script, but as I can see in your nComp plot, you start to fit noise by component 7. Again, I think you may have some observations in your dataset that are throwing off the fits, and it would be worth checking whether you have any high-leverage outliers in the training data.

Also, we provide three different methods for running that test; which one did you try? All of them?

Finally, what did you mean by "It always selects a higher number of LVs and the resulting models make much more sense"? What does? And what is meant by "makes more sense"? Again, the function conducts an objective selection based on the PRESS distributions by component. Yes, you can force it to choose a higher number; the function is just meant to help you justify the selection or set up automated fitting workflows.
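To illustrate the general idea (this is a sketch of a PRESS-distribution stopping rule, not the exact spectratrait implementation): the permutation PRESS values at k and k + 1 components can be compared with a one-sided test, and the component count is kept where adding another component stops giving a significant improvement.

```r
# Illustrative stopping rule based on PRESS distributions by component.
# press_mat is assumed to be an iterations x maxComps matrix of permutation
# PRESS values (one row per permutation, one column per component count).
select_ncomp <- function(press_mat, alpha = 0.05) {
  max_comps <- ncol(press_mat)
  for (k in seq_len(max_comps - 1)) {
    # One-sided t-test: is PRESS with k + 1 components significantly lower?
    p_val <- t.test(press_mat[, k + 1], press_mat[, k],
                    alternative = "less")$p.value
    if (p_val > alpha) {
      return(k)  # no significant improvement beyond k components
    }
  }
  max_comps
}
```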

serbinsh commented 2 years ago

Also, looking at your plots again, I am having a little trouble understanding how the calibration always improves while the validation does not. What sort of dataset is this? Canopy spectra and traits? Are you able to share the data with me for testing? Even just a subset? I won't share it; it is just challenging to see what the issue is from this post.

I should also say that, at the moment, our function does not permute the calibration data and then validate/select components based on the independent validation data (the observations left out of the fitting), in order to keep them "independent". If you use the validation data to select components, the selection procedure is not as objective.

Also, have you manually set the number of components and continued the workflow? How was the overall fit in the end?
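For example, manually overriding the component count and checking the end-to-end fit could look like this (a sketch; object names assume the standard vignette-style workflow above):

```r
# Sketch: manually override nComps and inspect calibration, LOO CV, and
# independent validation statistics (cal.plsr.data / val.plsr.data are assumed)
nComps <- 8
plsr.out <- pls::plsr(SLA ~ Spectra, ncomp = nComps, validation = "LOO",
                      data = cal.plsr.data)

summary(plsr.out)                               # calibration and LOO CV summary
pls::RMSEP(plsr.out, newdata = val.plsr.data)   # independent validation RMSEP
pls::R2(plsr.out, newdata = val.plsr.data)      # independent validation R2
```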

asierrl commented 2 years ago

It seems that I did not explain the problem clearly enough. In fact, I closely followed your approach in the tutorial attached to Burnett et al. 2021. That is, I first split the data into a calibration and a validation data set. Then I selected the number of components by permuting on that calibration data set. Once the number of components was set, I fitted the final model with that nComps and LOO cross-validation. Last, I assessed the model by applying it to the validation data set. For all the response variables I tried, I got a really small number of components (1 to 3) and that opposite pattern of the LOO cross-validation relative to both calibration and validation (as shown in the plots above). The really small number of components puzzled me (as I did not expect a hyperspectrum to reduce to so few components), but the consistently opposite pattern of the cross-validation relative to both the calibration and validation data sets made me think I was making some kind of error. The only thing I could find was that the calibration/validation split used my three grouping variables, but that structure was not used in the permutation and component-selection procedure (nor, of course, in the LOO cross-validation).

My attempt to fix the issue only altered the permutation and component-selection procedure: in each run I split the calibration data set into what I called a training and a CV data set (by means of create_data_split). In each run, I fitted a model to the training data set, estimated RMSEP and R2 on the CV data set, and saved these values. I ran 100 iterations and estimated the number of components from the first minimum (but without the selection via p-values, just the first plain minimum). These results make more sense to me because the opposite trend between calibration and validation on one side and cross-validation on the other mostly disappeared, and because I found the expected trend of higher R2 in calibration, reduced in cross-validation, and smaller still in validation. After this selection of the optimal number of components, I fitted the final model on the whole calibration data set with LOO cross-validation and then evaluated it on the validation data set.

I am sending you three files: the whole data set (data_example.csv), the calibration data set (cal_example.csv), and the validation data set (val_example.csv) after splitting the data.
cal_example.csv
data_example.csv
val_example.csv

The response variable is SLA and my grouping variables are "Parcela", "Vuelo" and "Afeccion".

asierrl commented 2 years ago

After reading the comments in issue #87, the opposite trend between cal and val on one side and cross-val on the other disappears, and the comparison among the three plots makes sense now (although I have to increase the number of samples in the validation set, as I only have 7 samples in each combination of grouping variables). Now I have a doubt about using the fixed dplyr version or the base version when creating data splits: is there any fundamental difference between them?

serbinsh commented 2 years ago

Addressed in #87 and #93

serbinsh commented 2 years ago

This was fixed