plantphys / spectratrait

A tutorial R package for illustrating how to fit, evaluate, and report spectra-trait PLSR models. The package provides functions to enhance the base functionality of the R pls package, identify an optimal number of PLSR components, standardize model validation, and vignette examples that utilize datasets sourced from EcoSIS (ecosis.org)
GNU General Public License v3.0
12 stars 9 forks source link

Bug: need to generalize assumptions in base R data split #27

Closed serbinsh closed 4 years ago

serbinsh commented 4 years ago

@neo0351

FYI -

> apply(plsr_data[, split_var], MARGIN = 1, FUN = function(x) paste(x, collapse = " "))
Error in apply(plsr_data[, split_var], MARGIN = 1, FUN = function(x) paste(x,  : 
  dim(X) must have a positive length

I found a bug in your split function when trying out a new dataset https://ecosis.org/package/leaf-reflectance-plant-functional-gradient-ifgg-kit ID: 3cf6b27e-d80e-4bc7-b214-c95506e46daa

Not yet sure what the issue is but it looks at as coded it assumes there will be two grouping variables. We need this to be flexible enough to handle 1+ grouping variables

And also FYI - if I try with two grouping vars I get this

> split_data <- create_data_split(approach=method, split_seed=2356812, prop=0.8, 
+                                 group_variables=c("Growth_Form","Plant_Species"))
NA   Cal: 79.9729364005413%
 Not enough observations

Again these functions need to be general to allow for flexibility. We will need to fix this to allow for different numbers of grouping variables with different numbers of obs.

serbinsh commented 4 years ago

Pretty sure this is the area causing problems

create_data_split <- function(approach=NULL, split_seed=123456789, prop=0.8,
                              group_variables=NULL) {
  set.seed(split_seed)
  if(!is.null(approach)) {
    if (approach=="base") {
      plsr_data$CalVal <- NA
      split_var <- group_variables
      plsr_data$ID <- apply(plsr_data[, split_var], MARGIN = 1, FUN = function(x) paste(x, collapse = " "))
neo0351 commented 4 years ago

can you email me the data set you are using? I agree, this needs to work on any data set.

serbinsh commented 4 years ago

I provided the EcoSIS link to the dataset? Do you actually need me to email it to you?

neo0351 commented 4 years ago

Oh, sorry, missed that you linked another data set

neo0351 commented 4 years ago

@serbinsh Ok, I see the problem. It's the grep looking for NA's. Which will cause issues. I've solved this once upon a time and don't remember the fix right now. The real question is, how do we want to handle say species that has NA's in it? Drop them? Leave them?

neo0351 commented 4 years ago

@serbinsh What do we want to do with NA's in the grouping variables?

serbinsh commented 4 years ago

We dont want to drop NA's. I think we may just have to shunt them as another "factor" that is pool all NA's together. What do you think? The risk is dropping NAs could mean we drop a lot of good data for training

neo0351 commented 4 years ago

Good point. I'll keep them in. We should check Julien's method and see what it does.

neo0351 commented 4 years ago

I'll get you the fix after lunch. I think I need to update all the things before submitting the fix.

serbinsh commented 4 years ago

Fixed by @neo0351