Possible bug in create_data_split when working with groups

asierrl commented 2 years ago

Aparently, the function performs a stratified random sampling for cal_data, but always takes the last rows of the dataset for val_data, even reusing cases already selected for cal_data:

plot<- rep(c("plot1", "plot2", "plot3"),each=42) season<- rep(1:6, 21) disease<- c(rep(0,84), rep(1,42)) d<- seq(1:126) df <- data.frame(plot,season,disease,d) df <- df %>% mutate(id=row_number()) names(df) [1] "plot" "season" "disease" "d" "id"
df$id [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 [55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 [73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 [91] 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 [109] 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126

split_data <- spectratrait::create_data_split(dataset=df, approach="dplyr", split_seed=7529075, prop=0.8, group_variables=c("plot", "season", "disease"))

split_data$cal_data$id [1] 7 19 13 25 37 14 38 20 32 26 3 21 15 9 27 10 34 16 [19] 4 40 11 35 5 17 41 12 36 6 18 42 73 55 67 49 43 44 [37] 68 56 62 50 69 75 51 45 63 52 58 82 76 70 53 71 59 77 [55] 47 54 48 66 84 72 85 109 121 91 103 86 116 98 110 92 93 123 [73] 117 99 111 112 118 94 88 100 89 113 101 95 107 102 90 120 126 114 split_data$val_data$id [1] 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 [19] 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126

table(split_data$cal_data$plot,split_data$cal_data$season, split_data$cal_data$disease) , , = 0 1 2 3 4 5 6 plot1 5 5 5 5 5 5 plot2 5 5 5 5 5 5 plot3 0 0 0 0 0 0 , , = 1 1 2 3 4 5 6 plot1 0 0 0 0 0 0 plot2 0 0 0 0 0 0 plot3 5 5 5 5 5 5

table(split_data$val_data$plot,split_data$val_data$season, split_data$val_data$disease) , , = 1
1 2 3 4 5 6 plot3 6 6 6 6 6 6

serbinsh commented 2 years ago

@asierrl Thank you for using our package and bringing this to our attention! I am traveling right now for the Holidays but as soon as I have a chance, I will take a look at this in more detail.

A few followup questions: 1) does this only happen when using more than 2 groupings? I am not sure I noticed this with our two-group examples. Could you try the same dataset again with just 2? Or better yet, do you see this same behavior in our vignettes? 2) what approach for grouping were you using? 3) do you have sufficient sample size by each group?

As I mention I will take a look and let you know what I find out.

serbinsh commented 2 years ago

I have recreated what you found using the dplyr method.

When using our base R implementation

split_data2 <- spectratrait::create_data_split(dataset=df, approach="base", 
                                              split_seed=7529075, prop=0.8, 
                                              group_variables=c("plot", "season", "disease"))

split_data2
split_data2$cal_data$id
split_data2$val_data$id

I get the expected behavior, sampling across the data range.

> split_data2$cal_data$id
  [1]   1   2   4   5   6   7   9  10  11  12  13  14  15  16  18  19  20  21  23  25  26  27  28  29  30  32  33  34
 [29]  35  36  37  38  39  40  41  42  43  44  46  47  48  49  50  51  52  53  54  55  56  57  58  59  61  62  63  65
 [57]  66  67  68  69  70  71  72  75  76  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96
 [85]  98  99 101 102 103 104 105 106 107 109 110 111 112 114 115 117 118 119 120 121 122 124 125 126

> split_data2$val_data$id
 [1]   3   8  17  22  24  31  45  60  64  73  74  77  97 100 108 113 116 123

If i change it to select 60% for training

> split_data2$cal_data$id
 [1]   4   5   7   8  10  11  12  13  14  15  16  17  19  21  22  24  25  30  32  33  38  39  41  42  43  44  46  47
[29]  49  50  51  53  54  56  57  58  60  67  69  70  71  72  73  74  75  77  78  82  85  86  87  88  89  91  92  93
[57]  94  96  97  98 101 105 108 111 112 113 114 115 119 120 122 124

> split_data2$val_data$id
 [1]   1   2   3   6   9  18  20  23  26  27  28  29  31  34  35  36  37  40  45  48  52  55  59  61  62  63  64  65
[29]  66  68  76  79  80  81  83  84  90  95  99 100 102 103 104 106 107 109 110 116 117 118 121 123 125 126

Looks like the dplyr implementation may not be parsing the groups properly anymore and it could be related to updates in tidy. I will look into making that work as expected.

@neo0351 @JulienLamour thoughts? It is in the split function here: https://github.com/TESTgroup-BNL/spectratrait/blob/4e472c6ae0e5833faae7d2f0f2b2fd9a2b48c196/R/create_data_split.R#L48. We need to isolate why the dplyr grouping is resulting in duplicated samples in cal/val and also reverts to selecting the last chunk of the full dataset. At least at the moment that seems to be whats going on.

serbinsh commented 2 years ago

I see what the issue is. When the dplyr approach was setup, we used

https://github.com/TESTgroup-BNL/spectratrait/blob/4e472c6ae0e5833faae7d2f0f2b2fd9a2b48c196/R/create_data_split.R#L52

to separate val from cal but for whatever reason, the issue I see now is that

row.names(cal.plsr.data)

is getting the row names of the new data frame, i.e. restarted numbering and does not pertain to the original row numbers of "dataset" (the input). Thus we are getting repeats and duplications. Instead we need to keep an id column from the start and use that to find the non-overlaps then this approach will work fine.

serbinsh commented 2 years ago

This isnt it exactly but something like this

`%nin%` <- Negate(`%in%`)
df <- df %>% mutate(ids=row_number())
cal.plsr.data <- df %>%
  group_by_at(vars(all_of(groupings))) %>% 
  slice(sample(1:n(), prop*n())) %>% 
  data.frame()
cal.plsr.data

val.plsr.data <-df[df$ids %nin% cal.plsr.data$ids,]

cal.plsr.data$ids
val.plsr.data$ids

> cal.plsr.data$ids
 [1]  31   1  37  25  13   8  32  20  26   2  27   9  33   3  15   4  40  10  22  28  41  17  29  23  11  24  30  12
[29]  18  36  43  73  61  55  67  74  44  80  56  62  57  69  63  75  81  76  52  64  58  46  53  65  77  59  71  48
[57]  60  78  84  72  91 115 103  85 109 116  98 110  86 104 123  99  93  87 117 100 106 118 112  94 101 107  89 113
[85]  95 120  96  90 102 126
> val.plsr.data$ids
 [1]   5   6   7  14  16  19  21  34  35  38  39  42  45  47  49  50  51  54  66  68  70  79  82  83  88  92  97 105
[29] 108 111 114 119 121 122 124 125

asierrl commented 2 years ago

I think you don't need a response to your questions any more, but just in case... 1) Yes, I find the same issue with a two variable grouping split, and even with a one-variable split. 2) I just implemented the tutorial you provide in your article at Journal of Experimental Botany, and I find the same issue there with only a grouping variable. 3) Sample size is scarce (7 trees at each combination of grouping variables). I tried again with a proportion of 0.6, which should ensure at least 2 samples per permutation in each combination. Same result.

serbinsh commented 2 years ago

Implementing the fix

> library(spectratrait)
> 
> plot<- rep(c("plot1", "plot2", "plot3"),each=42)
> season<- rep(1:6, 21)
> disease<- c(rep(0,84), rep(1,42))
> d<- seq(1:126)
> df <- data.frame(plot,season,disease,d)
> df <- df %>% mutate(id=row_number())
> names(df)
[1] "plot"    "season"  "disease" "d"       "id"     
> 
> split_data <- spectratrait::create_data_split(dataset=df, approach="dplyr", 
+                                               split_seed=7529075, prop=0.8, 
+                                               group_variables=c("plot", "season", "disease"))
> 
> #split_data
> split_data$cal_data$id
 [1]   7  19  13  25  37  14  38  20  32  26   3  21  15   9  27  10  34  16   4  40  11  35   5  17  41  12  36   6  18  42  73  55  67  49  43
[36]  44  68  56  62  50  69  75  51  45  63  52  58  82  76  70  53  71  59  77  47  54  48  66  84  72  85 109 121  91 103  86 116  98 110  92
[71]  93 123 117  99 111 112 118  94  88 100  89 113 101  95 107 102  90 120 126 114
> split_data$val_data$id
 [1]   1   2   8  22  23  24  28  29  30  31  33  39  46  57  60  61  64  65  74  78  79  80  81  83  87  96  97 104 105 106 108 115 119 122 124
[36] 125
> which(split_data$cal_data$id %in% split_data$val_data$id)
integer(0)

I will be updating the vignettes and testing to make sure it holds for our examples.

serbinsh commented 2 years ago

@asierrl please see PR #89 which should address the issues with create_data_split. I also added a test to ensure its not duplicating any obs between cal and val. As far as I can see, the fix has addressed your concerns but if you can please run this branch and check for me. Thanks

asierrl commented 2 years ago

@serbinsh I already check the new version in branch data_split_fix and now it works smoothly, the validation data set now is correctly stratified across the three factors. Thanks for sharing your great work!!

serbinsh commented 2 years ago

Excellent @asierrl thanks for the help! I will merge that into the main branch now.

serbinsh commented 2 years ago

Fix merged - closing this issue

plantphys / spectratrait

Possible bug in create_data_split when working with groups #87