Closed asierrl closed 2 years ago
@asierrl Thank you for using our package and bringing this to our attention! I am traveling right now for the Holidays but as soon as I have a chance, I will take a look at this in more detail.
A few followup questions: 1) does this only happen when using more than 2 groupings? I am not sure I noticed this with our two-group examples. Could you try the same dataset again with just 2? Or better yet, do you see this same behavior in our vignettes? 2) what approach for grouping were you using? 3) do you have sufficient sample size by each group?
As I mention I will take a look and let you know what I find out.
I have recreated what you found using the dplyr method.
When using our base R implementation
split_data2 <- spectratrait::create_data_split(dataset=df, approach="base",
split_seed=7529075, prop=0.8,
group_variables=c("plot", "season", "disease"))
split_data2
split_data2$cal_data$id
split_data2$val_data$id
I get the expected behavior, sampling across the data range.
> split_data2$cal_data$id
[1] 1 2 4 5 6 7 9 10 11 12 13 14 15 16 18 19 20 21 23 25 26 27 28 29 30 32 33 34
[29] 35 36 37 38 39 40 41 42 43 44 46 47 48 49 50 51 52 53 54 55 56 57 58 59 61 62 63 65
[57] 66 67 68 69 70 71 72 75 76 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
[85] 98 99 101 102 103 104 105 106 107 109 110 111 112 114 115 117 118 119 120 121 122 124 125 126
> split_data2$val_data$id
[1] 3 8 17 22 24 31 45 60 64 73 74 77 97 100 108 113 116 123
If i change it to select 60% for training
> split_data2$cal_data$id
[1] 4 5 7 8 10 11 12 13 14 15 16 17 19 21 22 24 25 30 32 33 38 39 41 42 43 44 46 47
[29] 49 50 51 53 54 56 57 58 60 67 69 70 71 72 73 74 75 77 78 82 85 86 87 88 89 91 92 93
[57] 94 96 97 98 101 105 108 111 112 113 114 115 119 120 122 124
> split_data2$val_data$id
[1] 1 2 3 6 9 18 20 23 26 27 28 29 31 34 35 36 37 40 45 48 52 55 59 61 62 63 64 65
[29] 66 68 76 79 80 81 83 84 90 95 99 100 102 103 104 106 107 109 110 116 117 118 121 123 125 126
Looks like the dplyr implementation may not be parsing the groups properly anymore and it could be related to updates in tidy. I will look into making that work as expected.
@neo0351 @JulienLamour thoughts? It is in the split function here: https://github.com/TESTgroup-BNL/spectratrait/blob/4e472c6ae0e5833faae7d2f0f2b2fd9a2b48c196/R/create_data_split.R#L48. We need to isolate why the dplyr grouping is resulting in duplicated samples in cal/val and also reverts to selecting the last chunk of the full dataset. At least at the moment that seems to be whats going on.
I see what the issue is. When the dplyr approach was setup, we used
to separate val from cal but for whatever reason, the issue I see now is that
row.names(cal.plsr.data)
is getting the row names of the new data frame, i.e. restarted numbering and does not pertain to the original row numbers of "dataset" (the input). Thus we are getting repeats and duplications. Instead we need to keep an id column from the start and use that to find the non-overlaps then this approach will work fine.
This isnt it exactly but something like this
`%nin%` <- Negate(`%in%`)
df <- df %>% mutate(ids=row_number())
cal.plsr.data <- df %>%
group_by_at(vars(all_of(groupings))) %>%
slice(sample(1:n(), prop*n())) %>%
data.frame()
cal.plsr.data
val.plsr.data <-df[df$ids %nin% cal.plsr.data$ids,]
cal.plsr.data$ids
val.plsr.data$ids
> cal.plsr.data$ids
[1] 31 1 37 25 13 8 32 20 26 2 27 9 33 3 15 4 40 10 22 28 41 17 29 23 11 24 30 12
[29] 18 36 43 73 61 55 67 74 44 80 56 62 57 69 63 75 81 76 52 64 58 46 53 65 77 59 71 48
[57] 60 78 84 72 91 115 103 85 109 116 98 110 86 104 123 99 93 87 117 100 106 118 112 94 101 107 89 113
[85] 95 120 96 90 102 126
> val.plsr.data$ids
[1] 5 6 7 14 16 19 21 34 35 38 39 42 45 47 49 50 51 54 66 68 70 79 82 83 88 92 97 105
[29] 108 111 114 119 121 122 124 125
I think you don't need a response to your questions any more, but just in case... 1) Yes, I find the same issue with a two variable grouping split, and even with a one-variable split. 2) I just implemented the tutorial you provide in your article at Journal of Experimental Botany, and I find the same issue there with only a grouping variable. 3) Sample size is scarce (7 trees at each combination of grouping variables). I tried again with a proportion of 0.6, which should ensure at least 2 samples per permutation in each combination. Same result.
Implementing the fix
> library(spectratrait)
>
> plot<- rep(c("plot1", "plot2", "plot3"),each=42)
> season<- rep(1:6, 21)
> disease<- c(rep(0,84), rep(1,42))
> d<- seq(1:126)
> df <- data.frame(plot,season,disease,d)
> df <- df %>% mutate(id=row_number())
> names(df)
[1] "plot" "season" "disease" "d" "id"
>
> split_data <- spectratrait::create_data_split(dataset=df, approach="dplyr",
+ split_seed=7529075, prop=0.8,
+ group_variables=c("plot", "season", "disease"))
>
> #split_data
> split_data$cal_data$id
[1] 7 19 13 25 37 14 38 20 32 26 3 21 15 9 27 10 34 16 4 40 11 35 5 17 41 12 36 6 18 42 73 55 67 49 43
[36] 44 68 56 62 50 69 75 51 45 63 52 58 82 76 70 53 71 59 77 47 54 48 66 84 72 85 109 121 91 103 86 116 98 110 92
[71] 93 123 117 99 111 112 118 94 88 100 89 113 101 95 107 102 90 120 126 114
> split_data$val_data$id
[1] 1 2 8 22 23 24 28 29 30 31 33 39 46 57 60 61 64 65 74 78 79 80 81 83 87 96 97 104 105 106 108 115 119 122 124
[36] 125
> which(split_data$cal_data$id %in% split_data$val_data$id)
integer(0)
I will be updating the vignettes and testing to make sure it holds for our examples.
@asierrl please see PR #89 which should address the issues with create_data_split. I also added a test to ensure its not duplicating any obs between cal and val. As far as I can see, the fix has addressed your concerns but if you can please run this branch and check for me. Thanks
@serbinsh I already check the new version in branch data_split_fix and now it works smoothly, the validation data set now is correctly stratified across the three factors. Thanks for sharing your great work!!
Excellent @asierrl thanks for the help! I will merge that into the main branch now.
Fix merged - closing this issue
Aparently, the function performs a stratified random sampling for cal_data, but always takes the last rows of the dataset for val_data, even reusing cases already selected for cal_data: