sdechaumet / ramopls

Subsampling guidance #4

Open Althalis opened 4 months ago

Althalis commented 4 months ago

Dear Sylvain, first of all, thank you again for your time. I managed to run run_AMOPLS, but I had to include undersampling because my dataset is not balanced. It only worked with subsampling = 1, and I don't know whether that is a 'correct' thing to do.

I was wondering where I can find guidance for the nb_perm, subsampling and parallel parameters.

Below are some details for when it didn't work (with the values used in your examples). I understand that my dataset is particularly big, so maybe that's why.

[Another note that could be useful for other users: I needed to convert my data.frames into data.tables for things to run. Just FYI.]

Thank you, best Julie

dim(dm_jc)
[1] 44 6402
dim(smd_jc)
[1] 44 3
result_unbalanced <- run_AMOPLS(datamatrix = dm_jc, samplemetadata = smd_jc, factor_names = c("Group","Isolation_method"), nb_perm = 100, subsampling = 10, parallel = 3)
Data are unbalanced in: Group Isolation_method Group x Isolation_method
Data are unbalanced, running stratified subsampling.
Run sub-sampling: 1
Error in svd(data) : infinite or missing values in 'x'

Althalis commented 2 months ago

Dear Sylvain, any chance that you could give me some input here? That would be most helpful. Thank you

sdechaumet commented 2 months ago

Hi @Althalis

Sorry for the delay. The subsampling method has been implemented according to Boccard et al., 2019 (full reference below). You can find more information on interpreting subsampled results in the article.

Subsampling should be performed several times to ensure robustness against outliers or extreme samples. However, in your case, it fails at the first iteration for unknown reasons. One possible reason could be the minimal number of samples in the smallest groups: subsampling reduces the number of samples in each group (factor) to match the smallest group for a balanced design. If you use interactions, the smallest number of samples is calculated for each combination of your factors. It will fail if any combination has fewer than 3 samples.
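
As a rough illustration of the balancing step described above, here is a minimal sketch. It is not the package's actual code. It draws the same number of samples from every Group x Isolation_method combination, using the size of the smallest combination. The toy metadata and the helper name balanced_subsample are hypothetical and exist only for this example.

```r
library(data.table)

# Hypothetical helper: keep n_min rows per factor combination, where n_min is
# the size of the smallest combination. This only illustrates the idea and is
# not the code used inside run_AMOPLS.
balanced_subsample <- function(metadata, factors) {
  dt <- copy(as.data.table(metadata))         # copy so the input is not modified
  dt[, row_id := .I]                          # remember original row positions
  n_min <- dt[, .N, by = factors][, min(N)]   # size of the smallest combination
  # Sample n_min row ids within each combination, without replacement
  keep <- dt[, .(row_id = row_id[sample.int(.N, n_min)]), by = factors]
  sort(keep$row_id)
}

# Toy, made-up design with 3 + 5 + 4 + 6 = 18 samples
toy <- data.table(
  Group            = rep(c("A", "A", "B", "B"), times = c(3, 5, 4, 6)),
  Isolation_method = rep(c("m1", "m2", "m1", "m2"), times = c(3, 5, 4, 6))
)

set.seed(1)
idx <- balanced_subsample(toy, c("Group", "Isolation_method"))
toy[idx, .N, by = .(Group, Isolation_method)]  # every combination now has 3 rows
```

Repeating such a draw several times, each with a different random selection, is what makes the result less dependent on any single outlying sample.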

Without seeing your data, it’s difficult to provide more detailed insights. In your session, with the data named as shown, please run the following command to check the design:

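library(data.table)

# Count the samples in each Group x Isolation_method combination
temp <- as.data.table(smd_jc)
temp_design <- temp[, .N, by = .(Group, Isolation_method)]
print(temp_design)
# Flag combinations with fewer than 3 samples
if (temp_design[N < 3, .N] > 0) {
  message("Not enough samples in the following groups:")
  print(temp_design[N < 3, ])
}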
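
Since the error comes from svd() reporting infinite or missing values, it may also be worth inspecting the data matrix itself. The sketch below uses a made-up matrix, and the helper name find_problem_columns is hypothetical. The zero-variance check rests on an assumption: if variables are scaled by their standard deviation at some point, a constant column would produce NaN. Whether run_AMOPLS scales the data this way is not confirmed here.

```r
# Hypothetical helper: report columns containing NA/NaN/Inf and columns with
# zero variance (which would become NaN if divided by their standard deviation).
find_problem_columns <- function(x) {
  m <- as.matrix(x)
  non_finite <- which(colSums(!is.finite(m)) > 0)
  constant <- which(apply(m, 2, function(v) {
    v <- v[is.finite(v)]              # ignore missing values for the variance
    length(v) < 2 || var(v) == 0
  }))
  list(non_finite = unname(non_finite), constant = unname(constant))
}

# Made-up example: column 2 contains an NA, column 3 is constant
toy_matrix <- cbind(a = c(1, 2, 3, 4), b = c(1, NA, 3, 4), c = c(5, 5, 5, 5))
res <- find_problem_columns(toy_matrix)
print(res)
```

On the real data this would be called as find_problem_columns(dm_jc), assuming dm_jc holds only numeric columns.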

If possible, you can make the data available to me so I can check where the problem might be.

Thank you for the note on data.frames!

All the best,

Sylvain

Boccard J, Tonoli D, Strajhar P, Jeanneret F, Odermatt A, Rudaz S (2019) Removal of batch effects using stratified subsampling of metabolomic data for in vitro endocrine disruptors screening. Talanta 195: 77–86