lulizou / boostme

R package to impute methylation within WGBS using machine learning
MIT License

error: `size` must be less or equal than ... (size of data), set `replace` = TRUE to use sampling with replacement #2

Closed MohamedRefaat92 closed 1 year ago

MohamedRefaat92 commented 5 years ago

Dear developers,

I am running the github version of boostme and I get the following error:

size must be less or equal than 180249 (size of data), set replace = TRUE to use sampling with replacement

Firstly, I would really appreciate it if you could help me interpret the error. Secondly, I couldn't find arguments called `size` or `replace` in the boostme() function to set.

Any help would be appreciated.

Best, Mohamed Shoeb

lulizou commented 5 years ago

Hi Mohamed,

I think you may need to change the trainSize, validateSize, and testSize arguments to fit the size of your data (trainSize + validateSize + testSize <= total number of CpGs in your data). The default for these values is 1 million. I have just pushed a more informative error message for this. Hope that helps and let me know if there is still an error.
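To make this concrete, a sketch of the adjusted call for the ~180k-CpG dataset from the error above (the split values here are illustrative; `bs` is assumed to be your BSseq object):

```r
library(boostme)

# The defaults (1 million CpGs each) exceed this dataset's 180,249 CpGs;
# scale the splits down so trainSize + validateSize + testSize <= total CpGs.
imputed <- boostme(bs,
                   trainSize    = 100000,
                   validateSize = 40000,
                   testSize     = 40000)
```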

Best, Luli

MohamedRefaat92 commented 5 years ago

Hi Luli,

Thank you for your prompt response. I found that some of the samples are very small compared to the rest of the data. I will need to filter them before running boostme.

Another question: the background section of the paper mentions that multiple samples should belong to the same disease state, but my BSseq object contains methylation information for samples from different disease states. Could this pose a problem for the analysis?

Finally, I've dropped the value of the min_cov argument to 2 because the data is very sparse. Do you find this value appropriate, compared to the default of 10?

Best regards, Mohamed Shoeb

lulizou commented 5 years ago

Hi Mohamed,

For the first question, if the goal is to look for differences among different disease states, it may pose an issue for analysis, since boostme uses the average across all samples as a predictor. This would diminish any difference you might be looking for. One possible solution is to use sampleAvg = F so that the average is not used. For the second question, I'd expect lower accuracy using min_cov = 2 vs. 10. Whether or not it is appropriate depends on your tolerance for error in the imputed values. Hope this helps!
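For reference, a minimal sketch of that call, using the `sampleAvg` argument named above (other arguments left at their defaults; `bs` is assumed to be your BSseq object):

```r
# Disable the cross-sample average feature so that between-group
# methylation differences are not smoothed away by the predictor.
imputed <- boostme(bs, sampleAvg = FALSE)
```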

Best, Luli

MohamedRefaat92 commented 5 years ago

Hi Luli,

Thank you for your answers. Regarding the first answer, I believe the way a BSseq object is constructed requires all samples, regardless of disease state, to be in the same methylation and coverage matrices, where every column represents a single sample. The disease state of each sample is saved in the phenotype table, which can be accessed using pData().

Would you suggest that I start by subsetting the BSseq object based on the disease state, then run boostme() separately on each subset? and will I be able to combine the results from the two subsets into a single BSseq object for downstream differential methylation analysis?
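A sketch of what that subsetting could look like with bsseq's column subsetting and pData() (the phenotype column name `disease` and its levels are hypothetical; adjust to your object):

```r
library(bsseq)

# Split the BSseq object by disease state (columns are samples),
# then run imputation on each group independently.
cases    <- bs[, pData(bs)$disease == "case"]
controls <- bs[, pData(bs)$disease == "control"]

imputed_cases    <- boostme(cases)
imputed_controls <- boostme(controls)
```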

Best, Mohamed Shoeb

lulizou commented 5 years ago

Hi Mohamed,

Running separately on each subset is a good idea. Putting the results back into a single BSseq object is difficult, since BSseq objects require a coverage value from which the % methylation is calculated. Imputation predicts the % value but does not provide a coverage value. You could put in "dummy" coverage values reflecting your confidence in the estimates, but I have not experimented with how this would affect downstream applications.
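If one did want to experiment with the dummy-coverage idea, a hypothetical reassembly might look like the following (assumes `imputed` is the methylation matrix returned by boostme() for the same loci as `bs`; the constant coverage of 10 is arbitrary and, as noted above, untested downstream):

```r
library(bsseq)

# Wrap imputed methylation fractions back into a BSseq object using a
# constant "dummy" coverage. This makes every imputed value look equally
# (and artificially) confident to downstream tools.
dummy_cov  <- matrix(10, nrow = nrow(imputed), ncol = ncol(imputed))
bs_imputed <- BSseq(chr = as.character(seqnames(bs)),
                    pos = start(bs),
                    M   = round(imputed * dummy_cov),
                    Cov = dummy_cov,
                    sampleNames = sampleNames(bs))
```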

Best, Luli

MohamedRefaat92 commented 5 years ago

Hi Luli,

I've run boostme with different minimum coverage values (1 and 3) to see the effect on the imputed matrix. That said, why do the imputed matrices still contain NaN values? Is this normal for highly sparse datasets?

Best, Mohamed

lulizou commented 5 years ago

Hi Mohamed,

They will still contain NaN values where at least one feature used in the model was NaN. This is probably normal for highly sparse datasets since neighboring CpGs/other samples also have a higher chance of being NaN.
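One way to gauge this is to compare missingness before and after imputation (a sketch; assumes `bs` is the input BSseq object and `imputed` is the matrix returned by boostme()):

```r
library(bsseq)

# Raw methylation is M/Cov, so entries are NaN wherever coverage is 0.
raw <- getMeth(bs, type = "raw")

mean(is.nan(raw))      # fraction of entries missing going in
mean(is.nan(imputed))  # fraction still missing after imputation
```

In a very sparse dataset the first number will be high, so many CpGs will have NaN neighbors or NaN values in other samples, leaving their model features undefined.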

Best, Luli