perishky / meffil

Efficient algorithms for analyzing DNA methylation data.
Artistic License 2.0
53 stars 28 forks source link

Creating chunks for parallelization using `meffil.normalize.samples()` error #45

Closed hhp94 closed 1 year ago

hhp94 commented 1 year ago

Dear Dr. Suderman,

I found a bug while running meffil.normalize.samples() using multiple cores.

Error in read.idat(paste(basename, "_Grn.idat", sep = ""), verbose = verbose) : 
  Filename does not exist:./data/idats/<Sentrid_id>_Grn.idat.gz./data/idats/<Sentrid_id>_Grn.idat

So I dug a bit deeper and I think I might have found the problem.

I think the problem is because in ./R/mclapply.r the function mclapply.safe() is chunking the work into smaller chunks based on the argument "max.bytes". However, this means that for a sample, one _Grn.idat might end up in a chunk while the _Red.idat might end up in another chunk and this will raise an error in ./R/rg.r read.rg() function which calls read.idat(). Updated with EDIT in the below comment.

I haven't dug deep enough into the codes to understand fully how the chunking works. So if you were to implement a fix, what would you suggest is the strategy? I can try to implement it in my fork and test it out.

Because of this issue, and another issue - many academics and students use Windows machines, may I suggest that this is a good opportunity to also replace the use of parallel::mclapply() with the use of the {future} package future.apply::future_lapply? This would allow academics to use the PSOCK cluster scheme to run meffil.qc() and meffil.normalize.samples() on their Windows machines. I tried a quick solution in my fork by replacing just 2 mclapply() calls in ./R/mclapply.r with future.apply::future_lapply and everything worked on my Windows machine.

Thank you so much for your work and looking forward to hearing back

hhp94 commented 1 year ago

I don't think the chunking is the issue. Will investigate further before reporting back.

EDIT: the problem is I was transferring the idats and files to and from my computer and my university's HPC. The paths to the idats of the QC objects are stored in the sample$basename and this needs to be updated before normalization. My apologies for the false error.

perishky commented 1 year ago

Thanks for the future_lapply suggestion. I'll have a look. I typically don't run analyses in Windows, so really useful to get feedback from those who do.