sneumann / xcms

This is the git repository matching the Bioconductor package xcms: LC/MS and GC/MS Data Analysis

findPeaks.centwave appears to run single-threaded on a virtual Windows machine. #53

Closed eschen42 closed 8 years ago

eschen42 commented 8 years ago

This may be an issue rather than a bug.

I am running R 3.3.1 on a 24-core, 64-bit Windows virtual machine. I don't know whether the issue is the Windows build or virtualization, but I don't get multithreaded peak-finding. Indeed, when I was working with someone else on a physical multi-core machine, we didn't see any performance gain (or change in behavior) when we changed nSlaves.

Today I fetched xcms with:

source("https://bioconductor.org/biocLite.R")
biocLite("xcms")

No matter what parameters I pass to xcmsSet, and whether I use a GUI or run R --vanilla < threadtest.R, it always seems to pick peaks one file at a time, e.g.:

fewset <- xcmsSet(files = fewfiles, nSlaves = 22, scanrange = c(1184,6046),
                  method="centWave", ppm=2.5, peakwidth=c(2.5,9), mzdiff=-0.001,
                  noise=1e5, snthresh=10)

Processing on 22 cores.
Detecting mass traces at 2.5 ppm ...
% finished: 0 10 20 30 40 50 60 70 80 90 100
1300 m/z ROI's.

Detecting chromatographic peaks ...
% finished: 0 10 20 30 40 50 60 70 80 90 100
315 Peaks.
. . .

which is to say that chromatographic peak detection always reaches 100% before the next mass trace detection ensues. CPU usage for the thread holds at 4%, i.e., roughly a 22nd of the total processing power. Neither nSlaves = 22 nor sleep = 0 helps (either alone or in combination). I get the same result with nSlaves = 3.

By contrast, when I run this under Linux on a two-core, two-thread-per-core physical machine, I get three cores engaged and chromatographic peak detection overlaps with mass trace detection, e.g.:

Detecting mass traces at 2.5 ppm ... % finished: 0 10 20 30 40 50 60 70 80 90 100 1300 m/z ROI's.

Detecting chromatographic peaks ... % finished: 0 Detecting mass traces at 2.5 ppm ... % finished: 0 10 20 30 40 50 60 70 80 90 100 4751 m/z ROI's.

Detecting chromatographic peaks ... % finished: 0 10 20 10 30 Detecting mass traces at 2.5 ppm ... % finished: 0 10 20 30 40 50 60 70 80 90 40 100 3703 m/z ROI's.

Detecting chromatographic peaks ... % finished: 0 20 50 60 10 30 70 20 40 80 30 50 60 40 70 50 90 80 90 60 100 315 Peaks. 70 80 90 100 861 Peaks. 100 800 Peaks.

Perhaps there is something else that I should try differently. Could this possibly be an issue with the Bioconductor build of XCMS for Windows?

jorainer commented 8 years ago

Could you please provide the output of sessionInfo()? In the devel branch of xcms we switched from the old parallel processing setup (which was quite cumbersome in xcms) to BiocParallel, i.e. BiocParallel takes care of the correct parallel processing setup (whether snow, Rmpi or parallel is used), which can be configured system-wide.

If you're using the release branch you might still be on the old setup. You may be missing one of the required packages for parallel processing. Try installing snow, parallel and Rmpi on your Windows machine. I can't remember which one, but only one of those works on Windows, so don't be surprised if not all are available or can be installed:

library(BiocInstaller)
biocLite(c("snow", "parallel", "Rmpi"))
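Before installing anything, it may help to see which of these packages R can already find and which are attached in the current session. The following is a minimal sketch using base R only; the `status` data frame and its column names are illustrative choices, not part of xcms.

```r
# Candidate backends named in this thread.
backends <- c("Rmpi", "snow", "parallel")

# system.file() returns "" for a package that is not installed, so this
# check has no side effects (nothing is loaded or attached).
status <- data.frame(
  package   = backends,
  installed = vapply(backends,
                     function(p) nzchar(system.file(package = p)),
                     logical(1)),
  attached  = paste0("package:", backends) %in% search(),
  row.names = NULL
)
print(status)

# Number of logical cores R detects; may be NA on some platforms.
cat("Cores visible to R:", parallel::detectCores(), "\n")
```

If the reported core count is much lower than expected on a virtual machine, the limitation is likely in the VM configuration rather than in R.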

eschen42 commented 8 years ago

sessionInfo() revealed that "parallel" was already attached; as you said, apparently it is not effective for XCMS.

I did find that adding library("snow") to my script did result in parallel processing, e.g.:

Starting snow cluster with 12 sockets
Detecting features in file # 1: foo.mzXML
Detecting features in file # 2: bar.mzXML
etc.
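Independently of xcms, one way to confirm that socket workers can start and run concurrently on a given machine is a small timing check. This sketch assumes only the base `parallel` package, whose PSOCK clusters are derived from snow's socket clusters; the worker count of 2 is arbitrary.

```r
library(parallel)

n_workers <- 2
cl <- makePSOCKcluster(n_workers)

# Each worker is a separate R process, so the PIDs should differ.
pids <- unlist(clusterCall(cl, Sys.getpid))

# One one-second task per worker: if the workers run concurrently, the
# elapsed time should be close to one second rather than n_workers seconds.
elapsed <- system.time(
  clusterApply(cl, seq_len(n_workers), function(i) Sys.sleep(1))
)[["elapsed"]]

stopCluster(cl)

cat("Worker PIDs:", pids, "\n")
cat("Elapsed seconds for", n_workers, "one-second tasks:", elapsed, "\n")
```

If this check fails or runs serially, the problem lies below xcms (firewall, VM settings, or the R installation) and loading snow is unlikely to help.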

Page 81 of the package manual http://bioconductor.org/packages/release/bioc/manuals/xcms/man/xcms.pdf documents nSlaves as

nSlaves - number of slaves/cores to be used for parallel peak detection. MPI is used if installed, otherwise the snow package is employed for multicore support. If none of the two packages is available it uses the parallel package for parallel processing on multiple CPUs of the current machine.

Perhaps this could be very slightly more explicit for the naive user, e.g.:

nSlaves - number of slaves/cores to be used for parallel peak detection. Requires at least one of the additional libraries: Rmpi, snow, parallel. If several are loaded, the order of preference is: Rmpi > snow > parallel.
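The proposed wording can be read as a simple first-match rule over the attached packages. The sketch below only illustrates that rule; `pick_backend` is a hypothetical helper written for this example and is not a function in xcms, whose actual selection logic may differ.

```r
# Hypothetical helper: return the first package in `preference` that also
# appears in `available`, or NA if none does.
pick_backend <- function(available,
                         preference = c("Rmpi", "snow", "parallel")) {
  hit <- preference[preference %in% available]
  if (length(hit) == 0) NA_character_ else hit[[1]]
}

# Packages currently attached in this session (search path entries that
# start with "package:").
attached <- sub("^package:", "", grep("^package:", search(), value = TRUE))

cat("Backend under the proposed preference order:",
    pick_backend(attached), "\n")
```

Under this reading, a session with only parallel attached falls through to the last option, which matches the situation described above where parallel was attached but had no visible effect.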

Thank you very much for your quick response!

jorainer commented 8 years ago

Thanks for your suggestion. Note, however, that the use of nSlaves is deprecated in the next release. As noted above, we'll switch to BiocParallel for parallel processing; I'll try to enhance the documentation and possibly add a specific section to the vignette.
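For readers unfamiliar with BiocParallel, a backend is typically registered once and then picked up by functions that support it. The following is a minimal sketch of that pattern with a toy task; it assumes BiocParallel is installed, the worker count is arbitrary, and how a particular xcms version consumes the registered backend should be checked in its own documentation.

```r
library(BiocParallel)

# A socket-based backend, which is the kind that works on Windows.
param <- SnowParam(workers = 2, type = "SOCK")

# Make it the default returned by bpparam() for this session.
register(param)

# Toy task standing in for per-file processing.
squares <- bplapply(1:4, function(x) x^2, BPPARAM = param)
print(unlist(squares))
```

On Linux or macOS, MulticoreParam() is the forking alternative; it does not provide parallelism on Windows, which is why a socket backend is used here.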