ways to improve IPO performance #55

Open blueskypie opened 6 years ago

blueskypie commented 6 years ago

Hi, I'm running IPO on 79 samples from a Dionex UltiMate 3000 UPLC with a Thermo Scientific QExactive Orbitrap MS instrument. Here are my IPO parameter settings; unlisted parameters are left at their defaults. For optimizeXcmsSet: min_peakwidth = c(3,6), max_peakwidth = c(15,30), ppm = c(2.5,5), value_of_prefilter = 5000. For optimizeRetGroup: gapExtend = 2.7, bw = c(1,5), mzwid = c(0.01,0.03), minfrac = 0.9.

Each negative-scan mzML file is about 170 MB, and each positive-scan file is about 370 MB. I'm running IPO on the negative-scan files on my Windows laptop with an SSD, and on the positive-scan files in Google Cloud with a regular disk. Both machines have 8 cores and 30 GB of RAM. Here are the processing stats; those for the positive-scan files were collected using /usr/bin/time.

[image: capture]

pp: optimizeXcmsSet; rtg: optimizeRetGroup; rtg for the positive-scan files is not done yet.

%P CPU%: Percentage of the CPU that this job got, computed as (%U + %S) / %E.
%M RAM(K): Maximum resident set size of the process during its lifetime, in Kbytes.
%I inputs: Number of file system inputs by the process.
%O outputs: Number of file system outputs by the process.
%F majorPF: Number of major page faults that occurred while the process was running. These are faults where the page has to be read in from disk.
%R minorPF: Number of minor, or recoverable, page faults. These are faults for pages that are not valid but which have not yet been claimed by other virtual pages. Thus the data in the page is still valid but the system tables must be updated.
%W swaps: Number of times the process was swapped out of main memory.
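For anyone who wants to collect the same counters without /usr/bin/time, here is a minimal Python sketch. It assumes a Unix-like system (the standard-library `resource` module is not available on Windows), and `run_and_measure` is a hypothetical helper name, not part of IPO or xcms. The fields map onto the format specifiers above: `ru_maxrss` (%M), `ru_inblock` (%I), `ru_oublock` (%O), `ru_majflt` (%F), `ru_minflt` (%R), `ru_nswap` (%W).

```python
import resource
import subprocess
import sys


def run_and_measure(cmd):
    """Run cmd to completion and report the resource usage of child processes."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    subprocess.run(cmd, check=True)
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return {
        # High-water mark over all waited-for children, not a delta.
        # Reported in kilobytes on Linux (bytes on macOS).
        "max_rss": after.ru_maxrss,
        # The remaining counters accumulate, so subtract the earlier snapshot.
        "inputs": after.ru_inblock - before.ru_inblock,
        "outputs": after.ru_oublock - before.ru_oublock,
        "major_pf": after.ru_majflt - before.ru_majflt,
        "minor_pf": after.ru_minflt - before.ru_minflt,
        "swaps": after.ru_nswap - before.ru_nswap,
    }


# Example: measure a trivial child process. Replace the command with the
# real job (for instance an Rscript call) to measure that instead.
print(run_and_measure([sys.executable, "-c", "print('hello')"]))
```

A large `inputs` count next to a low CPU percentage is the pattern that suggests the job is waiting on disk rather than computing.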

Two questions:

  1. It seems that IPO spends most of its time on I/O, only a tiny amount on computation, and uses very little RAM. Is it possible to keep objects in RAM to save the I/O time?
  2. Some of the IPO-optimized parameters fall outside the ranges I set; for example, peakwidth=c(10,42), whereas my settings are min_peakwidth=c(3,6), max_peakwidth=c(15,30). Does this mean IPO's parameter selection is not limited to the original ranges?


sneumann commented 6 years ago

Hi, thanks for sharing the benchmarking results. Under the hood IPO uses xcms, and re-creates the xcmsSet each time, reading in all mzML files again. I can't see a simple way to keep all raw data files in memory. It would be possible and I'd have suggestions if you'd ask.

Second, the parameters are optimised with https://cran.r-project.org/web/packages/rsm/index.html, and that package is not doing a simple grid-search parameter scan within the boundaries you give. It certainly starts with parameters inside the boundaries you give, and does something a bit more clever than gradient ascent/descent. Nevertheless, if the fitness gradient nicely points outside your limits, rsm will extend its search in that direction.
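To illustrate how a search that starts inside the given ranges can end up outside them, here is a toy Python sketch. It is not the code of rsm or IPO: it uses a plain first-order (steepest-ascent) step where rsm fits richer models, and the `fitness` function, its optimum, and the starting box are made up for the example.

```python
from itertools import product


def fitness(ppm, peakwidth):
    # Hypothetical score whose optimum (ppm=8, peakwidth=40) lies outside
    # the starting ranges ppm in [2.5, 5] and peakwidth in [15, 30].
    return -((ppm - 8.0) ** 2) - 0.05 * (peakwidth - 40.0) ** 2


def first_order_slopes(center, half_width, score):
    """Estimate one slope per factor from a two-level factorial design.

    The design points are the corners of the box, i.e. -1/+1 in coded units.
    """
    corners = list(product((-1, 1), repeat=len(center)))
    scores = [
        score(*(c + s * h for c, s, h in zip(center, signs, half_width)))
        for signs in corners
    ]
    return [
        sum(signs[i] * y for signs, y in zip(corners, scores)) / len(corners)
        for i in range(len(center))
    ]


def steepest_ascent(center, half_width, score, rounds=10):
    """Repeatedly move the box centre uphill; nothing clamps it to the start box."""
    center = list(center)
    for _ in range(rounds):
        slopes = first_order_slopes(center, half_width, score)
        biggest = max(abs(b) for b in slopes)
        if biggest == 0:
            break
        # The steepest factor moves by one half-width, the others proportionally.
        candidate = [
            c + h * b / biggest for c, h, b in zip(center, half_width, slopes)
        ]
        if score(*candidate) <= score(*center):
            break  # no further improvement along the estimated gradient
        center = candidate
    return center


# Start in the middle of the user-supplied ranges.
best = steepest_ascent(center=(3.75, 22.5), half_width=(1.25, 7.5), score=fitness)
print("final centre (ppm, peakwidth):", best)
```

Because the made-up score keeps improving past ppm = 5 and peakwidth = 30, the final centre lands outside the starting ranges, which is the behaviour described above.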

blueskypie commented 6 years ago

Thanks @sneumann so much for the quick response! Are you the author of XCMS? Thanks for creating the package!

I don't quite understand

"I can't see a simple way to keep all raw data files in memory."

Could each mzML file be read into an xcmsRaw object and kept in memory?

"It would be possible and I'd have suggestions if you'd ask."

Do you mean I should open an issue in the XCMS GitHub repository?

If the I/O time could be eliminated, IPO would be 40+ times faster in my case.

sneumann commented 6 years ago

Hi, the original author of XCMS was Colin Smith at Scripps, with a lot of contributors since then. I am "only" maintaining it at the moment.

So one COULD imagine reading all mzML files into a list of xcmsRaw objects (does that fit in your memory?!) and then passing that list through to the xcms peak picking. The problem is that if you run in parallel, each process would get a copy of ALL that data, so the memory use would multiply. So some smart shortcuts would be required. A dedicated R hacker could do that in a few weeks, but we don't have that planned.
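The idea and its cost can be sketched in a few lines of Python. This is only an illustration of the caching pattern, not xcms or IPO code: `RawDataCache`, `load_bytes` (a stand-in for parsing an mzML file), `run_experiments`, and `naive_parallel_footprint_mb` are hypothetical names, and the footprint estimate uses the on-disk file sizes quoted in this thread, which may differ from the in-memory size of the parsed objects.

```python
import os
import tempfile


class RawDataCache:
    """Load each file once and serve later requests from memory."""

    def __init__(self, loader):
        self._loader = loader
        self._cache = {}
        self.disk_reads = 0

    def get(self, path):
        if path not in self._cache:
            self._cache[path] = self._loader(path)
            self.disk_reads += 1
        return self._cache[path]


def load_bytes(path):
    # Stand-in for an expensive parse of a raw data file.
    with open(path, "rb") as fh:
        return fh.read()


def run_experiments(paths, cache, n_experiments):
    """Touch every file once per experiment, as a repeated optimisation run would."""
    total = 0
    for _ in range(n_experiments):
        for path in paths:
            total += len(cache.get(path))
    return total


def naive_parallel_footprint_mb(n_files, mb_per_file, n_workers):
    """Memory needed if every worker process holds its own full copy."""
    return n_files * mb_per_file * n_workers


with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i in range(3):
        path = os.path.join(tmp, f"sample{i}.bin")
        with open(path, "wb") as fh:
            fh.write(b"x" * 1000)
        paths.append(path)
    cache = RawDataCache(load_bytes)
    run_experiments(paths, cache, n_experiments=20)
    print("disk reads:", cache.disk_reads)

print("naive footprint (MB):", naive_parallel_footprint_mb(79, 370, 8))
```

With the numbers from this thread, 79 positive-scan files at about 370 MB each come to roughly 29 GB on disk, and eight workers each holding a full copy would need on the order of 234 GB, which is why a shared or partitioned layout would be needed for the parallel case.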

Yours, Steffen

blueskypie commented 6 years ago

Thanks again @sneumann! In my case, since I use Google Cloud, I can request as much memory as I need. I think most people who often deal with MS data would have 30 GB of memory.

In terms of reducing the memory requirement for parallel processing, I just wonder if Spark could be used where available. But even on a single core, it'd be much faster to have all raw files in memory than with the current setup.

blueskypie commented 6 years ago

Or each raw file could be saved as an R object; it'd be much faster to read an R object than a text file.
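The pattern suggested here (parse once, store a binary copy, reload the binary copy afterwards) can be sketched as follows. In R the natural tools would be `saveRDS()` and `readRDS()`; the sketch below uses Python's `pickle` as an analogue, with a toy text format and hypothetical helper names (`parse_text_spectra`, `load_with_binary_cache`). It makes no claim about how large the speed-up would be for real mzML files.

```python
import os
import pickle
import tempfile


def parse_text_spectra(path):
    """Parse a toy text format with one 'mz intensity' pair per line."""
    with open(path) as fh:
        return [
            tuple(float(x) for x in line.split()) for line in fh if line.strip()
        ]


def load_with_binary_cache(path):
    """Return parsed data, reusing a pickled copy when it is up to date."""
    cache_path = path + ".pkl"
    if os.path.exists(cache_path) and (
        os.path.getmtime(cache_path) >= os.path.getmtime(path)
    ):
        with open(cache_path, "rb") as fh:
            return pickle.load(fh)
    data = parse_text_spectra(path)
    with open(cache_path, "wb") as fh:
        pickle.dump(data, fh, protocol=pickle.HIGHEST_PROTOCOL)
    return data


with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "spectrum.txt")
    with open(source, "w") as fh:
        fh.write("100.5 2000\n101.25 350\n")
    first = load_with_binary_cache(source)   # parses the text, writes the cache
    second = load_with_binary_cache(source)  # served from the binary cache
    print(first == second)
```

The cached copy still has to be read from disk on every run, so this reduces parsing cost rather than removing I/O altogether.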