blueskypie opened this issue 6 years ago
Hi, thanks for sharing the benchmarking results. Under the hood IPO uses xcms and re-creates the xcmsSet, reading in all mzML files each time. I can't see a simple way to keep all raw data files in memory. It would be possible and I'd have suggestions if you'd ask. Second, the parameters are optimised with https://cran.r-project.org/web/packages/rsm/index.html, and that package does not do a simple grid-search parameter scan within the boundaries you give. It starts with parameters inside the boundaries you give and does something a bit cleverer than gradient descent. Nevertheless, if the fitness gradient clearly points outside your limits, rsm will extend its search in that direction.
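For illustration only, here is a toy sketch of the response-surface idea rsm is built around; this is not IPO's actual code, and the two coded factors and the made-up fitness function are just placeholders. It evaluates a design inside the given bounds, fits a second-order surface, and the fitted optimum can well lie outside the initial coded region, which is why the search can move beyond your limits.

```r
library(rsm)

## Central-composite design in two coded factors, roughly the kind of design
## IPO sets up for the parameters it optimises in one round.
design <- ccd(2, n0 = 1, randomize = FALSE)

## Toy fitness stand-in. In IPO each design point means one full xcms run,
## which is where the repeated reading of the mzML files happens.
set.seed(1)
design$fitness <- with(design, 10 - (x1 - 0.5)^2 - (x2 + 1.2)^2) +
  rnorm(nrow(design), sd = 0.1)

## Fit a second-order response surface; the stationary point of the fitted
## surface (here near x1 = 0.5, x2 = -1.2) is where the next round centres,
## even if that lies outside the initial [-1, 1] region.
fit <- rsm(fitness ~ SO(x1, x2), data = design)
summary(fit)
```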
Thanks so much @sneumann for the quick response! Are you the author of XCMS? Thanks for creating the package!
I don't quite understand
"I can't see a simple way to keep all raw data files in memory."
Could each mzML file be read into an xcmsRaw object and kept in memory?
"It would be possible and I'd have suggestions if you'd ask."
Do you mean I should submit a ticket in the XCMS GitHub repository?
If the I/O can be avoided, IPO would be 40+ times faster in my case.
Hi, the author of XCMS was Colin Smith at Scripps, and there have been a lot of contributors since then. I am "only" maintaining it at the moment.
So one COULD imagine reading all mzML files into a list of xcmsRaw objects (does that fit into your memory?!) and then passing that list through to the xcms peak picking. The problem is that if you run in parallel, each process would get a copy of ALL that data, and memory use would multiply, so some smart shortcuts would be required. A dedicated R hacker could do that in a few weeks, but we don't have that planned.
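Roughly, such a sketch could look like the following, assuming enough RAM; the file path and centWave settings are placeholders, and this is not something IPO currently supports:

```r
library(xcms)

## Read every mzML file once into memory as an xcmsRaw object.
files <- list.files("mzML", pattern = "\\.mzML$", full.names = TRUE)
raws  <- lapply(files, xcmsRaw, profstep = 0)

## Rough check whether that actually fits into memory.
print(object.size(raws), units = "auto")

## Peak picking directly on the in-memory objects; parallelising this
## naively would copy the whole 'raws' list into every worker, which is
## the multiplication problem mentioned above.
peaks <- lapply(raws, findPeaks.centWave,
                ppm = 5, peakwidth = c(5, 20), prefilter = c(3, 5000))
```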
Yours, Steffen
Thanks again @sneumann! In my case, since I use Google Cloud, I can request as much memory as I need. I think most people who work with MS data regularly would have 30 GB of memory.
As for reducing the memory requirement for parallel processing, I wonder whether Spark could be used where available. But even on a single core, it would be much faster to keep all raw files in memory than the current setup.
Alternatively, each raw file could be saved as an R object; reading a serialized R object back is much faster than re-parsing an mzML (XML) file.
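Something like this hypothetical caching step is what I have in mind (the directory layout and profstep are just placeholders):

```r
library(xcms)

## Parse each mzML once, serialize the resulting xcmsRaw object, and
## re-load the binary .rds on later runs instead of re-parsing the XML.
files <- list.files("mzML", pattern = "\\.mzML$", full.names = TRUE)

for (f in files) {
  cache <- sub("\\.mzML$", ".rds", f)
  if (!file.exists(cache)) {
    saveRDS(xcmsRaw(f, profstep = 0), cache)
  }
}

## Later runs read the cached objects, which is typically much faster
## than parsing mzML, at the cost of extra disk space.
raws <- lapply(sub("\\.mzML$", ".rds", files), readRDS)
```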
Hi, I'm running IPO on 79 samples from a Dionex UltiMate 3000 UPLC coupled to a Thermo Scientific Q Exactive Orbitrap MS. Here are my IPO parameter settings; unlisted parameters are the same as the defaults. For optimizeXcmsSet: min_peakwidth = c(3,6), max_peakwidth = c(15,30), ppm = c(2.5,5), value_of_prefilter = 5000. For optimizeRetGroup: gapExtend = 2.7, bw = c(1,5), mzwid = c(0.01,0.03), minfrac = 0.9. A sketch of how these map onto IPO's parameter lists follows below.
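This is roughly how I set them up, following IPO's getDefault*StartingParams / optimize* interface; the file path is a placeholder and the rest of the settings are left at their defaults:

```r
library(IPO)

## Placeholder path to the negative-mode mzML files.
files <- list.files("mzML_neg", pattern = "\\.mzML$", full.names = TRUE)

## Peak-picking parameter ranges for optimizeXcmsSet.
peakpickingParams <- getDefaultXcmsSetStartingParams("centWave")
peakpickingParams$min_peakwidth      <- c(3, 6)
peakpickingParams$max_peakwidth      <- c(15, 30)
peakpickingParams$ppm                <- c(2.5, 5)
peakpickingParams$value_of_prefilter <- 5000

resultPeakpicking <- optimizeXcmsSet(files = files, params = peakpickingParams)

## Retention-time correction / grouping parameters for optimizeRetGroup.
retcorGroupParams <- getDefaultRetGroupStartingParams()
retcorGroupParams$gapExtend <- 2.7
retcorGroupParams$bw        <- c(1, 5)
retcorGroupParams$mzwid     <- c(0.01, 0.03)
retcorGroupParams$minfrac   <- 0.9

resultRetcorGroup <- optimizeRetGroup(
  xset   = resultPeakpicking$best_settings$xset,
  params = retcorGroupParams)
```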
Each negative-mode mzML file is about 170 MB, and each positive-mode file is about 370 MB. I ran IPO on the negative-mode files on my Windows laptop with an SSD, and on the positive-mode files on Google Cloud with a regular disk. Both machines have 8 cores and 30 GB RAM. Here are the processing stats; those for positive mode were collected using /usr/bin/time. pp: optimizeXcmsSet; rtg: optimizeRetGroup; rtg for the positive-mode files is not done yet.
Two questions:
Thanks!