rietho / IPO

A Tool for automated Optimization of XCMS Parameters
http://bioconductor.org/packages/IPO/

Optimization time issue #39

Closed Titan100 closed 7 years ago

Titan100 commented 8 years ago

Hello @sneumann @rietho @glibiseller, I am trying to optimize xcms parameters using IPO on 18.5 GB of .mzXML files. It is taking forever to get a result. I have an i7-4790 CPU @ 3.60GHz, 32 GB RAM and a 64-bit operating system.

Is there any way to get it done faster? Below is the command that I used to run the program.

working_dir = ("C://Users/Lipidomics/Desktop/Tularemia_untar_pos_mzXML/")
setwd(working_dir)

library(xcms)
library(IPO)
peakpickingParameters <- getDefaultXcmsSetStartingParams('centWave')
peakpickingParameters$min_peakwidth <- c(10,20)
peakpickingParameters$ppm <- 5
resultPeakpicking <- optimizeXcmsSet(files=working_dir, params=peakpickingParameters, nSlaves=4, subdir='rsmDirectory')

starting new DoE with:
min_peakwidth: c(10, 20)
max_peakwidth: c(35, 65)
ppm: 5
mzdiff: c(-0.001, 0.01)
snthresh: 10
noise: 0
prefilter: 3
value_of_prefilter: 100
mzCenterFun: wMean
integrate: 1
fitgauss: FALSE
verbose.columns: FALSE
nSlaves: 1

Using PSOCK type cluster, this increases memory requirements. Reduce number of slaves if your have out of memory errors.

Exporting variables to cluster...

Thank you for your help.

Titan

rietho commented 8 years ago

Hello Titan!

So the output above is the last output you saw before cancelling the calculations? How long did it take to get a result? Or, in the case of cancelling before getting a result: how long did you wait before cancelling?

In general, the bottleneck in time is xcms itself. In your case specifically, the call of xcmsSet probably needs some time. Your code starts an optimization of three parameters, which makes IPO run xcmsSet 17 times for a single DoE. Those 16+1 runs are the result of an efficient optimization approach. You started a calculation with 4 IPO slaves, so the 16 xcmsSet calculations run in parallel (up to four at a time).

Do you know how long a single xcmsSet call needs with your data on your computer? That would be interesting. The first thing you can do is increase the number of xcms slaves by setting peakpickingParameters$nSlaves, which should make each single xcmsSet call faster. Of course you can also increase the number of IPO slaves. Please be reminded that the number of cores needed on your computer is the number of IPO slaves multiplied by the number of xcms slaves.
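
The core budget described above can be checked with a few lines of base R. This is only a minimal sketch: `cores_needed` and `fits_machine` are hypothetical helper names, not IPO or xcms functions, and `parallel::detectCores()` reports logical cores by default, which may be more than the physical cores.

```r
library(parallel)

# Hypothetical helper (not part of IPO or xcms):
# total workers = IPO slaves multiplied by xcms slaves.
cores_needed <- function(ipo_slaves, xcms_slaves) {
  stopifnot(ipo_slaves >= 1, xcms_slaves >= 1)
  ipo_slaves * xcms_slaves
}

# Hypothetical helper: does the requested setup fit on this machine?
fits_machine <- function(ipo_slaves, xcms_slaves,
                         available = parallel::detectCores()) {
  # detectCores() may return NA on some platforms; report "unknown" then
  if (is.na(available)) return(NA)
  cores_needed(ipo_slaves, xcms_slaves) <= available
}

cores_needed(4, 1)                  # 4 IPO slaves, 1 xcms slave each
cores_needed(4, 2)                  # 4 IPO slaves, 2 xcms slaves each: 8 workers
fits_machine(4, 2, available = 8)   # TRUE
fits_machine(4, 4, available = 8)   # FALSE, 16 workers requested
```

Keep in mind that every additional worker also needs memory for its own copy of the data, so fitting into the core count does not guarantee fitting into RAM.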

hth Thomas

Titan100 commented 8 years ago

Hey Rietho, I did not have any idea how long I should wait, so I just terminated after a few hours. I tried on my laptop the other day, but it did not show me a result even after running overnight. However, I got a result for other data (4 mzXML files) a few weeks back. That took a long time too. I tried setting nSlaves according to your suggestion. It is still taking long; it has already been about 4 hours now. I have a couple of questions:
1) Can I use a few mzXML files out of the 50 files for parameter optimization with IPO? I was wondering if the parameters optimized for a few files should be representative for the other data files too (the data files being acquired at the same time using the same machine and the same sample).
2) Is that normal, or am I missing something to get the work done faster?

lecorguille commented 8 years ago

Hi,

I was just following this thread and I'm just wondering if it's interesting to launch

If my comment is completely trivial, accept my apology. I'm a really new "user" of IPO.

rietho commented 8 years ago

@Titan100

About xcmsSet running times

To find out more about the running time of xcmsSet, you can set nSlaves within optimizeXcmsSet as well as peakpickingParameters$nSlaves to 1. These settings will let xcms print information about each run. You should see output like the following

Detecting mass traces at 20 ppm ... 
 % finished: 0 10 20 30 40 50 60 70 80 90 100 
 7068 m/z ROI's.

 Detecting chromatographic peaks ... 
 % finished: 0 10 20 30 40 50 60 70 80 90 100 
 3644  Peaks.

for each peak detection for each file. The numbers after ' % finished' show up as the calculation runs. When running IPO there will also be lines with single numbers. These numbers indicate the number of xcmsSet calls within a DoE, and thus go up to 16 in your case.

About your questions

  1. Good point. Of course the use case of IPO is to run the IPO optimization process on a set of training data. Usually we use data from pooled samples that are analyzed on each measurement sequence. Our experience is that 5 to 10 files should be enough to run with IPO for a data set of 50 analyses.
  2. It is normal that calculations run for several hours.
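
The training-set idea from point 1 could be sketched in base R as follows. This is an illustration under assumptions: `pick_training_files` is a hypothetical helper (not an IPO function), it picks files at random rather than selecting pooled samples, and the commented call assumes that the `files` argument of optimizeXcmsSet accepts a vector of file paths as well as a directory.

```r
# Hypothetical helper: pick a reproducible subset of raw files for IPO.
pick_training_files <- function(dir, n = 5,
                                pattern = "\\.(mzXML|mzML)$", seed = 1) {
  files <- list.files(dir, pattern = pattern,
                      full.names = TRUE, ignore.case = TRUE)
  # Nothing to subsample if there are n files or fewer
  if (length(files) <= n) return(files)
  set.seed(seed)            # same seed, same subset
  sort(sample(files, n))
}

# Usage (commented out because it needs IPO, xcms and real data):
# training_files <- pick_training_files(working_dir, n = 8)
# resultPeakpicking <- optimizeXcmsSet(files = training_files,
#                                      params = peakpickingParameters,
#                                      nSlaves = 4, subdir = 'rsmDirectory')
```

If pooled samples are available, listing those files by name instead of sampling at random would be closer to what is described above.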

rietho commented 8 years ago

@lecorguille Thank you for your input. Inputs and new ideas are always welcome :smiley:

I'm not sure if I understood your comment correctly. Nevertheless, I'll try to respond: as pointed out, IPO is intended to be used on a training set. The published paper by my colleagues (see http://www.biomedcentral.com/1471-2105/16/118) studied how well the IPO results for the training data worked out for the whole data set. Parameters like min_peakwidth and max_peakwidth are set by the default values to a standard range. The problem with too large a range is that IPO would not be capable of giving a reasonable estimation for the whole range. IPO uses DoE (design of experiments) methods to estimate the response over the range. Specifically, the central-composite design is used, which results in testing the outer limits as well as the middle point for each parameter. Thus too large a range would be misleading.
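
To illustrate why a wide range gives a coarse picture, here is a minimal base-R sketch of the three levels tested per parameter (outer limits and middle point). `design_levels` is a hypothetical helper for illustration only; it does not reproduce IPO's actual design code.

```r
# Hypothetical helper: the three levels tested for one parameter range.
design_levels <- function(range) {
  stopifnot(is.numeric(range), length(range) == 2, range[1] < range[2])
  c(lower = range[1], center = mean(range), upper = range[2])
}

design_levels(c(10, 20))    # min_peakwidth: 10, 15, 20
design_levels(c(35, 65))    # max_peakwidth: 35, 50, 65
design_levels(c(10, 200))   # very wide range: only 10, 105 and 200 are tested
```

With the wide range, nothing between 10 and 105 is ever evaluated, so a model fitted to those three levels says little about the values in between.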

Titan100 commented 8 years ago

Thank you all.

Titan100 commented 8 years ago

Hello @lecorguille , @rietho , @sneumann , @glibiseller

It looks like IPO is running well. However I got warning as below:

Detecting mass traces at 1 ppm ...
 % finished: 0 10 20 30 40 50 60 70 80 90 100
Warning: There were 1065223 peak data insertion problems.
Please try lowering the "ppm" parameter.
 333603 m/z ROI's.

 Detecting chromatographic peaks ...
 % finished: 0 10 20 30 40 50 60 70 80 90 100
 97687 Peaks.

Detecting mass traces at 1 ppm ...
 % finished: 0 10 20 30 40 50 60 70 80 90 100
Warning: There were 1065223 peak data insertion problems.
Please try lowering the "ppm" parameter.
 333603 m/z ROI's.

My questions are:
1) Are these warnings normal?
2) As suggested, I decreased ppm from 5 down to 1 and there are still warnings.
3) Does that mean I need to set my ppm below 1?
4) Some lines report ROIs (for example: 333603 m/z ROI's) but others do not (for example: 97687 Peaks); see the output above. Is that a serious issue I should consider?

Thank you for your answer.

Titan100 commented 8 years ago

Here is the command I used for running the IPO. IPO.txt

sneumann commented 8 years ago

This large number of "peak insertion problems" usually indicates that you have profile mode data, and in that case modifying the ppm parameter won't help. The centWave algorithm depends on the MS raw data being centroided. You can often achieve that at the conversion step to e.g. mzML in ProteoWizard msconvert. This will also reduce file sizes and runtimes. Yours, Steffen

Titan100 commented 8 years ago

Hey Steffen, I am running IPO on centroided data. As suggested elsewhere, I converted profile (.raw) to centroid (.mzML) mode using msconvert. While converting to centroid mode, I chose "peak picking" parameter and converted to .mzML format.

Titan100 commented 8 years ago

image

Here is the screenshot of the parameters I used for file conversion using msconvert.

mheiser-md commented 8 years ago

Click on "Add" below the filters to actually apply the peak picking filter.

Titan100 commented 8 years ago

image

Thanks. I just corrected it. Hopefully this works.

Titan100 commented 8 years ago

I still have data insertion problems after converting to centroid mode:

Detecting mass traces at 4 ppm ...
 % finished: 0 10 20 30 40 50 60 70 80 90 100
Warning: There were 3548 peak data insertion problems.
Please try lowering the "ppm" parameter.
 40953 m/z ROI's.

 Detecting chromatographic peaks ...
 % finished: 0 10 20 30 40 50 60 70 80 90 100
 13082 Peaks.

Detecting mass traces at 4 ppm ...
 % finished: 0 10 20 30 40 50 60 70 80 90 100
Warning: There were 4354 peak data insertion problems.
Please try lowering the "ppm" parameter.
 39225 m/z ROI's.

 Detecting chromatographic peaks ...
 % finished: 0 10 20 30 40 50 60 70 80 90 100
 12528 Peaks.

sneumann commented 8 years ago

So which MS instrument are you using? Could you share one mzML file from that setup? It doesn't have to be a real lipidomics one; standards or even a rinse would be fine. Bonus points if it is small (<100MB). Yours, Steffen

Titan100 commented 8 years ago

Hey @sneumann, I have shared two files (profile and centroid) through Dropbox. I used a Thermo HF Orbi for acquiring the data. Please see your email.

sneumann commented 8 years ago

Can I ask for the mzML instead of the *.raw please? Thanks Steffen

Titan100 commented 8 years ago

@sneumann Shared.

sneumann commented 8 years ago

Hi,

if you run

library(xcms)
xr <- xcmsRaw("centroid.mzML")
plotRaw(xr, log=TRUE)
plotRaw(xr, mzrange=c(805, 810), rtrange=c(300,320), log=TRUE)

you see that /some/ of the mass traces have very close-by "satellites".

screenshot from 2016-05-20 21-55-50

These peak pairs make up the insertion problems. They are also visible in the profile mode data. Can you find out if that is indeed a different lipid with a very similar m/z, and not just some artefact?

xrprofile <- xcmsRaw("/home/sneumann/Downloads/profile.mzML")
plotRaw(xrprofile, mzrange=c(805, 810), rtrange=c(300,320), log=TRUE)
plotScan(xrprofile, 550, mzrange=c(807,808))

screenshot from 2016-05-20 22-28-40

screenshot from 2016-05-20 22-36-39

You need to check yourself how badly this affects the peak picking. For this you could overlay the picked peaks on the raw image (disclaimer: parameters not optimised!):

p <- findPeaks(xr, method="centWave", ppm=5)
plotRaw(xr, mzrange=c(800, 850), rtrange=c(300,350), log=TRUE)
points(p@.Data[,c("rt","mz")])

It would be great if you could report back your findings.

Yours, Steffen

rietho commented 7 years ago

@Titan100 any updates?

rietho commented 7 years ago

I'll close this issue, as there have been no updates for several months. If there is any news, you're welcome to reopen the issue.