Hello,
thanks for your request! A bit more info about your file and your script would be helpful, in order to give you proper advice.
First steps to speed things up could be to use more filters, e.g. checking the MS level. You can also directly access single spectra if you know their spec id, like run[ spec_id ].
Regards, the pymzML Team
On 12 Feb 2015, at 06:33, ajing notifications@github.com wrote:
For the loop like
for spectrum in run:

For my data, it will take about 30 min to finish. Is there any other way to speed up this step?
Thanks for the information! How do I create a filter before I do the loop?
What I was doing is:

for spectrum in run:
    if spectrum['ms level'] == 1:
Is there a simple way to extract the spectrum for a specific retention time?
Thanks
Hi,
mzML files have an index which can be used to jump directly to a given spectrum (if mzML is not gzipped). Thus if you know the spectrum id, then you can do something like
spec = run[ your_spec_Id_of_interest ]
rt = spec['scan time']
If you don't know the spectrum id, you could implement a bisection search (see e.g. http://en.wikipedia.org/wiki/Bisection_method). This might be a good thing to code for the next pymzML release, e.g. run.spectra_within_time_window( (t1, t2) ). A rough sketch:
s = 0
e = number_of_specs
while s < e:
    jump_point = (s + e) // 2                       # integer midpoint
    spec = run[ jump_point ]
    if spec['scan time'] < rt_of_interest:
        s = jump_point + 1
    elif spec['scan time'] > rt_of_interest:
        e = jump_point
    else:
        break  # found it, do_the_thing with spec
On a side note: if you do not decode any spectrum data (i.e. mz & intensity), then it is not possible to iterate over your files faster than

for spectrum in run:
    rt = spectrum['scan time']

Even if you use C++ or C implementations (see our publication), you might gain just a few percent in speed. You could, however, build up your own index by iterating once over the file and writing a dict[ rt ] = scan_id, as sketched below.
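A minimal, untested sketch of such a one-pass index, assuming spectrum['id'] gives you the native spectrum id:

rt_index = {}
for spectrum in run:
    # map retention time -> spectrum id for later direct access via run[ id ]
    rt_index[ spectrum['scan time'] ] = spectrum['id']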
hope that helps
Christian
Yes, you are right. I guess it is because I use hasPeak() for every spectrum, and that takes time at each step. I think I could add some heuristic to skip some spectra.
timeDependentIntensities = []
for spectrum in run:
    if spectrum['ms level'] == 1:
        matchList = spectrum.hasPeak(MASS_2_FOLLOW)
        if matchList != []:
            print("matched")
            for mz, I in matchList:
                if I > 100:
                    timeDependentIntensities.append( [ spectrum['scan time'], I, mz ] )
That's a heavy load, yes :) Currently there is no adaptive transforming of the spectrum. Thus, if one hasPeak() is issued, the whole spectrum is prepared for fast mz queries. Adaptive transforming would be another thing for our next release.
One thing you could do is to determine a small region of the centroidedPeaks array (using bisect), e.g. between your_mz - 1 and your_mz + 1, and set this array as the peaks of an empty spectrum. Then invoke hasPeak() on that new (small) spectrum. Between your "setting of peaks" and "querying hasPeak" you would have to call spectrum.clear(). Of course you could also re-define the original peaks, i.e. spectrum.peaks = spectrum.peaks[ index_of_your_mz_minus_one : index_of_your_mz_plus_one ], and then call hasPeak().
Not very elegant but should speed things up.
Cheers
.c
Is the following way right? Can spectrum.reduce() do the same thing?
for spectrum in run:
    spectrum.peaks = spectrum.peaks[ mzRange[0] : mzRange[1] ]
    spectrum.clear()
    matchList = spectrum.hasPeak(MASS_2_FOLLOW)
Is it possible to select a specific range of retention times? What I am doing now is to go through all spectra, collect all retention times and indices, and then extract the spectra for the retention times of interest.
Also, it looks like I need to reset the pointer each time I do for spectrum in run. Is there an easy way to go back to the beginning?
It could work like this, but first you would need to find the indices of your mzRange[i] with, e.g.
import bisect

# assuming spectrum.peaks is a sorted list of (mz, intensity) tuples,
# bisect with a one-element tuple so the comparison is tuple vs. tuple
lower_index = bisect.bisect( spectrum.peaks, (mzRange[0], ) )
upper_index = bisect.bisect( spectrum.peaks, (mzRange[1], ) )
spectrum.peaks = spectrum.peaks[ lower_index : upper_index ]
And if you re-define spectrum.peaks with a sliced version, then you wouldn't have to call clear(). So it should work :)
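Putting the pieces together, a rough (untested) sketch of the whole thing, assuming peaks is a sorted list of (mz, intensity) tuples and a +/- 1 m/z window around MASS_2_FOLLOW:

import bisect

for spectrum in run:
    if spectrum['ms level'] == 1:
        # restrict the peak list to a narrow m/z window before querying
        lower_index = bisect.bisect( spectrum.peaks, (MASS_2_FOLLOW - 1, ) )
        upper_index = bisect.bisect( spectrum.peaks, (MASS_2_FOLLOW + 1, ) )
        spectrum.peaks = spectrum.peaks[ lower_index : upper_index ]
        matchList = spectrum.hasPeak(MASS_2_FOLLOW)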
Thanks fu! Is it possible to do two loops? The outer loop goes through all specs and the inner loop only goes over a certain range of retention times, e.g.
for spec in run:
    for sub_spec in range(spec, spec_3s_later):
        do_something()
Could work, yes ... you might want to have a look at deque (https://docs.python.org/3/library/collections.html#collections.deque). You would have to use spec.deRef() in order to have real spectrum objects that you append to the queue.
pymzML normally does not initialise a spectrum object for each spectrum but uses the old object all the time, in order to be faster ...
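A rough, untested sketch of such a sliding retention-time window, assuming 'scan time' is reported in minutes and that spec.deRef() gives you a detached copy of the current spectrum object; do_something() is just a placeholder:

from collections import deque

RT_WINDOW = 3.0 / 60.0              # 3 seconds, assuming 'scan time' is in minutes
window = deque()                    # spectra within the last RT_WINDOW

for spec in run:
    rt = spec['scan time']
    window.append( spec.deRef() )   # detach from the re-used spectrum object
    while window and rt - window[0]['scan time'] > RT_WINDOW:
        window.popleft()            # drop spectra that fell out of the window
    do_something( window )          # placeholder for your per-window processing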