pymzml / pymzML

pymzML - an interface between Python and mzML Mass spectrometry Files
https://pymzml.readthedocs.io/en/latest/
MIT License

Is there any way to speed up the looping? #27

Closed ajing closed 8 years ago

ajing commented 9 years ago

For the loop like

for spectrum in run:

For my data, it will take about 30 min to finish. Is there any other way to speed up this step?

JB-MS commented 9 years ago

Hello,

thanks for your request! A bit more info about your file and your script would be helpful, in order to give you proper advice.

First steps to speed things up could be to use more filters, like checking the MS level. You can also directly access single spectra if you know their spec ID, e.g. run[ spec_id ].

Regards, the pymzML Team


ajing commented 9 years ago

Thanks for the information! How do I create a filter before the loop?

What I was doing is

for spectrum in run:
    if spectrum['ms level'] == 1:
        ...  # process MS1 spectra only

Is there a simple way to extract the spectrum for a specific retention time?

Thanks

fu commented 9 years ago

Hi,

mzML files have an index which can be used to jump directly to a given spectrum (as long as the mzML is not gzipped). Thus, if you know the spectrum ID, you can do something like

spec = run[ your_spec_Id_of_interest ]
rt = spec['scan time']

if you don't know the spectrum id, then you could implement a bisection method, see e.g. http://en.wikipedia.org/wiki/Bisection_method . This might be a good thing to code for the next pymzML release, e.g. run.spectra_within_time_window( (t1, t2) ). Pseudo code (this should just give you the idea; in practice you would stop once the window can no longer shrink instead of expecting an exact match):

s = 0
e = number_of_specs
while True:
      jump_point = (s + e) // 2  # integer midpoint between s and e
      spec = run[ jump_point ]
      if spec['scan time'] < rt_of_interest:
            s = jump_point
      elif spec['scan time'] > rt_of_interest:
            e = jump_point
      else:
            break  # found it - do_the_thing
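A runnable version of that bisection idea, sketched on a plain sorted list of retention times instead of a real pymzML run (the scan_times list, the function name, and all values are invented for illustration):

```python
import bisect

def index_near_rt(scan_times, rt_of_interest):
    """Binary-search a sorted list of retention times and return the
    index of the scan closest to rt_of_interest."""
    pos = bisect.bisect_left(scan_times, rt_of_interest)
    if pos == 0:
        return 0
    if pos == len(scan_times):
        return len(scan_times) - 1
    # pick the closer of the two neighbouring scans
    before, after = scan_times[pos - 1], scan_times[pos]
    return pos if after - rt_of_interest < rt_of_interest - before else pos - 1

# synthetic scan times (seconds), standing in for spec['scan time']
scan_times = [0.5, 1.1, 1.8, 2.4, 3.0, 3.7]
print(index_near_rt(scan_times, 2.5))  # -> 3 (closest to 2.4)
```

With a real file, `scan_times[i]` would be replaced by looking up `run[ i ]['scan time']`, which is what makes the binary search worthwhile: only O(log n) spectra are ever decoded.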

On a side note: if you do not decode any spectrum data (i.e. mz & intensity), then it is not possible to iterate over your files faster than

for spectrum in run:
      rt = spectrum['scan time']

Even if you use C++ or C implementations (see our publication), you might gain just a few percent in speed. You could, however, build up your own index by iterating once over the file and writing a dict[ rt ] = scan_id
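That one-pass index could look like this, as a minimal sketch using a synthetic list of (scan_id, rt) pairs in place of a real pass over run (all names and values here are made up for illustration):

```python
# synthetic (scan_id, retention time) pairs, standing in for one pass
# of `for spectrum in run:` reading spectrum['scan time']
scans = [(1, 0.5), (2, 1.1), (3, 1.8), (4, 2.4), (5, 3.0)]

# build the index once ...
rt_to_scan_id = {rt: scan_id for scan_id, rt in scans}

# ... then jump to a spectrum by retention time without re-iterating
print(rt_to_scan_id[1.8])  # -> 3
```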

hope that helps

Christian

ajing commented 9 years ago

Yes, you are right. I guess it's because I use hasPeak() for every spectrum, which takes time at each step. I think I could add some heuristic to skip some spectra.

    timeDependentIntensities = []
    for spectrum in run:
        if spectrum['ms level'] == 1:
            matchList = spectrum.hasPeak(MASS_2_FOLLOW)
            if matchList:
                print("matched")
                for mz, I in matchList:
                    if I > 100:
                        timeDependentIntensities.append([spectrum['scan time'], I, mz])
fu commented 9 years ago

That's a heavy load, yes :) Currently there is no adaptive transformation of the spectrum: as soon as one hasPeak() is issued, the whole spectrum is prepared for fast m/z queries. Nevertheless, adaptive transforming would be another thing for our next release.

One thing you could do is to determine a small region of the centroidedPeaks array (using bisect), e.g. your_mz - 1 to your_mz + 1, and set this array as the peaks of an empty spectrum. Then invoke hasPeak() on that new (small) spectrum. Between your "setting of peaks" and "querying hasPeak" you would have to call spectrum.clear(). Of course you could also re-define the original peaks, i.e. spectrum.peaks = spectrum.peaks[ index_of_your_mz_minus_one : index_of_your_mz_plus_one ], and then call hasPeak().

Not very elegant but should speed things up.

Cheers

.c

ajing commented 9 years ago

Is the following way right? Can spectrum.reduce() do the same thing?

for spectrum in run:
    spectrum.peaks = spectrum.peaks[ mzRange[0] : mzRange[1] ]
    spectrum.clear()
    matchList = spectrum.hasPeak(MASS_2_FOLLOW)

Is it possible to select a specific range of retention times? What I am doing now is to go through all spectra, collect every retention time and index, and then extract the spectra for the retention times of interest.

Also, it looks like I need to reset the iterator each time I do for spectrum in run. Is there an easy way to go back to the beginning?

fu commented 9 years ago

it could work like this, but first you would need to find the indices for your mzRange[i] with, e.g.

  import bisect

  # spectrum.peaks holds (mz, intensity) tuples sorted by mz,
  # hence the 1-tuples for comparison
  lower_index = bisect.bisect( spectrum.peaks, ( mzRange[0], ) )
  upper_index = bisect.bisect( spectrum.peaks, ( mzRange[1], ) )
  spectrum.peaks = spectrum.peaks[ lower_index : upper_index ]

And if you re-define spectrum.peaks with a sliced version, then you wouldn't have to call clear(). So it should work :)
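The same slicing as a self-contained sketch (the peak list and function name here are invented; the assumption, as above, is that peaks are (mz, intensity) tuples sorted by mz):

```python
import bisect

def slice_peaks(peaks, mz_min, mz_max):
    """Return the (mz, intensity) tuples with mz_min <= mz <= mz_max."""
    # comparing against tuples keeps bisect well-defined on a tuple list
    lower = bisect.bisect_left(peaks, (mz_min,))
    upper = bisect.bisect_right(peaks, (mz_max, float("inf")))
    return peaks[lower:upper]

peaks = [(100.1, 5.0), (150.2, 80.0), (150.9, 12.0), (300.4, 7.0)]
print(slice_peaks(peaks, 149.0, 151.0))  # -> [(150.2, 80.0), (150.9, 12.0)]
```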

ajing commented 9 years ago

Thanks fu! Is it possible to do two loops? The outer loop goes through all specs and the inner loop only covers a certain range of retention times, e.g.

for spec in run:
    for sub_spec in range(spec, spec_3s_later):
        do_something()
fu commented 9 years ago

Could work, yes ... you might want to have a look at deque (https://docs.python.org/3/library/collections.html#collections.deque). You would have to use spec.deRef() in order to have real spectrum objects that you append to the queue.

pymzML normally does not initialise a new spectrum object for each spectrum but reuses the same object the whole time, in order to be faster ...
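A minimal sketch of such a sliding retention-time window, using synthetic (rt, value) pairs instead of deRef'd spectrum objects (the 3 s window width and all values here are invented):

```python
from collections import deque

# synthetic (retention time, payload) pairs standing in for spectra;
# with pymzML you would append spec.deRef() copies instead
spectra = [(0.0, "a"), (1.0, "b"), (2.5, "c"), (3.5, "d"), (6.0, "e")]

window = deque()
windows_seen = []
for rt, value in spectra:
    window.append((rt, value))
    # drop spectra older than 3 s relative to the newest one
    while window and rt - window[0][0] > 3.0:
        window.popleft()
    windows_seen.append([v for _, v in window])

print(windows_seen[-1])  # -> ['d', 'e']
```

Because the deque only ever holds the spectra inside the current window, memory stays bounded even for long runs, and the file is still read only once.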