lazear / sage

Proteomics search & quantification so fast that it feels like magic
https://sage-docs.vercel.app
MIT License

Notched search #120

Open jspaezp opened 4 months ago

jspaezp commented 4 months ago

This PR adds support for multiple precursor isolation windows.

More precisely: right now, if a spectrum has multiple precursors, sage uses the first one for the search and disregards the rest. With this PR, it scores candidates against all of the annotated precursors.
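A rough sketch of the behavioral difference, assuming candidate peptides are available as a sorted list of neutral masses. The `Precursor` type, the `candidate_ranges` function, and the `all_precursors` flag are hypothetical names for illustration and are not sage's actual API.

```rust
/// Hypothetical precursor annotation (not sage's real type).
#[derive(Debug, Clone, Copy)]
pub struct Precursor {
    pub mz: f32,
    pub charge: u8,
}

/// Proton mass in Da, used to convert m/z to neutral mass.
const PROTON: f32 = 1.007_276;

/// For each precursor considered, return the half-open index range
/// `[start, end)` of candidates in `sorted_masses` within `tol_da`.
/// `all_precursors = false` mimics the "first precursor only" behavior.
pub fn candidate_ranges(
    sorted_masses: &[f32],
    precursors: &[Precursor],
    tol_da: f32,
    all_precursors: bool,
) -> Vec<(usize, usize)> {
    let n = if all_precursors { precursors.len() } else { 1 };
    precursors
        .iter()
        .take(n)
        .map(|p| {
            let mass = (p.mz - PROTON) * f32::from(p.charge);
            // Binary search for both edges of the tolerance window.
            let start = sorted_masses.partition_point(|&m| m < mass - tol_da);
            let end = sorted_masses.partition_point(|&m| m <= mass + tol_da);
            (start, end)
        })
        .collect()
}
```

With a single annotated precursor, both settings return the same single range, which matches the expectation that single-window files are unaffected.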

In my experiments (as expected) it has no effect on the results for files that have a single isolation window.

The main use I have in mind for this feature is for pseudo-generated spectra from DIA, where the assigned precursor might be ambiguous. In that case I could just annotate it as having both precursors and let them fight it out within the search engine.

TODO: add tests to make sure all of the precursors are used, and make sure it does not screw up open/wide-window search; the overlapping ranges might need to be simplified (merged).
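Simplifying overlapping ranges could look something like the sketch below, assuming each precursor contributes one `(lo, hi)` mass window in Da. `merge_windows` is a hypothetical helper and is not part of sage or of this PR.

```rust
/// Merge overlapping or touching `(lo, hi)` mass windows into a
/// minimal, sorted set of disjoint windows.
pub fn merge_windows(mut windows: Vec<(f32, f32)>) -> Vec<(f32, f32)> {
    // Sort by lower bound so that overlapping windows become adjacent.
    windows.sort_by(|a, b| a.0.total_cmp(&b.0));

    let mut merged: Vec<(f32, f32)> = Vec::with_capacity(windows.len());
    for (lo, hi) in windows {
        if let Some(last) = merged.last_mut() {
            if lo <= last.1 {
                // Overlaps the previous window: extend its upper bound.
                last.1 = last.1.max(hi);
                continue;
            }
        }
        merged.push((lo, hi));
    }
    merged
}
```

In a wide-window search, two notches a few Da apart would collapse into one window, so no candidate is scored twice for the same spectrum.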

lazear commented 4 months ago

Are there mzMLs in the wild that have multiple precursors annotated like this?

What is the performance impact of this for "normal" searches? I suspect it is non-zero (allocations in initial_hits, doubling of PreScore memory requirements), but it should be benchmarked.

For your use case - why not duplicate the entire spectrum and assign a unique precursor to each copy? My hunch is that it would be more efficient in terms of both search time and FDR.
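The duplication approach could be done as a preprocessing step, roughly as sketched here. The simplified `Spectrum` type, the `split_by_precursor` function, and the `_p{index}` id suffix are all assumptions made for illustration; they are not sage's types or naming.

```rust
/// Hypothetical, simplified spectrum (not sage's real type).
#[derive(Debug, Clone, PartialEq)]
pub struct Spectrum {
    pub id: String,
    pub precursor_mzs: Vec<f32>,
    /// (m/z, intensity) pairs.
    pub peaks: Vec<(f32, f32)>,
}

/// Emit one copy of the spectrum per annotated precursor.
pub fn split_by_precursor(spectrum: &Spectrum) -> Vec<Spectrum> {
    spectrum
        .precursor_mzs
        .iter()
        .enumerate()
        .map(|(i, &mz)| Spectrum {
            // Suffix the id so the copies remain distinguishable downstream.
            id: format!("{}_p{}", spectrum.id, i),
            precursor_mzs: vec![mz],
            // The peak list is cloned for every copy.
            peaks: spectrum.peaks.clone(),
        })
        .collect()
}
```

Each copy would then be searched and scored as an independent spectrum, at the cost of one extra peak-list copy per additional precursor.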

jspaezp commented 4 months ago
  1. I am not sure! I know that in some MS3-TMT experiments the MS3 precursor is sometimes selected from multiple MS2 peaks, so the MS3 spectrum has multiple precursors annotated (but in those cases the MS3 is only used for the reporter ions and not for search). I can imagine some hacky methods that might make use of it, but definitely in the experimental realm. Having said that, based on some of the issues I see in the repo, a non-negligible number of people are using the project with self-packed mzML and MGF files, so I can imagine that someone else might make use of this feature if present.

  2. On my system, using a human proteome, closed search and a random .d file, the impact is pretty negligible. I believe it would increase the allocations of initial_hits and PreScore, BUT only by 1 per thread, since they are aggregated per spectrum.

/usr/bin/time -lh ./target/release/sage --write-pin sageconfig.json
# Master
        21.50s real             1m20.74s user           7.95s sys
          3602317312  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
             1210225  page reclaims
                  33  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                  14  messages sent
                  18  messages received
                   0  signals received
               16047  voluntary context switches
              456758  involuntary context switches
        412891948273  instructions retired
        257530296274  cycles elapsed
          4162711744  peak memory footprint

# feature/notched_search
        21.64s real             1m21.62s user           7.93s sys
          3588833280  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
             1200829  page reclaims
                  20  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                  14  messages sent
                  19  messages received
                   0  signals received
               16104  voluntary context switches
              469975  involuntary context switches
        413070797351  instructions retired
        259062984671  cycles elapsed
          4160171904  peak memory footprint
  3. I believe duplicating the spectrum would lead to some FDR issues that I am not totally satisfied with ... since the Poisson distribution and the number of scored candidates would reflect only the candidates scored for that notch and not the total number for that spectrum. (I have not done an entrapment experiment to confirm that it has an undesired effect, but 'it feels right'.)

LMK what you think! If you feel like it is not within the scope of the project I can maintain a fork with the feature for my needs (I try to be good at maintaining my PRs but ultimately you are the maintainer of sage)

lazear commented 4 months ago

I'm not opposed to adding features to support homebrewed mzMLs, but they should be "zero-cost" with respect to running standard searches - e.g. those features shouldn't impose a non-trivial cost on the other 95% of searches. Fussing over every byte is how you get fast :)

jspaezp commented 3 months ago

I changed the implementation and now it uses a run-length encoded approach to store the precursor information (precursors are stored in order, and the number of hits per precursor is stored). This currently makes it ~2% slower on open search on my system (±100 Da, human proteome), and closed search is slowed down by more than that (~8% in my tests) ... there might be some additional optimization possible in calculating the precursor masses. I guess another alternative is to have an option to disable the feature, which should actually be zero-cost (the binary might be a bit larger, if I "understand" what the compiler will do).
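One way to read the run-length encoded layout, as a rough sketch: hits sit in one flat list in precursor order, and only a hit count per precursor is kept alongside it. `HitRuns` and its methods are hypothetical names; this is not the code in the PR.

```rust
/// Hypothetical run-length record: `counts[i]` is the number of hits
/// contributed by precursor `i`, with hits stored flat in precursor order.
#[derive(Debug, Default)]
pub struct HitRuns {
    counts: Vec<u32>,
}

impl HitRuns {
    /// Record how many hits the next precursor produced (may be zero).
    pub fn push_run(&mut self, n_hits: u32) {
        self.counts.push(n_hits);
    }

    /// Total number of hits across all precursors.
    pub fn total(&self) -> usize {
        self.counts.iter().map(|&n| n as usize).sum()
    }

    /// Map a position in the flat hit list back to its precursor index.
    pub fn precursor_of(&self, hit_idx: usize) -> Option<usize> {
        let mut end = 0usize;
        for (i, &n) in self.counts.iter().enumerate() {
            end += n as usize;
            if hit_idx < end {
                return Some(i);
            }
        }
        None
    }
}
```

For a spectrum with a single precursor this adds one counter instead of a precursor tag on every hit, which is the kind of overhead the benchmarks above are trying to measure.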