Open jspaezp opened 4 months ago
Are there mzMLs in the wild that have multiple precursors annotated like this?
What is the performance impact of this for "normal" searches? I suspect it is non-zero (allocations in initial_hits, doubling of PreScore memory requirements), but it should be benchmarked.
For your use case, why not duplicate the entire spectrum and assign a unique precursor to each copy? My hunch is that it would be more efficient in terms of both search time and FDR.
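A minimal sketch of that duplication idea, done as a preprocessing step before the search. The `Spectrum` and `Precursor` types and the `split_by_precursor` function are hypothetical, simplified stand-ins and are not taken from the sage codebase:

```rust
/// Hypothetical, simplified precursor annotation (not sage's actual type).
#[derive(Debug, Clone, PartialEq)]
pub struct Precursor {
    pub mz: f32,
    pub charge: Option<u8>,
}

/// Hypothetical, simplified spectrum (not sage's actual type).
#[derive(Debug, Clone, PartialEq)]
pub struct Spectrum {
    pub id: String,
    pub precursors: Vec<Precursor>,
    /// (m/z, intensity) pairs
    pub peaks: Vec<(f32, f32)>,
}

/// Return one copy of the spectrum per annotated precursor.
/// Spectra with zero or one precursor are returned unchanged.
pub fn split_by_precursor(spectrum: &Spectrum) -> Vec<Spectrum> {
    if spectrum.precursors.len() <= 1 {
        return vec![spectrum.clone()];
    }
    spectrum
        .precursors
        .iter()
        .enumerate()
        .map(|(i, precursor)| Spectrum {
            // Suffix the id so each copy stays uniquely identifiable.
            id: format!("{}_precursor{}", spectrum.id, i),
            precursors: vec![precursor.clone()],
            // The peak list is duplicated, which is where the extra memory goes.
            peaks: spectrum.peaks.clone(),
        })
        .collect()
}
```

The trade-off is that every copy carries its own peak list and is scored independently, so each precursor yields a separate set of PSMs rather than competing within one spectrum.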
I am not sure! I know that in some MS3-TMT experiments the MS3 precursor is selected from multiple MS2 peaks, so the MS3 scan has multiple precursors annotated (but in those cases the MS3 is only used for the reporter ions, not for the search). I can imagine some hacky methods that might make use of it, but they are definitely in the experimental realm. Having said that, based on some of the issues I see in the repo, a non-negligible number of people are using the project with self-packed mzML and MGF files, so I can imagine that someone else might make use of this feature if it were present.
On my system, using a human proteome, a closed search, and a random .d file, the impact is pretty negligible. I believe it would increase the allocations of initial_hits and prescore, BUT only by 1 per thread, since they are aggregated per spectrum.
/usr/bin/time -lh ./target/release/sage --write-pin sageconfig.json
# Master
21.50s real 1m20.74s user 7.95s sys
3602317312 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
1210225 page reclaims
33 page faults
0 swaps
0 block input operations
0 block output operations
14 messages sent
18 messages received
0 signals received
16047 voluntary context switches
456758 involuntary context switches
412891948273 instructions retired
257530296274 cycles elapsed
4162711744 peak memory footprint
# feature/notched_search
21.64s real 1m21.62s user 7.93s sys
3588833280 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
1200829 page reclaims
20 page faults
0 swaps
0 block input operations
0 block output operations
14 messages sent
19 messages received
0 signals received
16104 voluntary context switches
469975 involuntary context switches
413070797351 instructions retired
259062984671 cycles elapsed
4160171904 peak memory footprint
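To put the two runs side by side, here is a small helper for the relative change between the master and feature/notched_search numbers above. The function name `percent_change` is just illustrative; the inputs are copied from the `time` output:

```rust
/// Relative change of `candidate` versus `baseline`, in percent.
/// Positive means the candidate value is larger (slower, or more memory).
pub fn percent_change(baseline: f64, candidate: f64) -> f64 {
    (candidate - baseline) / baseline * 100.0
}

/// Values copied from the `time` output above: (master, feature/notched_search).
pub const WALL_SECONDS: (f64, f64) = (21.50, 21.64);
pub const MAX_RSS_BYTES: (f64, f64) = (3_602_317_312.0, 3_588_833_280.0);
pub const INSTRUCTIONS_RETIRED: (f64, f64) = (412_891_948_273.0, 413_070_797_351.0);
```

For example, `percent_change(WALL_SECONDS.0, WALL_SECONDS.1)` comes out well under 1%, and the maximum resident set size is slightly lower on the feature branch. This is a single run per branch, so differences of this size may be within run-to-run noise.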
LMK what you think! If you feel like it is not within the scope of the project I can maintain a fork with the feature for my needs (I try to be good at maintaining my PRs but ultimately you are the maintainer of sage)
I'm not opposed to adding features to support homebrewed mzMLs, but they should be "zero-cost" with respect to running standard searches - e.g. those features shouldn't impose a non-trivial cost on the other 95% of searches. Fussing over every byte is how you get fast :)
I changed the implementation, and it now uses a run-length encoded approach to store the precursor information (precursors are stored in order, along with the number of hits per precursor). On my system this currently makes open search (±100 Da, human proteome) ~2% slower, and closed search slower than that (~8% in my tests); there might be some additional optimization possible in calculating the precursor masses. I guess another alternative is to have an option to disable the feature, which should actually be zero-cost (the binary might be a bit larger, if I "understand" what the compiler will do).
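For readers following along, a rough sketch of what run-length encoded precursor bookkeeping can look like: hits are kept in precursor order, and only the number of hits per precursor is stored instead of a precursor index on every hit. The `PrecursorRuns` type and its methods are hypothetical and do not mirror the actual code in this PR:

```rust
/// Run-length encoded mapping from hit positions to precursor indices.
/// Hypothetical layout: entry `i` is the number of consecutive hits
/// that belong to precursor `i`.
#[derive(Debug, Default, Clone)]
pub struct PrecursorRuns {
    hits_per_precursor: Vec<u32>,
}

impl PrecursorRuns {
    /// Record that the next precursor (in annotation order) produced `n_hits` hits.
    pub fn push_run(&mut self, n_hits: u32) {
        self.hits_per_precursor.push(n_hits);
    }

    /// Total number of hits across all precursors.
    pub fn total_hits(&self) -> usize {
        self.hits_per_precursor.iter().map(|&n| n as usize).sum()
    }

    /// Index of the precursor that produced the hit at position `hit_idx`,
    /// or `None` if `hit_idx` is past the last hit.
    pub fn precursor_of(&self, hit_idx: usize) -> Option<usize> {
        let mut end = 0usize;
        for (precursor_idx, &n) in self.hits_per_precursor.iter().enumerate() {
            end += n as usize;
            if hit_idx < end {
                return Some(precursor_idx);
            }
        }
        None
    }
}
```

With a single precursor this stores one counter per spectrum, which is why the memory overhead stays small; the extra cost shows up in the lookup and in recomputing precursor masses per run.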
This PR adds support for multiple precursor isolation windows.
More precisely: right now, if a spectrum has multiple precursors, sage uses the first one for the search and disregards the rest. With this PR, it will score candidates for all of the annotated precursors.
In my experiments (as expected) it has no effect on the results for files that have a single isolation window.
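The behavioral difference can be illustrated with a toy candidate filter. This is a simplified sketch, not sage's scoring code: `matching_candidates`, the absolute tolerance in Da, and the `all_precursors` flag are all assumptions made for the example.

```rust
/// Return the indices of candidate peptides whose mass falls within
/// `tol_da` of at least one searched precursor mass.
///
/// With `all_precursors == false` only the first precursor is used
/// (the behavior before this PR); with `true` every annotated precursor
/// is searched. Each candidate is reported at most once.
pub fn matching_candidates(
    candidate_masses: &[f64],
    precursor_masses: &[f64],
    tol_da: f64,
    all_precursors: bool,
) -> Vec<usize> {
    // Restrict to the first precursor (if any) unless all are requested.
    let searched: &[f64] = if all_precursors {
        precursor_masses
    } else {
        &precursor_masses[..precursor_masses.len().min(1)]
    };

    let mut hits = Vec::new();
    for (idx, &mass) in candidate_masses.iter().enumerate() {
        if searched.iter().any(|&p| (mass - p).abs() <= tol_da) {
            hits.push(idx);
        }
    }
    hits
}
```

For a spectrum with a single precursor both modes select the same candidates, which matches the observation that results do not change for files with a single isolation window.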
The main use I have in mind for this feature is for pseudo-generated spectra from DIA, where the assigned precursor might be ambiguous. In that case I could just annotate it as having both precursors and let them fight it out within the search engine.
TODO: add tests to make sure all of the precursors are used, and make sure this does not break open/wide-window search; the overlapping ranges might need to be simplified.
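One possible way to simplify overlapping ranges is a standard sort-and-merge pass over the precursor mass windows, so that no candidate is visited twice in an open or wide-window search. This is only a sketch of that idea; `merge_ranges` and the `(low, high)` tuple representation are assumptions and not part of the PR:

```rust
/// Merge overlapping (or touching) `(low, high)` mass ranges into a
/// minimal set of disjoint ranges, sorted by their lower bound.
pub fn merge_ranges(mut ranges: Vec<(f32, f32)>) -> Vec<(f32, f32)> {
    // Sort by lower bound; `total_cmp` gives a total order on floats.
    ranges.sort_by(|a, b| a.0.total_cmp(&b.0));

    let mut merged: Vec<(f32, f32)> = Vec::with_capacity(ranges.len());
    for (lo, hi) in ranges {
        if let Some(last) = merged.last_mut() {
            if lo <= last.1 {
                // Overlaps the previous range: extend it instead of adding a new one.
                last.1 = last.1.max(hi);
                continue;
            }
        }
        merged.push((lo, hi));
    }
    merged
}
```

Note that merging loses the information about which precursor a hit came from, so it would have to be reconciled with the per-precursor hit counts if those are still needed downstream.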