Random questions regarding splitting mgfs, specifying charge states, multiple possible precursors....

I've been playing with Sage recently, and love it! A few random questions all over the board:

1. If I split my giant MGF file into lots of little MGF files and had Sage run on each of them, should I expect the final results to be the same? I presume Sage's result for any individual spectrum is independent of the results for every other spectrum. If not, how do the results differ? Also, is the overhead of reading and parsing the FASTA file significant? If it is, is there a way to preparse the FASTA file so that each Sage process (which may be on a different VM) does not need to spend that time again?
2. Is there a way to specify the charges that should be considered for all the spectra, e.g., 1 through 5 inclusive? (My MGF files are not "real" MGF files; they're essentially pseudo-MGF files similar to how DIA-Umpire works, so I don't know the charge state of any spectrum.)
3. (Simple?) feature request: if question 2 is possible, could the precursor tolerance be expressed in Thomson? For example, if the tolerance is 5 Th (for essentially 10 Th wide DIA windows), Sage would be smart enough to know that the resulting tolerance is ±5 Da for charge state 1, ±10 Da for charge state 2, and so on (i.e., for charge state z, the tolerance in Da is the tolerance in Th × z).
4. Conversely, instead of having a single precursor associated with a spectrum (in, say, an MGF or mzML file), could we associate multiple precursors with a spectrum? (Again, these are pseudo-MGF files, so I may have a list of possible MS1 precursors, but I don't know which one is the correct precursor, nor do I know the charge; the tolerance, if expressed in ppm, is known.) I could "fake" this by creating multiple spectra in an MGF file, each with a different precursor value. My fear is that, since Sage would not know these are really a single spectrum with different possible precursor values, it would not make the results "compete" with each other when determining a) ranking and b) possibly other stats, such as q-values and p-values. (I can elaborate if this last point is not clear.)
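
To make the Thomson-to-Dalton arithmetic in the feature request concrete, here is a minimal Python sketch. The function names are hypothetical and this is not Sage code; it only assumes the relation stated above (tolerance in Da = tolerance in Th × z) and a fixed proton mass for converting m/z to neutral mass.

```python
PROTON_MASS = 1.007276466  # Da; used to convert m/z to neutral mass


def tolerance_da_by_charge(tolerance_th, charges=range(1, 6)):
    """Map each charge state z to the Da tolerance implied by a Th tolerance."""
    return {z: tolerance_th * z for z in charges}


def neutral_mass_window(precursor_mz, tolerance_th, z):
    """Neutral-mass bounds (Da) for an m/z window of +/- tolerance_th at charge z."""
    low = (precursor_mz - tolerance_th - PROTON_MASS) * z
    high = (precursor_mz + tolerance_th - PROTON_MASS) * z
    return low, high


if __name__ == "__main__":
    # A 5 Th tolerance evaluated for charge states 1 through 5.
    for z, tol in tolerance_da_by_charge(5.0).items():
        print(f"z={z}: +/- {tol} Da")
```

The width of the neutral-mass window grows linearly with z (2 × tolerance in Th × z), which is why a single Da or ppm setting cannot express a fixed m/z window across charge states.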

Thanks in advance!

1. The assignment of peptides -> spectra should be identical, since that's an independent process for each spectrum. The actual scores of those PSMs will vary slightly because the aligned_rt value will differ: with 1 file, aligned RT = actual RT, while with > 1 file, aligned RT ~ mean RT for that peptide across files. That affects some of the features that feed into LDA. Reading a FASTA file should be nearly instant, but generating the modified peptides and the fragment index will be a bottleneck for large search spaces (lots of peptides, lots of variable mods, etc.). There isn't really a big benefit to splitting the MGF, except that the MGF files will be parsed in parallel. Spectra are always searched in parallel, regardless of how many files they were parsed from.
2. I will add support for this; see the related issue: https://github.com/lazear/sage/issues/137. For now, if you simply remove the annotated charge states (since it seems like you are comfortable modifying your MGF files), Sage will consider all of the charge states you specify in the config file.
3. This is actually how the current DIA search mode (wide_window = True) works in Sage. The isolation window is multiplied by each of the specified charge states (2, 3, and 4 by default), and the spectrum is searched multiple times with different tolerances.
4. You can roughly approximate this now with an open search and report_psms = 5. There is also a PR you could test out: https://github.com/lazear/sage/pull/120
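
For the workaround of removing annotated charge states from the MGF files, a minimal Python sketch might look like the following. It assumes the charge annotations are standard MGF `CHARGE=` lines (e.g. `CHARGE=2+`); the function names and file paths are hypothetical, and nothing here is part of Sage itself.

```python
def strip_charge_lines(lines):
    """Yield MGF lines unchanged, skipping any CHARGE= annotation lines."""
    for line in lines:
        if line.strip().upper().startswith("CHARGE="):
            continue  # drop the annotated charge state
        yield line


def strip_charge_file(src_path, dst_path):
    """Stream src_path to dst_path without CHARGE= lines (works for large files)."""
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        dst.writelines(strip_charge_lines(src))
```

Usage would be something like `strip_charge_file("input.mgf", "input.nocharge.mgf")`. Note that this also drops a file-level `CHARGE=` header if one is present, since it matches every line that starts with `CHARGE=`.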

lazear/sage issue #138