marcottelab / MSblender

MSblender is a statistical tool for merging database search results from multiple database search engines for peptide identification based on a multivariate modeling approach.
http://www.marcottelab.org/index.php/MSblender
MIT License

Search parameters are inconsistent and outdated #1

Open aphorton opened 6 years ago

aphorton commented 6 years ago

Blender relies on PSMs shared between multiple search engines to construct and apply its FDR scoring model, and it currently uses Comet, XTandem with k-scoring, and MS-GF+.

For best performance, the set of possible PSMs (peptides and modifications) for a spectrum should be the same across all search engines.

Inconsistent search spaces can lead to both poorer-scoring true positive PSMs and better-scoring false positives.

We can't correct this completely, due to differences in the individual algorithms, but we should aim to make the search spaces as similar as possible.

Current parameter inconsistencies

| | Comet | XTandem | MS-GF+ |
|---|---|---|---|
| precursor mass tolerance | 3 amu | 30 ppm | 20 ppm |
| allowed C13 isotope errors | no | 0 to 2 | 0 to 1 |
| fragment mass tolerance | low res | N/A | Q-Exactive HCD (high res) |
| precursor max charge | 6 | N/A | 3 |
| static modifications | C+57.021464 | none | C+57 |
| variable modifications | M+15.9949 | M+15.9949 | none |

Additionally, XTandem searches several context-specific modifications by default: known potential single amino acid polymorphisms, protein N-term acetylation, and protein N-term glutamine (Q) mods of -17 Da and -18 Da. These should all be disabled.
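As a rough illustration of the kind of check that would catch these mismatches, here is a minimal sketch that flags any setting whose value differs between engines. The values are transcribed from the table above; the dictionary layout and the function name `inconsistent_settings` are hypothetical and not part of MSblender.

```python
# Values transcribed from the table above. The data layout and function name
# are hypothetical illustrations, not part of the MSblender code base.
CURRENT_PARAMS = {
    "precursor mass tolerance": {"Comet": "3 amu", "XTandem": "30 ppm", "MS-GF+": "20 ppm"},
    "allowed C13 isotope errors": {"Comet": "no", "XTandem": "0 to 2", "MS-GF+": "0 to 1"},
    "fragment mass tolerance": {"Comet": "low res", "XTandem": "N/A", "MS-GF+": "Q-Exactive HCD (high res)"},
    "precursor max charge": {"Comet": "6", "XTandem": "N/A", "MS-GF+": "3"},
    "static modifications": {"Comet": "C+57.021464", "XTandem": "none", "MS-GF+": "C+57"},
    "variable modifications": {"Comet": "M+15.9949", "XTandem": "M+15.9949", "MS-GF+": "none"},
}


def inconsistent_settings(params):
    """Return the settings whose values differ between engines.

    "N/A" is treated as "engine does not expose this setting" and is ignored.
    Values are compared as plain strings, so "C+57" and "C+57.021464" count
    as different.
    """
    report = {}
    for setting, by_engine in params.items():
        values = {v for v in by_engine.values() if v != "N/A"}
        if len(values) > 1:
            report[setting] = by_engine
    return report


for setting, by_engine in inconsistent_settings(CURRENT_PARAMS).items():
    print(f"{setting}: {by_engine}")
```

With the values above, every row is reported as inconsistent.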

aphorton commented 6 years ago

I'll get all the parameters as consistent as possible.

There are some weird choices, specifically for precursor mass tolerance. I'll update it to 10 ppm, which is still plenty wide for our modern instruments and methods.
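For a sense of scale, a small sketch of the ppm/Da conversion follows. The 1500 Da peptide mass is an arbitrary example value, and the helper names are made up for illustration.

```python
def ppm_to_da(mass_da, tol_ppm):
    """Convert a relative tolerance (ppm) into an absolute window (Da)."""
    return mass_da * tol_ppm / 1e6


def da_to_ppm(mass_da, tol_da):
    """Convert an absolute tolerance (Da) into a relative one (ppm)."""
    return tol_da / mass_da * 1e6


peptide_mass = 1500.0  # arbitrary example mass in Da, not from any dataset

for ppm in (10, 20, 30):
    print(f"{ppm} ppm at {peptide_mass} Da = +/- {ppm_to_da(peptide_mass, ppm):.3f} Da")

# A fixed 3 amu window expressed in ppm at the same mass
print(f"3 amu at {peptide_mass} Da = +/- {da_to_ppm(peptide_mass, 3.0):.0f} ppm")
```

At this example mass, 10 ppm is a window of about 0.015 Da, while 3 amu corresponds to 2000 ppm, so the engines were previously searching very different precursor windows.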

I'll create and push to a new branch for testing. Once that's done, I'd appreciate it if someone who uses Blender could compare results between the old and new param sets.

taejoon commented 6 years ago

Hi,

(1) My assumption is that each search engine sets its default parameters based on its own performance (so they presumably give the best results in general), and some parameters are not available in all search engines. That is the main reason I left the defaults in most cases. As for mass tolerance, every dataset has a different level of tolerance, so I always set it manually (as you can see in the directory, MSblender also supports low-res LTQ data, which requires a broader mass tolerance).

(2) An inconsistent search space may be an issue (e.g. PTMs allowed in one engine but not in another), especially if searching a larger space with one engine gives better results than the others (that is unfair). But in most cases, there is no 'definite winner' depending on search space.

(3) The original version of MSblender was tested with 'quite an old instrument' like the Orbitrap Classic, so, as you mentioned, the parameters could be improved by testing with data from new high-resolution instruments. If you have test data (e.g. human cell lysates with UPS2 spike-in), I am happy to run the test.

Best,

Taejoon

aphorton commented 6 years ago

Thanks, Taejoon!

Yes, I assumed you tested everything back in the day to optimize the search performance. :)

Some of these parameter differences may have originated inadvertently, after you left, during a large reorganization of MSblender. I'm going off of the code in the MSblender_restructure branch, since I think that's what people here are using. With the params for each algorithm mostly hard-coded and all in different locations, it's difficult to keep things consistent.

I'm putting my updates to the parameters outlined above in a new branch and will push to production only if search performance improves. Thanks for offering to help test with data from our newer instruments.

One more thing. Could you elaborate on your second point? Knowing MSblender better than I, do you think it can be advantageous to allow one engine a larger search space if it helps that engine get more PSMs? I worry those single-engine PSMs will not propagate with high confidence through MSblender and might even negatively impact MSblender's FDR distribution modeling.

Best, Andrew

aphorton commented 6 years ago

Commits 2396218dce912c9c9909fbe44f5892eeae655f98, 8d26b1232446c22be4b5321f4e13b0f63e28fd62, and 2546270713d2ae85d52db44ccd568de94b8c09d1 bring more consistency to the search parameters.

I also extracted MSGF params from the command line call in runMS2.sh and put them into a text file (in the ./params dir) with comments explaining the parameter options. The runMS2.sh script now loads MSGF params from that file.
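To illustrate the idea of keeping command-line options in a commented text file, here is a minimal sketch of a loader. The file format shown (one option per line, `#` starting a comment) and the function name `load_cli_params` are assumptions for illustration; the actual layout of the file in ./params and the way runMS2.sh reads it may differ. `-t` and `-ti` are the MS-GF+ options for precursor mass tolerance and isotope error range.

```python
import shlex


def load_cli_params(text):
    """Turn a commented params file into a flat list of command-line arguments.

    Assumed (hypothetical) format: one option per line, '#' starts a comment,
    blank lines are ignored.
    """
    args = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if line:
            args.extend(shlex.split(line))
    return args


# Illustrative file contents, not a copy of the real params file
example = """\
# precursor mass tolerance
-t 10ppm
# isotope error range
-ti 0,2
"""

print(load_cli_params(example))
```

The resulting list can be appended to the rest of the MS-GF+ command, which keeps the parameter choices and their explanations in one place.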

./params/MSGFplus_mods.txt is also new and enables user-defined PTMs for MS-GF+.

Here is the same table as above, updated with the new parameters.

| | Comet | XTandem | MS-GF+ |
|---|---|---|---|
| precursor mass tolerance | 10 ppm | 10 ppm | 10 ppm |
| allowed C13 isotope errors | -1 to 3* | 0 to 2 | 0 to 2 |
| fragment mass tolerance | low res | N/A | low res |
| precursor max charge | 6 | N/A | 6 |
| static modifications | C+57.021464 | C+57.021464 | C+57.021464 |
| variable modifications | M+15.9949 | M+15.9949 | M+15.9949 |

*Comet allows either 0 or -1,0,1,2,3 for precursor isotope error, nothing in between.
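To show why the isotope error setting changes the search space, here is a simplified sketch of precursor matching with C13 offsets. It is a generic illustration, not the matching logic of Comet, XTandem, or MS-GF+; the function name and the example masses are hypothetical.

```python
C13_C12_DELTA = 1.0033548  # mass difference between 13C and 12C, in Da


def precursor_matches(observed_mass, theoretical_mass, tol_ppm, isotope_errors):
    """Return True if the observed precursor mass matches the theoretical
    mass within tol_ppm after subtracting any allowed isotope offset."""
    for k in isotope_errors:
        corrected = observed_mass - k * C13_C12_DELTA
        error_ppm = abs(corrected - theoretical_mass) / theoretical_mass * 1e6
        if error_ppm <= tol_ppm:
            return True
    return False


# Hypothetical case: the instrument picked the +1 isotope peak of a 1500 Da peptide
theoretical = 1500.0
observed = 1501.0034

print(precursor_matches(observed, theoretical, 10, (0,)))        # no isotope errors allowed
print(precursor_matches(observed, theoretical, 10, (0, 1, 2)))   # 0 to 2 allowed
```

In this example the candidate is rejected when no isotope error is allowed and accepted with the 0 to 2 range, so engines with different settings would consider different candidate peptides for the same spectrum.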

These changes need testing and adjusting before they're adopted or reverted. And I wonder if a precursor max charge of 4 or 5 would generally perform better.

abattenhouse commented 6 years ago

Andrew -

I'd be happy to test any MSBlender changes you make using the Miller lab BRD datasets. Just let me know when you're ready for testing and where I should go for the source code.

On Fri, Aug 24, 2018 at 4:50 PM aphorton notifications@github.com wrote:

Blender relies on PSMs shared between multiple search engines to construct and apply its FDR scoring model, and it currently uses Comet, XTandem with k-scoring, and MS-GF+.

For best performance, the set of possible PSMs (peptides and modifications) for a spectrum should be the same across all search engines.

Inconsistent search spaces can lead to both poorer-scoring true positive PSMs and better-scoring false positives.

We can't correct this completely, due to differences in the individual algorithms, but we should aim to make the search spaces as similar as possible.

Current parameter inconsistencies

| | Comet | XTandem | MS-GF+ |
|---|---|---|---|
| precursor mass tolerance | 3 amu | 30 ppm | 20 ppm |
| allowed C13 isotope errors | no | 0 to 1 | 0 to 1 |
| fragment mass tolerance | low res | N/A | Q-Exactive HCD (high res) |
| precursor max charge | 6 | N/A | 3 |
| static modifications | C+57.021464 | none | C+57 |
| variable modifications | M+15.9949 | M+15.9949 | none |

Additionally, XTandem searches some more context-specific modifications by default: known potential single amino acid polymorphisms, protein N-term acetylation, and protein N-term glutamine (Q) mods of -17Da and -18Da. These should all be disabled.


clairemcwhite commented 6 years ago

I ran a fractionation with the MSblender_restructure branch and the new Consistent_params branch. There's a fairly consistent 5-10% increase in the number of unique peptides per fraction. It is lower in a few fractions, which is something we should watch out for.

image

clairemcwhite commented 6 years ago

image

Same plot counting unique proteins in each prot_count .group file.
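A per-fraction comparison like the one plotted above can be sketched as follows. The counts are made-up placeholder numbers, not the results from this run, and the fraction labels and function name are hypothetical.

```python
def percent_change(old_counts, new_counts):
    """Percent change in identifications per fraction, new vs. old."""
    changes = {}
    for fraction, old in old_counts.items():
        new = new_counts[fraction]
        changes[fraction] = (new - old) / old * 100.0
    return changes


# Placeholder counts for illustration only (NOT real results)
old = {"F01": 1000, "F02": 800, "F03": 1200}   # e.g. MSblender_restructure
new = {"F01": 1080, "F02": 850, "F03": 1150}   # e.g. Consistent_params

changes = percent_change(old, new)
for fraction, pct in changes.items():
    print(f"{fraction}: {pct:+.1f}%")

# Fractions where the new parameters identify fewer peptides or proteins
regressions = [fraction for fraction, pct in changes.items() if pct < 0]
print("fractions to watch:", regressions)
```

Listing the fractions with a negative change makes it easier to follow up on the few cases where the new parameters do worse.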