lazear / sage

Proteomics search & quantification so fast that it feels like magic
https://sage-docs.vercel.app

Using Sage for HLA Peptides #139

theGreatHerrLebert opened this issue 1 week ago

theGreatHerrLebert commented 1 week ago

Hey Michael,

For a current project we are working on, we are trying to use Sage with HLA peptides, specifically from the thunderPASEF method. We developed a workflow that splits larger FASTA files and merges the results again afterwards, to avoid running out of RAM during unspecific digestion, as suggested here. As a reference, we also used FragPipe with MSFragger, which also supports unspecific digestion for HLA searches. We tried to match the search engine parameters as closely as possible, with enzymatic cleavage set to unspecific for both. A minimal sketch of the splitting step is shown below.
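For context, something along these lines is what I mean by the splitting step (not our exact pipeline; file names and chunk count are placeholders):

```python
# Minimal sketch: split a FASTA into N smaller files so each unspecific-digestion
# search fits in RAM; the per-chunk search results are merged afterwards.
# File names and chunk count are placeholders, not our actual setup.

def read_fasta(path):
    """Yield (header, sequence) tuples from a FASTA file."""
    header, seq = None, []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line, []
            else:
                seq.append(line)
        if header is not None:
            yield header, "".join(seq)

def split_fasta(path, n_chunks=10, prefix="chunk"):
    """Write FASTA entries round-robin into n_chunks smaller files."""
    outs = [open(f"{prefix}_{i:02d}.fasta", "w") for i in range(n_chunks)]
    try:
        for i, (header, seq) in enumerate(read_fasta(path)):
            outs[i % n_chunks].write(f"{header}\n{seq}\n")
    finally:
        for out in outs:
            out.close()

split_fasta("proteome.fasta", n_chunks=10)
```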

It turns out that identification rates with Sage are much lower than with FragPipe, which might be partially attributed to FragPipe's intensive re-scoring with HLA-specific ML models. One thing that got us thinking, however, is the distribution of raw scores when comparing the outputs of Sage and FragPipe at 100 percent FDR. The score distributions look very different, as shown below:

[Figure: score densities, target vs decoy, FragPipe vs Sage]

[Figure: score counts, target vs decoy, FragPipe vs Sage]

We observed that:

Right now, we are concluding that either FragPipe modifies the score calculation for HLA, or, what I suspect more strongly, that decoy generation works differently, given that we also saw close to no overlap between the decoys reported by Sage and FragPipe. Do you have any idea why Sage seems to struggle with the large search space of unspecific cleavage? If so, is there anything we could do to make our pipeline work with Sage as well? I think making Sage work well with this kind of data would be a nice addition, since such a fast tool seems particularly well suited to large search spaces!

EDIT: The shift between the overall distributions of FragPipe vs Sage seems to be a multiplicative constant and could be the result of using a different logarithm base, maybe log2 vs ln, depending on the implementation of the X!Tandem score.
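A quick way to check this (column/file names below are placeholders, not the actual Sage/FragPipe output schema): since log2(x) = ln(x)/ln(2), a pure log-base difference would show up as a roughly constant quantile ratio of about 1/ln(2) ≈ 1.4427.

```python
import numpy as np

# If the only difference is the log base, the ratio of matched score quantiles
# between the two tools should be roughly flat and close to 1/ln(2) ≈ 1.4427.
# Input files are placeholders (one raw score per line, exported from each tool).

sage_scores = np.loadtxt("sage_hyperscores.txt")
fragpipe_scores = np.loadtxt("fragpipe_hyperscores.txt")

qs = np.linspace(0.05, 0.95, 19)
ratio = np.quantile(fragpipe_scores, qs) / np.quantile(sage_scores, qs)
print("quantile ratios:", np.round(ratio, 3))
print("1/ln(2) =", round(1 / np.log(2), 4))
```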

Best,

David

lazear commented 3 days ago

If you run a search without MSBooster rescoring (or just manually calculate q-values based on hyperscore), are the identification rates dramatically different?
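By "manually calculate q-values" I mean standard target-decoy competition along these lines (a minimal sketch; the column names are placeholders, not necessarily the actual output headers of either tool):

```python
import pandas as pd

# Minimal target-decoy q-value sketch: sort PSMs by hyperscore (descending),
# estimate FDR at each threshold as (#decoys + 1) / #targets, then take the
# running minimum from the bottom so q-values are monotone.
# Assumes a boolean "is_decoy" column and a numeric "hyperscore" column.

def tdc_qvalues(df, score_col="hyperscore", decoy_col="is_decoy"):
    df = df.sort_values(score_col, ascending=False).copy()
    decoys = df[decoy_col].cumsum()
    targets = (~df[decoy_col]).cumsum().clip(lower=1)
    fdr = (decoys + 1) / targets
    df["q_value"] = fdr[::-1].cummin()[::-1]
    return df

# usage (column names assumed): psms_with_q = tdc_qvalues(psms_dataframe)
```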

As you noted, in general, the target distributions from Sage and FragPipe look pretty similar... It is interesting that the overlap between target/decoy looks different.

One potential thing to look at: Sage doesn't do anything special to handle the isobaric nature of I/L amino acids. IIRC, FragPipe treats them as the same when making sure that no decoy peptide matching a target peptide ends up in the search space. For normal tryptic digests, I think it's rare that this has a significant effect, and since it complicates some downstream stuff I never implemented logic to handle it. It's possible that for non-specific digests it actually has a big impact. You could test this by manually overwriting all I -> L in your FASTA and seeing if it makes a difference.
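Something like this should be enough for a quick test (file names are placeholders):

```python
# Quick sketch for the I -> L test: rewrite every isoleucine as leucine in the
# sequence lines of a FASTA, leaving headers untouched. File names are placeholders.

with open("proteome.fasta") as src, open("proteome_I_as_L.fasta", "w") as dst:
    for line in src:
        if line.startswith(">"):
            dst.write(line)
        else:
            dst.write(line.replace("I", "L"))
```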

theGreatHerrLebert commented 11 hours ago

Hey Michael,

Thanks for your input. I re-ran the analysis with MSBooster turned off, and it provided some new insights:

Running with MSBooster turned off and calculating the FDR manually from the reported target and decoy scores yielded around 4,000 unique peptides at a q-value of 0.01, which is roughly twice the number we get from Sage. Letting FragPipe do the FDR control at 0.01 yielded around 16,000 unique peptides. This suggests that there is more involved in calculating the FDR; it cannot be hyperscore alone. Running with MSBooster turned on gave around 19,500 unique IDs at 0.01. I also observed that with MSBooster turned off, the hyperscores looked different:

[Figure: hyperscore densities, FragPipe vs Sage, MSBooster turned off]

So, I conclude that FragPipe's hyperscore formula might be different from the one Sage uses, and the scores shown in my initial post might already have been the re-scored values.
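For reference, the X!Tandem-style hyperscore is often written along the lines sketched below; whether this exact form (and which log base) matches what either engine actually implements is an assumption on my part, so treat it as illustrative only.

```python
import math

# Illustrative X!Tandem-style hyperscore (an assumption about the exact form used
# by either engine): sum of matched fragment intensities, scaled by factorials of
# the matched b- and y-ion counts, then log-transformed. Switching the log base
# (ln vs log2) only rescales the score by a constant factor.

def hyperscore(matched_b_intensities, matched_y_intensities, log=math.log):
    n_b, n_y = len(matched_b_intensities), len(matched_y_intensities)
    intensity_sum = sum(matched_b_intensities) + sum(matched_y_intensities)
    return log(max(intensity_sum, 1e-12) * math.factorial(n_b) * math.factorial(n_y))

# hyperscore(b, y, log=math.log2) differs from hyperscore(b, y, log=math.log)
# exactly by the factor 1 / ln(2) ≈ 1.4427.
```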

I also tested the I->L exchange but did not find a significant change in the score distributions. My current conclusion is that the poor performance in my setup is probably caused by a combination of factors.

I am still a little confused about why the number of reported decoys is so much lower for FragPipe, but of course there could be a more complicated mechanism at work to create them. Did you ever try out other scores for the initial generation of PSMs?

Best regards,

David