lazear / sage

Proteomics search & quantification so fast that it feels like magic
https://sage-docs.vercel.app

Using Sage for HLA Peptides #139

theGreatHerrLebert opened this issue 1 week ago

theGreatHerrLebert commented 1 week ago

Hey Michael,

For a current project we are working on, we are trying to use Sage with HLA peptides, specifically from the thunderPASEF method. We developed a workflow that splits larger FASTA files and merges the results again afterwards, to avoid running out of RAM during unspecific digestion, as suggested here. As a reference, we also used FragPipe with MSFragger, which also supports unspecific digestion for HLA searches. We tried to match the search engine parameters as closely as possible, with enzymatic cleavage set to unspecific for both. A minimal sketch of the splitting step is shown below.
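For context, something along these lines is what I mean by the splitting step (not our exact pipeline; file names and chunk count are placeholders):

```python
# Minimal sketch: split a FASTA into N smaller files so each unspecific-digestion
# search fits in RAM; the per-chunk search results are merged afterwards.
# File names and chunk count are placeholders, not our actual setup.

def read_fasta(path):
    """Yield (header, sequence) tuples from a FASTA file."""
    header, seq = None, []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line, []
            else:
                seq.append(line)
        if header is not None:
            yield header, "".join(seq)

def split_fasta(path, n_chunks=10, prefix="chunk"):
    """Write FASTA entries round-robin into n_chunks smaller files."""
    outs = [open(f"{prefix}_{i:02d}.fasta", "w") for i in range(n_chunks)]
    try:
        for i, (header, seq) in enumerate(read_fasta(path)):
            outs[i % n_chunks].write(f"{header}\n{seq}\n")
    finally:
        for out in outs:
            out.close()

split_fasta("proteome.fasta", n_chunks=10)
```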

It turns out that identification rates with Sage are much lower than with FragPipe, which might be partially attributed to FragPipe's intensive re-scoring with HLA-specific ML models. One thing that got us thinking, however, is the distribution of raw scores when comparing the outputs of Sage and FragPipe at 100 percent FDR. The score distributions look very different, as shown below:

[Figure: score densities, target vs decoy, FragPipe vs Sage]

[Figure: score counts, target vs decoy, FragPipe vs Sage]

We observed that:

Right now, we are concluding that either FragPipe modifies the score calculation for HLA, or, what I suspect more strongly, that decoy generation works differently, given that we also saw close to no overlap between the decoys reported by Sage and FragPipe. Do you have any idea why Sage seems to struggle with the large search space of unspecific cleavage? If so, is there anything we could do to make our pipeline work with Sage as well? I think making Sage work well with this kind of data would be a nice addition, since such a fast tool seems particularly well suited to large search spaces!

EDIT: The shift between the overall distributions of FragPipe vs Sage seems to be a multiplicative constant and could be the result of using a different logarithm base, maybe log2 vs ln, depending on the implementation of the X!Tandem score.
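A quick way to check this (column/file names below are placeholders, not the actual Sage/FragPipe output schema): since log2(x) = ln(x)/ln(2), a pure log-base difference would show up as a roughly constant quantile ratio of about 1/ln(2) ≈ 1.4427.

```python
import numpy as np

# If the only difference is the log base, the ratio of matched score quantiles
# between the two tools should be roughly flat and close to 1/ln(2) ≈ 1.4427.
# Input files are placeholders (one raw score per line, exported from each tool).

sage_scores = np.loadtxt("sage_hyperscores.txt")
fragpipe_scores = np.loadtxt("fragpipe_hyperscores.txt")

qs = np.linspace(0.05, 0.95, 19)
ratio = np.quantile(fragpipe_scores, qs) / np.quantile(sage_scores, qs)
print("quantile ratios:", np.round(ratio, 3))
print("1/ln(2) =", round(1 / np.log(2), 4))
```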

Best,

David

lazear commented 3 days ago

If you run a search without MSBooster rescoring (or just manually calculate q-values based on hyperscore), are the identification rates dramatically different?
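By "manually calculate q-values" I mean standard target-decoy competition along these lines (a minimal sketch; the column names are placeholders, not necessarily the actual output headers of either tool):

```python
import pandas as pd

# Minimal target-decoy q-value sketch: sort PSMs by hyperscore (descending),
# estimate FDR at each threshold as (#decoys + 1) / #targets, then take the
# running minimum from the bottom so q-values are monotone.
# Assumes a boolean "is_decoy" column and a numeric "hyperscore" column.

def tdc_qvalues(df, score_col="hyperscore", decoy_col="is_decoy"):
    df = df.sort_values(score_col, ascending=False).copy()
    decoys = df[decoy_col].cumsum()
    targets = (~df[decoy_col]).cumsum().clip(lower=1)
    fdr = (decoys + 1) / targets
    df["q_value"] = fdr[::-1].cummin()[::-1]
    return df

# usage (column names assumed): psms_with_q = tdc_qvalues(psms_dataframe)
```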

As you noted, in general, the target distributions from Sage and FragPipe look pretty similar... It is interesting that the overlap between target/decoy looks different.

One potential thing to look at: Sage doesn't do anything special to handle the isobaric nature of I/L amino acids. IIRC, FragPipe treats them as the same when making sure that no decoy peptide matching a target peptide ends up in the search space. For normal tryptic digests, I think it's rare that this has a significant effect, and since it complicates some downstream stuff I never implemented logic to handle it. It's possible that for non-specific digests it actually has a big impact. You could test this by manually overwriting all I -> L in your FASTA and seeing if it makes a difference.
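Something like this should be enough for a quick test (file names are placeholders):

```python
# Quick sketch for the I -> L test: rewrite every isoleucine as leucine in the
# sequence lines of a FASTA, leaving headers untouched. File names are placeholders.

with open("proteome.fasta") as src, open("proteome_I_as_L.fasta", "w") as dst:
    for line in src:
        if line.startswith(">"):
            dst.write(line)
        else:
            dst.write(line.replace("I", "L"))
```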

theGreatHerrLebert commented 11 hours ago

Hey Michael,

Thanks for your input. I re-ran the analysis with MSBooster turned off, and it provided some new insights:

Running with MSBooster turned off and calculating the FDR manually from the reported target and decoy scores yielded around 4,000 unique peptides at a q-value of 0.01, which is roughly twice the number we get from Sage. Letting FragPipe do the FDR control at 0.01 yielded around 16,000 unique peptides. This suggests that there is more involved in calculating the FDR; it cannot be hyperscore alone. Running with MSBooster turned on gave around 19,500 unique IDs at 0.01. I also observed that with MSBooster turned off, the hyperscores looked different:

[Figure: hyperscore densities, FragPipe vs Sage, MSBooster turned off]

So, I conclude that FragPipe's hyperscore formula might be different from the one Sage uses, and the scores shown in my initial post might already have been the re-scored values.
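For reference, the X!Tandem-style hyperscore is often written along the lines sketched below; whether this exact form (and which log base) matches what either engine actually implements is an assumption on my part, so treat it as illustrative only.

```python
import math

# Illustrative X!Tandem-style hyperscore (an assumption about the exact form used
# by either engine): sum of matched fragment intensities, scaled by factorials of
# the matched b- and y-ion counts, then log-transformed. Switching the log base
# (ln vs log2) only rescales the score by a constant factor.

def hyperscore(matched_b_intensities, matched_y_intensities, log=math.log):
    n_b, n_y = len(matched_b_intensities), len(matched_y_intensities)
    intensity_sum = sum(matched_b_intensities) + sum(matched_y_intensities)
    return log(max(intensity_sum, 1e-12) * math.factorial(n_b) * math.factorial(n_y))

# hyperscore(b, y, log=math.log2) differs from hyperscore(b, y, log=math.log)
# exactly by the factor 1 / ln(2) ≈ 1.4427.
```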

I also tested the I->L exchange but did not find a significant change in the score distributions. My current conclusion is that the poor performance in my setup is probably caused by a combination of factors.

I am still a little confused about why the number of reported decoys is so much lower for FragPipe, but of course there could be a more complicated mechanism at work to create them. Did you ever try out other scores for the initial generation of PSMs?

Best regards,

David