snijderlab / stitch

Template-based assembly of proteomics short reads for de novo antibody sequencing and repertoire profiling
MIT License
22 stars 3 forks source link

how to set parameters to make sure only peptides with certain identity can be matched to templates? #251

Open irleader opened 1 day ago

irleader commented 1 day ago

Hi,

I am currently using the default mass_alphabet.txt, with parameters : EnforceUnique: 1, CutoffScore: 0, AmbiguityThreshold : 0.9.

How can I adjust the setting to make sure, only peptides with higher than 90% identitiy with template protein can be matched? This means, if peptide length is 10 AAs, it needs at least 9 AAs exactly the same as the protein to be matched.

Also, if I do not want to allow any gaps, how should I adjust the mass_alphabet.txt, I found the following might be of relevance:

GapStart   : -12
GapExtend  : -1
PatchLength: 3

Asymmetric sets ->
    Score: 1
    Sets :>
        X->A,R,N,D,C,Q,E,G,H,I,L,J,K,M,F,P,S,T,W,Y,V,B,Z -Remove mismatch penalty on single gaps
        A,R,N,D,C,Q,E,G,H,I,L,J,K,M,F,P,S,T,W,Y,V,B,Z->X -Remove mismatch penalty on single gaps
    <:
<-

Asymmetric sets ->
    Score: -4
    Sets :>
        .->A,R,N,D,C,Q,E,G,H,I,L,J,K,M,F,P,S,T,W,Y,V,B,Z -Add additional penalty on bigger gaps
        A,R,N,D,C,Q,E,G,H,I,L,J,K,M,F,P,S,T,W,Y,V,B,Z->. -Add additional penalty on bigger gaps
    <:
<-
douweschulte commented 1 day ago

The mass alphabet does not work in terms of identity it works on mass similarity. So changing the parameters to have a defined identity cutoff is not possible with the mass based alignment. If you want to only work with identity you could use the identity alphabet instead, and then provide a fitting cutoff. Note that this cutoff is not linearly normalised for peptide length so a definition of 90% identity irrespective of peptide length is not possible. This is done because we value larger peptides more even if they are relatively speaking worse than a smaller peptide. There is no way to fully disallow gaps, but setting the gapstart score to something very low (say -1000) should for all practical purposes fully disallow gaps.

Personally I always use the mass alphabet though, this allows for mistakes like GG to N match which have the same mass and so are often incorrectly identified by de novo peptide sequencing software. When using this we use a cutoff score range of roughly 8 to 20, but normally start at 10. Then after the first run we decrease this if only a very limited amount of peptides is placed or increase if we see too many false matches.

I hope this helps. If you have any more questions feel free to reach out again.

irleader commented 1 day ago

Thanks for the detailed explanation! I will then stick with mass alphabet as suggested. And I just tried, the minimum GapStart score is -128, I will use -128 then.

However, I am still a bit confused with the definition of CutoffScore in TemplateMatching, "The mean score per position needed for a path to be included in the Segment score." While in mass_alphabet.txt, identity for each position is only 8, for all other cases, the score per position is either lower than 8 or negative, which means the highest score per position can only be 8? How can we set a CutoffScore of 10 or even 20?

douweschulte commented 1 day ago

As detailed in the manual section on scoring the formula for finding the normalised score for a peptide is as follows:

matchScore >= cutoffScore * sqrt(min(readLength, templateLength))

As this does not scale linearly higher scores than the highest local score can still result in a match.