snijderlab / stitch

Template-based assembly of proteomics short reads for de novo antibody sequencing and repertoire profiling
MIT License

Suggested setting for fasta file as input with certain modifications #250

Open irleader opened 1 month ago

irleader commented 1 month ago

Hi Douwe Schulte,

Thanks a lot for developing such a great tool which benefits the whole proteomics community!

I am using several de novo peptide sequencing tools (including Casanovo), and the different tools have different output formats (csv, mztab, txt). I therefore prefer to process and filter the original output and convert it to fasta before assembly, so that I know which peptides are used for assembly.

However, a fasta file does not contain modification information, so stitch does not know which modifications are in my dataset, and I might need to modify the alphabet file to better accommodate my case. I have two questions about parameter settings:

  1. What are the suggested EnforceUnique, CutoffScore and AmbiguityThreshold values for fasta file input, for example when working on the well-known Herceptin dataset with Casanovo? I have attached my batchfile: Casanovo_herceptin_HeavyChain.txt

  2. I have only the fixed modification "C+57.021" and the variable modifications "M+15.995", "N+0.984" and "Q+0.984". As a result, the mass of modified N may equal that of D, and the mass of modified Q may equal that of E (de novo methods cannot differentiate these).

    Which alphabet file do you suggest to use (I am using mass_alphabet.txt)?

    In this file, I see that I;L;J have score 8, while Q->E; N,GG->D and the other modifications have score 3. Can I raise Q->E and N->D to score 8 or 6 and remove all other modifications, either because they do not occur in the dataset or because I believe the mass resolution of my dataset is high enough to differentiate all the rest? Could you provide a modified mass_alphabet.txt for me to learn from? I am not sure about the syntax.
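The mass overlap described above can be checked with a short calculation. The sketch below is only an illustration: it uses standard monoisotopic residue masses together with the +0.984 shift quoted above, and the function name `mass_gap` is made up for this example.

```python
# Monoisotopic residue masses in Da (standard values, rounded to 5 decimals).
RESIDUE_MASS = {
    "N": 114.04293,
    "D": 115.02694,
    "Q": 128.05858,
    "E": 129.04259,
}

# Deamidation shift as it is written in the modification strings above.
DEAMIDATION_SHIFT = 0.984


def mass_gap(modified: str, shift: float, other: str) -> float:
    """Absolute mass difference between a shifted residue and another residue."""
    return abs(RESIDUE_MASS[modified] + shift - RESIDUE_MASS[other])


for amide, acid in (("N", "D"), ("Q", "E")):
    gap = mass_gap(amide, DEAMIDATION_SHIFT, acid)
    print(f"{amide}+0.984 vs {acid}: gap of {gap:.5f} Da")
```

The remaining gap is far below what a mass spectrometer can resolve, which is why a deamidated N or Q cannot be told apart from D or E by mass alone.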

I am very much looking forward to your reply. Thanks a lot in advance!

douweschulte commented 1 month ago

Sounds like a nice approach to do some preprocessing and convert everything to fasta. It is important to know though that fasta does not contain a lot of metadata that stitch is able to read.

  1. Exactly the same as you would use for other sources of the same files.
  2. I would suggest you use the mass_alphabet.txt you are already using. I tried to use scores that encourage identical placement over isobaric placement. You can try ramping these scores a bit and see if that behaves more how you would expect it to. Below is the mass alphabet with some inline comments (all valid in the syntax) to guide your efforts a bit.

    Alphabet ->
    Characters : ARNDCQEGHILKMFPSTWYVBZXJ.*
    Identity   : 8 - position score for identity
    Mismatch   : -1
    GapStart   : -12
    GapExtend  : -1
    PatchLength: 3 - maximum size of swaps/rotations
    Swap       : 2 - per position
    
    Symmetric sets ->
        - Scores 8 for I<->L<->J
        Score: 8
        Sets :>
            I,L,J
        <:
    <-
    
    Symmetric sets ->
        - Scores 6 for isobaric sequences, these are normalised to be ordered in the same order as the alphabet to prevent rotations (AG vs GA) to be scored differently
        - Note that this allows this scoring for any occurrence of these amino acids; the local modifications are lost for all types of input peptides at this point in Stitch
        Score: 6
        Sets :>
            N,GG
            Q,AG
            AV,GL,GI,GJ
            AN,QG,AGG
            LS,IS,JS,TV
            AM,CV
            NV,AAA,GGV
            NT,QS,AGS,GGT
            LN,IN,JN,QV,AGV,GGL,GGI,GGJ
            DL,DI,DJ,EV
            QT,AAS,AGT
            AY,FS
            LQ,IQ,AAV,AGL,AGI,AGJ
            NQ,ANG,QGG
            KN,GGK
            EN,DQ,ADG,EGG
            DK,AAT,GSV
            MN,AAC,GGM
            AS,GT
            AAL,AAI,AAJ,GVV
            QQ,AAN,AQG
            EQ,AAD,AEG
            EK,ASV,GLS,GIS,GJS,GTV
            MQ,AGM,CGV
            AAQ,NGV
        <:
    <-
    
    Asymmetric sets ->
        - Scores 1 specifically for X->Anything and Anything -> X
        Score: 1
        Sets :>
            X->A,R,N,D,C,Q,E,G,H,I,L,J,K,M,F,P,S,T,W,Y,V,B,Z -Remove mismatch penalty on single gaps
            A,R,N,D,C,Q,E,G,H,I,L,J,K,M,F,P,S,T,W,Y,V,B,Z->X -Remove mismatch penalty on single gaps
        <:
    <-
    
    Asymmetric sets ->
        Score: -4
        Sets :>
            .->A,R,N,D,C,Q,E,G,H,I,L,J,K,M,F,P,S,T,W,Y,V,B,Z -Add additional penalty on bigger gaps
            A,R,N,D,C,Q,E,G,H,I,L,J,K,M,F,P,S,T,W,Y,V,B,Z->. -Add additional penalty on bigger gaps
        <:
    <-
    
    Asymmetric sets ->
        - Here are all common modifications listed, increasing the score of this set would allow these more often, you could also break this up into two different `asymmetric set` definitions, one with your modifications of interest with a high score and one with the others with a low score
        Score: 3
        Sets :>
            - Template sequence -- results -> in read sequence - Type
            Q->E -Deamidation
            N,GG->D -Deamidation
            T->D -Methylation
            S->T -Methylation
            D->E -Methylation
            R->AV,GL,GI,GJ -Methylation
            Q->AA -Methylation
            W->DS,AM,CV,TT -Oxidation
            M->F -Oxidation
            S->E -Acetylation
            K->AV,GL,GI,GJ -Acetylation/Homoarginine
        <:
    <-
    <-
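The suggestion in the inline comment, splitting the modification set into one high-scoring and one low-scoring set, could be scripted roughly as follows. This is a sketch and not part of Stitch: the helper names are hypothetical, the rules are copied from the set above, and the score of 6 is simply the value floated in the question, not a tested recommendation. The printed text only reuses syntax that already appears in the file above.

```python
# Rules copied from the "common modifications" set above, as (rule, type).
MODIFICATION_RULES = [
    ("Q->E", "Deamidation"),
    ("N,GG->D", "Deamidation"),
    ("T->D", "Methylation"),
    ("S->T", "Methylation"),
    ("D->E", "Methylation"),
    ("R->AV,GL,GI,GJ", "Methylation"),
    ("Q->AA", "Methylation"),
    ("W->DS,AM,CV,TT", "Oxidation"),
    ("M->F", "Oxidation"),
    ("S->E", "Acetylation"),
    ("K->AV,GL,GI,GJ", "Acetylation/Homoarginine"),
]


def split_by_type(rules, types_of_interest):
    """Separate the rules whose type is of interest from all the others."""
    wanted = [r for r in rules if r[1] in types_of_interest]
    rest = [r for r in rules if r[1] not in types_of_interest]
    return wanted, rest


def render_asymmetric_set(score, rules):
    """Render one asymmetric set block in the syntax shown in the file above."""
    lines = ["Asymmetric sets ->", f"    Score: {score}", "    Sets :>"]
    lines += [f"        {rule} -{kind}" for rule, kind in rules]
    lines += ["    <:", "<-"]
    return "\n".join(lines)


HIGH_SCORE = 6  # assumption: value proposed in the question, untested
LOW_SCORE = 3   # score of the original combined set

wanted, rest = split_by_type(MODIFICATION_RULES, {"Deamidation"})
print(render_asymmetric_set(HIGH_SCORE, wanted))
print(render_asymmetric_set(LOW_SCORE, rest))
```

The two printed blocks are meant to take the place of the single score 3 set inside the `Alphabet ->` definition, indented like the other sets there.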

If you have any more questions feel free to reach out!

irleader commented 1 month ago

Thanks a lot for your prompt answer! I will try with different mass_alphabet.txt. A few more questions:

  1. "It is important to know though that fasta does not contain a lot of metadata that stitch is able to read." I am curious whether stitch uses the positional amino acid scores from Casanovo. If so, using the mztab file directly might be a better idea.
  2. If there is only one protein template, EnforceUnique should be OK with any value?
  3. CutoffScore is used to filter the peptide predictions with a peptide score lower than threshold, thus for fasta file, all peptides will be included for assembly regardless of CutoffScore?
  4. AmbiguityThreshold will not affect the assembly algorithm, but only determines whether an alignment position is ambiguous or not after assembly, right?

douweschulte commented 1 month ago

  1. Yes it does.
  2. Yes it will not change anything. Adding decoy / common contaminants will add more templates though so then it matters again.
  3. Indeed, if you do preprocessing to generate the fasta file you will have to do the score filtering in that step if you want to filter.
  4. Yes, it will not change any major part of the assembly. It changes how the ambiguity graph is generated and which positions are underlined in the sequence consensus overview.
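
Since the score filter then has to happen during preprocessing (point 3), that step could look something like the sketch below. The `Prediction` type, the identifiers and the 0.5 threshold are hypothetical, and the score scale depends on the de novo tool that produced the predictions.

```python
from typing import Iterable, NamedTuple


class Prediction(NamedTuple):
    """One de novo peptide prediction (hypothetical container for this sketch)."""
    identifier: str
    peptide: str
    score: float


def to_fasta(predictions: Iterable[Prediction], min_score: float) -> str:
    """Keep predictions scoring at least min_score and format them as FASTA."""
    records = [
        f">{p.identifier}\n{p.peptide}\n"
        for p in predictions
        if p.score >= min_score
    ]
    return "".join(records)


# Made-up predictions, only to show the call.
example = [
    Prediction("scan_1", "IKEMFG", 0.95),
    Prediction("scan_2", "AAGK", 0.40),
]
print(to_fasta(example, min_score=0.5), end="")
```

Keeping the identifier in the FASTA header makes it possible to trace each assembled read back to the original prediction.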

irleader commented 1 month ago

As you mentioned that you use positional amino acid scores in the assembly, I am thinking of transforming the output files from the different de novo methods into a uniform format that also contains the amino acid scores, instead of using the fasta format. Do you have a suggested format that is compatible with the current input modes (Reads, FASTA, Peaks, mmCIF, MaxNovo, Novor, Casanovo, pNovo, Folder)?

I personally prefer the csv format that Peaks X+ outputs. It contains the following column names: Fraction,Source File,Feature,Peptide,Scan,Tag Length,Denovo Score,ALC (%),length,m/z,z,RT,Predict RT,Area,Mass,ppm,PTM,local confidence (%),tag (>=0%),mode.

If I would like to use Peaks X+ mode with a csv file transformed from output of other de novo methods, what are the minimum necessary columns I have to put in (I prefer not to use CutoffScore but do a peptide score filtering myself)?

I assume "Peptide" and "local confidence (%)" should be enough for all the information required by the stitch assembly? Or is "ALC (%)" still required because it is hard-coded in Peaks X+ mode?

I am looking forward to your feedback!

ps: by the way, the other assembly algorithm ALPS requires an input csv file with 5 columns in the following order: "Spectrum Name","Peptide","aaScore","Score" and "Area"

douweschulte commented 1 month ago

Personally I would pick the Peaks format as well, as I have the most experience with it. Stitch reads the full file with all columns and assumes them to be in order. 'Peptide', 'ALC (%)', 'local confidence (%)', and 'Area' are the ones that are actually used; the rest are retained for display in the HTML report. All other columns must contain data in the format that Peaks uses, but the data can otherwise be meaningless. The 'Area' column is used to rescore peptides based on pseudo quantitation. It is normalised between the minimal and maximal value present, so if you put the same value (say 1) on every row, this column no longer has any effect. Together these columns are very similar to the ones listed for ALPS. I have recently been thinking about allowing mzIdentML data in the next Stitch version I am building, to better cater to the use case of preselection or custom format conversion.
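
A conversion along these lines could be sketched as follows. It is an illustration under assumptions, not a description of what Stitch requires: the function names are hypothetical, the column order is taken from the list quoted earlier in this thread, and the unused columns are copied from a `template` row that should come from a genuine Peaks export so that their formats are valid. Writing the local confidence as space-separated whole numbers and the ALC as a whole number is also an assumption to verify against a real file.

```python
import csv
import io

# Column order as quoted earlier in this thread.
PEAKS_COLUMNS = [
    "Fraction", "Source File", "Feature", "Peptide", "Scan", "Tag Length",
    "Denovo Score", "ALC (%)", "length", "m/z", "z", "RT", "Predict RT",
    "Area", "Mass", "ppm", "PTM", "local confidence (%)", "tag (>=0%)", "mode",
]


def convert_row(template, peptide, alc, local_confidence, area=1):
    """Copy a template row and overwrite the four columns Stitch actually uses.

    A constant area (1 by default) is meant to neutralise the area rescoring.
    """
    row = dict(template)
    row["Peptide"] = peptide
    row["ALC (%)"] = str(round(alc))
    row["local confidence (%)"] = " ".join(str(round(c)) for c in local_confidence)
    row["Area"] = str(area)
    return row


def write_peaks_csv(rows):
    """Serialise rows to CSV text with the columns in the expected order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=PEAKS_COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Stand-in template: "?" marks values to be copied from a real Peaks export.
template = {column: "?" for column in PEAKS_COLUMNS}
row = convert_row(template, "IKEMFG", 85.0, [90, 80, 70, 95, 88, 92])
print(write_peaks_csv([row]), end="")
```

Columns derived from the peptide, such as the length columns, are left at their template values here and may need updating as well.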

irleader commented 3 weeks ago

I am now converting the results from the other methods into the mztab format from Casanovo and using Casanovo mode for the stitch assembly. I am wondering whether stitch uses the modification information from Casanovo, e.g. C+57.021, M+15.995, N+0.984, Q+0.984.

For peptide sequence from other methods with modifications, shall I keep all modification information (by converting e.g. IKEM(+15.99)FG to IKEM+15.995FG)?

If the modification information is not used, I will remove all modifications and keep the sequence as IKEMFG to make life easy and avoid conversion mistakes.

douweschulte commented 2 weeks ago

Stitch will not use the modifications for the alignments, but if you use the raw data display capabilities you will need to have the modifications in the sequences, otherwise the annotations will be wrong. On the format, it is a very loose interpretation of ProForma 2.0 so AA(mod) or AA[mod] both work, and for the mods anything unimod or straight up numbers will work. And I did quite some work to try and allow modified sequences from many sources in one function in the code so if you use sequences from different formats that all are supported by Stitch you can just leave them like they are and it should all work out.

irleader commented 2 weeks ago

Thanks for your prompt answer! I might not use the raw data display capabilities, and I would like to use a uniform format for all de novo methods. The easiest way is to remove all modifications from the sequences to avoid any potential mistake.
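
Stripping the modifications could be done with a small helper like the one below. This is a sketch that assumes residues are always written as uppercase letters and that modifications appear either inside round or square brackets or as inline signed numbers, as in the examples in this thread. The function name is hypothetical, and nested brackets are not handled.

```python
import re

# Modifications written in brackets, e.g. M(+15.99) or C[Carbamidomethyl].
_BRACKETED = re.compile(r"\([^)]*\)|\[[^\]]*\]")
# Anything left that is not an uppercase residue letter, e.g. "+15.995".
_NON_RESIDUE = re.compile(r"[^A-Z]")


def strip_modifications(sequence: str) -> str:
    """Return the bare amino acid sequence without any modification notation."""
    without_brackets = _BRACKETED.sub("", sequence)
    return _NON_RESIDUE.sub("", without_brackets)


for example in ("IKEM(+15.99)FG", "IKEM+15.995FG"):
    print(example, "->", strip_modifications(example))
```

Bracketed groups are removed first so that letters inside a modification name are not mistaken for residues.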