Open irleader opened 1 month ago
Sounds like a nice approach to do some preprocessing and convert everything to FASTA. It is important to know, though, that FASTA does not carry much of the metadata that Stitch is able to read.
I would suggest you use the mass_alphabet.txt you are already using. I tried to use scores that encourage identical placement over isobaric placement. You can try ramping these scores a bit and see if that behaves more how you would expect it to. Below is the mass alphabet with some inline comments (all valid in the syntax) to guide your efforts a bit.
Alphabet ->
Characters : ARNDCQEGHILKMFPSTWYVBZXJ.*
Identity : 8 - position score for identity
Mismatch : -1
GapStart : -12
GapExtend : -1
PatchLength: 3 - maximum size of swaps/rotations
Swap : 2 - per position
Symmetric sets ->
- Scores 8 for I<->L<->J
Score: 8
Sets :>
I,L,J
<:
<-
Symmetric sets ->
- Scores 6 for isobaric sequences, these are normalised to be ordered in the same order as the alphabet to prevent rotations (AG vs GA) from being scored differently
- Note that this allows this scoring for any occurrence of these amino acids, the local modifications are lost for all types of input peptides at this point in Stitch
Score: 6
Sets :>
N,GG
Q,AG
AV,GL,GI,GJ
AN,QG,AGG
LS,IS,JS,TV
AM,CV
NV,AAA,GGV
NT,QS,AGS,GGT
LN,IN,JN,QV,AGV,GGL,GGI,GGJ
DL,DI,DJ,EV
QT,AAS,AGT
AY,FS
LQ,IQ,AAV,AGL,AGI,AGJ
NQ,ANG,QGG
KN,GGK
EN,DQ,ADG,EGG
DK,AAT,GSV
MN,AAC,GGM
AS,GT
AAL,AAI,AAJ,GVV
QQ,AAN,AQG
EQ,AAD,AEG
EK,ASV,GLS,GIS,GJS,GTV
MQ,AGM,CGV
AAQ,NGV
<:
<-
Asymmetric sets ->
- Scores 1 specifically for X->Anything and Anything -> X
Score: 1
Sets :>
X->A,R,N,D,C,Q,E,G,H,I,L,J,K,M,F,P,S,T,W,Y,V,B,Z -Remove mismatch penalty on single gaps
A,R,N,D,C,Q,E,G,H,I,L,J,K,M,F,P,S,T,W,Y,V,B,Z->X -Remove mismatch penalty on single gaps
<:
<-
Asymmetric sets ->
Score: -4
Sets :>
.->A,R,N,D,C,Q,E,G,H,I,L,J,K,M,F,P,S,T,W,Y,V,B,Z -Add additional penalty on bigger gaps
A,R,N,D,C,Q,E,G,H,I,L,J,K,M,F,P,S,T,W,Y,V,B,Z->. -Add additional penalty on bigger gaps
<:
<-
Asymmetric sets ->
- Here are all common modifications listed, increasing the score of this set would allow these more often, you could also break this up into two different `asymmetric set` definitions, one with your modifications of interest with a high score and one with the others with a low score
Score: 3
Sets :>
- Template sequence -- results -> in read sequence - Type
Q->E -Deamidation
N,GG->D -Deamidation
T->D -Methylation
S->T -Methylation
D->E -Methylation
R->AV,GL,GI,GJ -Methylation
Q->AA -Methylation
W->DS,AM,CV,TT -Oxidation
M->F -Oxidation
S->E -Acetylation
K->AV,GL,GI,GJ -Acetylation/Homoarginine
<:
<-
<-
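To make the isobaric sets above a bit more tangible, here is a minimal Python sketch that checks whether the sequences on one `Sets` line really add up to the same mass. It is not part of Stitch. The residue masses are the standard monoisotopic values (my addition, not taken from the alphabet file), C is assumed to be unmodified, and the helper names `sequence_mass` and `mass_spread` are made up for this illustration.

```python
# Standard monoisotopic residue masses in Da (not taken from mass_alphabet.txt).
# Assumption: C is unmodified cysteine; J is treated like L/I.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "J": 113.08406, "N": 114.04293, "D": 115.02694,
    "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333,
    "W": 186.07931,
}


def sequence_mass(sequence):
    """Sum the residue masses of a short amino acid sequence."""
    return sum(RESIDUE_MASS[residue] for residue in sequence)


def mass_spread(set_line):
    """Return max - min mass (Da) over one comma separated set line."""
    masses = [sequence_mass(seq) for seq in set_line.split(",")]
    return max(masses) - min(masses)


if __name__ == "__main__":
    # A few lines copied from the symmetric isobaric set above.
    for line in ["N,GG", "Q,AG", "AV,GL,GI,GJ", "DK,AAT,GSV"]:
        print(f"{line}: spread {mass_spread(line):.5f} Da")
```

A spread close to zero means the members of that line cannot be told apart by mass alone, which is why they get a higher score than a plain mismatch.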
If you have any more questions feel free to reach out!
Thanks a lot for your prompt answer! I will try a different mass_alphabet.txt. A few more questions:
As you mentioned that you use positional amino acid scores in the assembly, I am thinking of transforming the output files from different de novo methods into a uniform format that also contains amino acid score information, instead of using the FASTA format. Do you have a suggested format that is compatible with the current input modes (Reads, FASTA, Peaks, mmCIF, MaxNovo, Novor, Casanovo, pNovo, Folder)?
I personally prefer the CSV format that Peaks X+ outputs. It contains the following column names: Fraction,Source File,Feature,Peptide,Scan,Tag Length,Denovo Score,ALC (%),length,m/z,z,RT,Predict RT,Area,Mass,ppm,PTM,local confidence (%),tag (>=0%),mode.
If I would like to use Peaks X+ mode with a CSV file converted from the output of other de novo methods, what are the minimum necessary columns I have to include (I prefer not to use CutoffScore but to do the peptide score filtering myself)?
I assume "Peptide" and "local confidence (%)" should be enough for all the information required by the Stitch assembly? Or is "ALC (%)" still a must because it is hard-coded in Peaks X+ mode?
I am looking forward to your feedback!
ps: by the way, the other assembly algorithm ALPS requires an input csv file with 5 columns in the following order: "Spectrum Name","Peptide","aaScore","Score" and "Area"
Personally I would pick the Peaks format as well, as I have the most experience using that. Stitch will read in the full file with all columns and assumes them to be in order. 'Peptide', 'ALC (%)', 'local confidence (%)', and 'area' are the ones that are actually used; the rest is retained for display in the HTML report. For all other columns there has to be data in the format that Peaks uses, but the data can otherwise be meaningless. The 'area' column is used to rescore peptides based on the pseudo quantitation. This column is normalised between the minimal and maximal value present, so if you put in the same value (say 1) on every row you remove any effect of this column. Together these columns are very similar to the ones listed for ALPS. I have recently been thinking about allowing mzIdentML data in the next Stitch version I am building, to better cater to the use case of preselection or custom format conversion.
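As a rough illustration of the above, here is a Python sketch that writes a Peaks-style CSV with the columns in the order listed earlier in this thread. Only 'Peptide', 'ALC (%)', 'local confidence (%)' and 'Area' carry real information; 'Area' is set to 1 on every row so it has no effect. Everything in `PLACEHOLDERS` is a guess at what dummy data could look like and has not been checked against the Stitch parser, and the space separated local confidence layout and the ALC as the rounded mean of the local confidences are likewise assumptions. The function names are made up for this example.

```python
import csv
import io

# Column order as listed for the Peaks X+ export earlier in this thread.
PEAKS_COLUMNS = [
    "Fraction", "Source File", "Feature", "Peptide", "Scan", "Tag Length",
    "Denovo Score", "ALC (%)", "length", "m/z", "z", "RT", "Predict RT",
    "Area", "Mass", "ppm", "PTM", "local confidence (%)", "tag (>=0%)", "mode",
]

# Hypothetical dummy values for the columns that are only displayed.
# Assumption: these merely need to look like Peaks data; not verified.
PLACEHOLDERS = {
    "Fraction": "1", "Source File": "converted.raw", "Feature": "F1:1",
    "Scan": "F1:1", "Denovo Score": "0", "m/z": "0", "z": "1", "RT": "0",
    "Predict RT": "0", "Mass": "0", "ppm": "0", "PTM": "", "mode": "CID",
}


def peaks_row(peptide, local_confidence, area=1):
    """Build one row for an unmodified peptide and its per residue scores."""
    if len(peptide) != len(local_confidence):
        raise ValueError("need exactly one confidence value per residue")
    row = dict(PLACEHOLDERS)
    row.update({
        "Peptide": peptide,
        "Tag Length": len(peptide),
        "length": len(peptide),
        # Assumption: ALC is the rounded mean of the local confidences.
        "ALC (%)": round(sum(local_confidence) / len(local_confidence)),
        "Area": area,  # same value on every row, so no rescoring effect
        "local confidence (%)": " ".join(str(c) for c in local_confidence),
        "tag (>=0%)": peptide,
    })
    return row


def write_peaks_csv(records, handle):
    """Write (peptide, local_confidence) pairs as a Peaks-style CSV."""
    writer = csv.DictWriter(handle, fieldnames=PEAKS_COLUMNS)
    writer.writeheader()
    for peptide, local_confidence in records:
        writer.writerow(peaks_row(peptide, local_confidence))


if __name__ == "__main__":
    buffer = io.StringIO()
    write_peaks_csv([("IKEMFG", [90, 80, 70, 60, 50, 40])], buffer)
    print(buffer.getvalue())
```

Doing the score filtering before writing the rows keeps CutoffScore out of the picture, since only peptides you have already accepted end up in the file.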
I am now converting the results from all other methods into the mzTab format from Casanovo and use Casanovo mode for the Stitch assembly. I am wondering whether Stitch uses the modification information from Casanovo, e.g. C+57.021, M+15.995, N+0.984, Q+0.984.
For peptide sequences from other methods with modifications, should I keep all modification information (by converting e.g. IKEM(+15.99)FG to IKEM+15.995FG)?
If the modification information is not used, I will remove all modifications and keep the sequence as IKEMFG to make life easy and avoid any conversion mistakes.
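The conversion mentioned here (IKEM(+15.99)FG to IKEM+15.995FG) could be sketched in Python roughly as follows. This is only an illustration: the list of mass shifts is the one quoted in this post, the 0.01 Da matching tolerance is an arbitrary assumption, and `to_casanovo_style` is a made-up helper name, not something from Stitch or Casanovo.

```python
import re

# Mass shift tokens as quoted in this thread for Casanovo output.
CASANOVO_DELTAS = ["+57.021", "+15.995", "+0.984"]

# Matches a parenthesised numeric mass shift such as "(+15.99)".
PAREN_MOD = re.compile(r"\(([+-]?\d+(?:\.\d+)?)\)")


def to_casanovo_style(peptide, tolerance=0.01):
    """Rewrite "(+15.99)" style shifts to the closest known token.

    Assumption: a shift within `tolerance` Da of a known token is the same
    modification. Unknown shifts raise instead of being guessed.
    """
    def replace(match):
        delta = float(match.group(1))
        for token in CASANOVO_DELTAS:
            if abs(float(token) - delta) <= tolerance:
                return token
        raise ValueError(f"unknown mass shift {match.group(1)} in {peptide}")

    return PAREN_MOD.sub(replace, peptide)


if __name__ == "__main__":
    print(to_casanovo_style("IKEM(+15.99)FG"))
```

Raising on an unknown shift rather than passing it through makes conversion mistakes visible early.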
Stitch will not use the modifications for the alignments, but if you use the raw data display capabilities you will need to have the modifications in the sequences, otherwise the annotations will be wrong. On the format: it is a very loose interpretation of ProForma 2.0, so AA(mod) or AA[mod] both work, and for the mods anything Unimod or straight-up numbers will work. I also did quite some work to allow modified sequences from many sources in one function in the code, so if you use sequences from different formats that are all supported by Stitch you can just leave them as they are and it should all work out.
Thanks for your prompt answer! I might not use the raw data display capabilities and I would like to use a uniform format for all de novo methods. The easiest way is to remove all modifications from the sequences to avoid any potential mistake.
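Stripping the modifications could look roughly like this in Python. It is a minimal sketch that assumes modifications only appear in the three notations mentioned in this thread: AA(mod), AA[mod], and bare Casanovo-style mass shifts such as M+15.995. `strip_modifications` is a made-up name for this example.

```python
import re

# Removes "(...)" blocks, "[...]" blocks, and bare signed numbers like "+15.995".
MOD_PATTERN = re.compile(r"\([^)]*\)|\[[^\]]*\]|[+-]\d+(?:\.\d+)?")


def strip_modifications(peptide):
    """Return the bare amino acid sequence without any modification text."""
    return MOD_PATTERN.sub("", peptide)


if __name__ == "__main__":
    for peptide in ["IKEM(+15.99)FG", "IKEM+15.995FG", "IKEM[Oxidation]FG"]:
        print(peptide, "->", strip_modifications(peptide))
```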
Hi Douwe Schulte,
Thanks a lot for developing such a great tool which benefits the whole proteomics community!
I am using several de novo peptide sequencing tools (including Casanovo), and the different tools have different output formats (csv, mztab, txt). I therefore prefer to process and filter the original output and convert it to FASTA before assembly, so that I know which peptides are used for the assembly.
However, a FASTA file does not contain modification information, so Stitch does not know which modifications there are in my dataset, and I might need to modify the alphabet file to better accommodate my case. I have two parameter setting questions for you:
What are the suggested EnforceUnique, CutoffScore and AmbiguityThreshold for FASTA file input, for example when working on the well-known Herceptin dataset with Casanovo? I have attached my batch file: Casanovo_herceptin_HeavyChain.txt
I only have the fixed modification "C+57.021", as well as the variable modifications "M+15.995", "N+0.984" and "Q+0.984". As a result the mass of N might be equal to D, and the mass of Q might be equal to E (de novo methods cannot differentiate these).
Which alphabet file do you suggest I use (I am using mass_alphabet.txt)?
In this file, I see that I, L and J have score 8, while Q->E, N,GG->D and the other modifications have score 3. Can I lift Q->E and N->D to score 8 or 6 and remove all other modifications, either because there are no such modifications in the dataset or because I believe my dataset has high enough mass resolution to differentiate all the rest? Can you provide a modified mass_alphabet.txt for me to learn from? I am not sure about the syntax.
I am very much looking forward to your reply. Thanks a lot in advance!
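To illustrate why deamidated N and Q are a concern here, a small Python check of the mass arithmetic is given below. The residue masses are the standard monoisotopic values (my addition, not from the alphabet file) and the +0.984 shift is the one listed above; `residual_difference` is a made-up helper name.

```python
# Standard monoisotopic residue masses in Da (not from mass_alphabet.txt).
RESIDUE_MASS = {"N": 114.04293, "D": 115.02694, "Q": 128.05858, "E": 129.04259}

# Mass shift as listed for the variable modifications above.
DEAMIDATION_SHIFT = 0.984


def residual_difference(modified, target):
    """Mass difference (Da) left between a deamidated residue and its target."""
    return abs(RESIDUE_MASS[modified] + DEAMIDATION_SHIFT - RESIDUE_MASS[target])


if __name__ == "__main__":
    print(f"N+0.984 vs D: {residual_difference('N', 'D'):.5f} Da")
    print(f"Q+0.984 vs E: {residual_difference('Q', 'E'):.5f} Da")
```

Both differences are far below what a mass measurement can resolve, so N+0.984 and D (and Q+0.984 and E) are indistinguishable by mass alone.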