Open jjGG opened 3 months ago
Hi Jonas,
Which DIA-NN version are you using?
Best, Vadim
Ah, got it :) Nice catch. What happens is that WP_014262366.1 has a 'name', which is 'hypothetical', while ADW16141.1 does not have a 'name', because the word 'type' that follows the sequence ID is too short to be a name. DIA-NN tries to extract info from the header as if it were UniProt-style, which leads to this incorrect interpretation. Solutions:
Best, Vadim
Hello Vadim,
Here we used "/usr/diann/1.8.2_beta_8/linux/diann-1.8.1.8"
Thanks for the fast response and your answer. I was not aware that DIA-NN is also so sensitive to the description lines of the FASTA files. Thanks for the hint about --isoforms.
best regards jonas
Hello Vadim,
Do you suggest using:
--pg-level 0
For all non-uniprot databases?
We routinely search species-specific, sequencing-derived databases that are usually NOT from UniProt and also lack certain UniProt-like features (e.g. GN=). Or would you suggest to "adapt" them to make them more uniprot like? E.g.:
fg|ProteinAccessions|SomeName_FGCZ (can anything go here, or does it need to have a particular length?)
It is unclear to me how the rest of the description line (everything after the "space") affects the search results.
best regards jonas
For all non-uniprot databases?
For all FASTAs from which protein and gene names are not read correctly.
Or would you suggest to "adapt" them to make them more uniprot like?
Should be fairly easy with R or Python packages for handling FASTAs.
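As a minimal sketch of such an adaptation in plain Python (no FASTA package needed): the function below rewrites a header of the form `>ACCESSION description` into `>db|ACCESSION|NAME description`, following the `fg|ProteinAccessions|SomeName_FGCZ` pattern proposed above. The function names, the `fg` prefix, the `FGCZ` suffix, and the way the entry name is derived from the accession are all illustrative assumptions. Whether DIA-NN then reads protein and gene names as intended is not confirmed in this thread and should be checked on a small test search.

```python
def to_uniprot_like(header, db="fg", tag="FGCZ"):
    """Rewrite '>ACC free text' as '>db|ACC|ACC_TAG free text' (hypothetical layout)."""
    parts = header[1:].split(None, 1)
    if not parts:
        # Empty header: nothing to rewrite.
        return header
    accession = parts[0]
    description = parts[1].strip() if len(parts) > 1 else ""
    # Build an entry name from the accession; dots and pipes are replaced
    # so the 'db|accession|name' layout stays unambiguous.
    name = accession.replace(".", "_").replace("|", "_") + "_" + tag
    new_header = f">{db}|{accession}|{name}"
    if description:
        new_header += " " + description
    return new_header


def rewrite_fasta(lines, db="fg", tag="FGCZ"):
    """Yield FASTA lines with every header rewritten; sequence lines pass through."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield to_uniprot_like(line, db, tag)
        else:
            yield line


# The description text here is made up for illustration.
print(to_uniprot_like(">WP_014262366.1 some description"))
# prints: >fg|WP_014262366.1|WP_014262366_1_FGCZ some description
```

To convert a whole file, `rewrite_fasta` can be fed an open file handle and its output written line by line to a new file.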
Hello Vadim, I am a little puzzled looking at the main output from DIA-NN (report.tsv).
I always thought that the "Protein.Group" column lists the "winner" protein(s), i.e. the protein group. While this is not strictly the parsimony principle, there is a logic behind which proteins are the "winners": all peptides for a group of proteins are taken into consideration. The "Protein.Ids" column, in contrast, would report all proteins in which the peptide in question appears.
Now in one of my searches I have a FASTA file with two completely identical protein sequences (identical sequence and length) but with different accessions and header lines. (I agree that this does not make much sense, but it should not be problematic, and it can even be real that two identical proteins are encoded at different loci.) I would expect both of these proteins to be listed in the "Protein.Group" column for all the identified peptides of these proteins.
Here I find only one protein in Protein.Group (while in this column there are still sometimes two proteins separated by a semicolon?): WP_014262366.1
While in Protein.Ids: ADW16141.1;WP_014262366.1
Any explanation for this behaviour?
Best regards jonas
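To check a FASTA file for the situation described above (identical sequences under different accessions), a small Python sketch like the following could be used. It groups accessions by exact sequence and reports every group with more than one member. The function names are illustrative, and the accession is simply assumed to be the first whitespace-delimited token of the header line.

```python
from collections import defaultdict


def read_fasta(lines):
    """Yield (accession, sequence) pairs from an iterable of FASTA lines."""
    accession, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if accession is not None:
                yield accession, "".join(chunks)
            # Assumption: the accession is the first token after '>'.
            parts = line[1:].split(None, 1)
            accession = parts[0] if parts else ""
            chunks = []
        elif accession is not None:
            chunks.append(line)
    if accession is not None:
        yield accession, "".join(chunks)


def identical_sequence_groups(lines):
    """Return sorted accession lists for sequences that occur more than once."""
    by_sequence = defaultdict(list)
    for accession, sequence in read_fasta(lines):
        by_sequence[sequence.upper()].append(accession)
    return [sorted(accs) for accs in by_sequence.values() if len(accs) > 1]


# Toy input with made-up accessions and sequences.
toy = [">A1 first", "MKT", "AA", ">B2 second", "MKTAA", ">C3 third", "MKTAB"]
print(identical_sequence_groups(toy))
# prints: [['A1', 'B2']]
```

For a real database, pass an open file handle instead of the toy list.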
From report.tsv output:
File.Name Run Protein.Group Protein.Ids Protein.Names Genes PG.Quantity PG.Normalised PG.MaxLFQ Genes.Quantity Genes.Normalised Genes.MaxLFQ Genes.MaxLFQ.Unique Modified.Sequence
/scratch/DIANN_A314/WU305725/20240709_C35673_003r_S722096_Fa1_2_Group_1.mzML 20240709_C35673_003r_S722096_Fa1_2_Group_1 WP_014262366.1 ADW16141.1;WP_014262366.1 hypothetical hypothetical 82334.9 82334.9 74703.5 2.4349e+07 2.4349e+07 1.54694e+07 DIFNFISR
/scratch/DIANN_A314/WU305725/20240709_C35673_003r_S722096_Fa1_2_Group_1.mzML 20240709_C35673_003r_S722096_Fa1_2_Group_1 WP_014262366.1 ADW16141.1;WP_014262366.1 hypothetical hypothetical 82334.9 82334.9 74703.5 2.4349e+07 2.4349e+07 1.54694e+07 IAYDLILTSK
/scratch/DIANN_A314/WU305725/20240709_C35673_003r_S722096_Fa1_2_Group_1.mzML 20240709_C35673_003r_S722096_Fa1_2_Group_1 WP_014262366.1 ADW16141.1;WP_014262366.1 hypothetical hypothetical 82334.9 82334.9 74703.5 2.4349e+07 2.4349e+07 1.54694e+07 IFAGAGNDR
/scratch/DIANN_A314/WU305725/20240709_C35673_003r_S722096_Fa1_2_Group_1.mzML 20240709_C35673_003r_S722096_Fa1_2_Group_1 WP_014262366.1 ADW16141.1;WP_014262366.1 hypothetical hypothetical 82334.9 82334.9 74703.5 2.4349e+07 2.4349e+07 1.54694e+07 IFAITNNDLGEDVEK
/scratch/DIANN_A314/WU305725/20240709_C35673_011r_S722097_Fa1_3_Group_1.mzML 20240709_C35673_011r_S722097_Fa1_3_Group_1 WP_014262366.1 ADW16141.1;WP_014262366.1 hypothetical hypothetical 47893.8 47893.8 52838.7 1.27546e+07 1.27546e+07 1.53814e+07 DIFNFISR
/scratch/DIANN_A314/WU305725/20240709_C35673_011r_S722097_Fa1_3_Group_1.mzML 20240709_C35673_011r_S722097_Fa1_3_Group_1 WP_014262366.1 ADW16141.1;WP_014262366.1 hypothetical hypothetical 47893.8 47893.8 52838.7 1.27546e+07 1.27546e+07 1.53814e+07 IAYDLILTSK
/scratch/DIANN_A314/WU305725/20240709_C35673_011r_S722097_Fa1_3_Group_1.mzML 20240709_C35673_011r_S722097_Fa1_3_Group_1 WP_014262366.1 ADW16141.1;WP_014262366.1 hypothetical hypothetical 47893.8 47893.8 52838.7 1.27546e+07 1.27546e+07 1.53814e+07 IFAITNNDLGEDVEK
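To find all such cases in a report, the following Python sketch lists every protein group whose "Protein.Ids" entry contains accessions that are absent from "Protein.Group". It assumes report.tsv is tab-separated and has the two column names shown in the header above; the function name is illustrative.

```python
import csv


def groups_missing_ids(lines):
    """Return (Protein.Group, [accessions only in Protein.Ids]) pairs, deduplicated."""
    reader = csv.DictReader(lines, delimiter="\t")
    seen = set()
    result = []
    for row in reader:
        group = row["Protein.Group"].split(";")
        ids = row["Protein.Ids"].split(";")
        # Accessions that map to the peptide but were not kept in the group.
        missing = tuple(sorted(set(ids) - set(group)))
        key = (tuple(group), missing)
        if missing and key not in seen:
            seen.add(key)
            result.append((";".join(group), list(missing)))
    return result


# Reduced example using only three of the report columns.
rows = [
    "Protein.Group\tProtein.Ids\tModified.Sequence",
    "WP_014262366.1\tADW16141.1;WP_014262366.1\tDIFNFISR",
    "WP_014262366.1\tADW16141.1;WP_014262366.1\tIAYDLILTSK",
]
print(groups_missing_ids(rows))
# prints: [('WP_014262366.1', ['ADW16141.1'])]
```

For the full report, pass an open file handle, e.g. `open("report.tsv", newline="")`.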
Corresponding FASTA entries: