Closed baderd closed 4 years ago
Hi Daniel,
Thank you! Yes, indeed, protein IDs can be shared between different protein groups in DIA-NN. I have always assumed that Spectronaut's approach is the same, not sure about MaxQuant though. As far as I know, MaxQuant would typically report a lot more protein groups than Spectronaut for the same number of precursors, probably because of the respective protein grouping algorithms being very different. I think both DIA-NN and Spectronaut aim to reduce the number of proteins in each protein group and the number of groups using the maximum parsimony principle, while explaining all the peptides observed in the data.
There are several reasons why protein groups are not merged this way in DIA-NN. Suppose peptide X can originate from proteins A & B, peptide Y from B & C, and peptide Z from A & C. DIA-NN will then have three groups: (1) A;B (2) B;C (3) A;C. If these proteins were all grouped together as A;B;C, it would give the impression that each of the peptides belongs to each of the three proteins, which is not the case. I think it would be very confusing for the user, even if explained in the manual. Another reason is that such merging can theoretically lead to protein groups getting arbitrarily large and thus losing any biological meaning they might have had.
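To make the A/B/C example concrete, here is a minimal Python sketch. It is not DIA-NN's code: the mapping and both helper functions (`group_by_protein_set`, `merge_overlapping`) are hypothetical and only reproduce the toy case described above.

```python
from collections import defaultdict

# Hypothetical peptide-to-protein mapping taken from the example above.
peptide_to_proteins = {
    "X": {"A", "B"},
    "Y": {"B", "C"},
    "Z": {"A", "C"},
}


def group_by_protein_set(mapping):
    """Put peptides that map to exactly the same protein set into one group."""
    groups = defaultdict(list)
    for peptide, proteins in mapping.items():
        label = ";".join(sorted(proteins))
        groups[label].append(peptide)
    return dict(groups)


def merge_overlapping(mapping):
    """Merge every pair of groups that shares at least one accession."""
    components = []  # list of (protein_set, peptide_list), kept disjoint
    for peptide, proteins in mapping.items():
        proteins = set(proteins)
        peptides = [peptide]
        untouched = []
        for other_proteins, other_peptides in components:
            if proteins & other_proteins:
                # Overlap found: absorb the existing component.
                proteins |= other_proteins
                peptides = other_peptides + peptides
            else:
                untouched.append((other_proteins, other_peptides))
        components = untouched + [(proteins, peptides)]
    return {";".join(sorted(p)): sorted(q) for p, q in components}


separate = group_by_protein_set(peptide_to_proteins)
merged_groups = merge_overlapping(peptide_to_proteins)
print(separate)
print(merged_groups)
```

With separate groups, each label lists only the proteins a peptide can actually come from. After merging, all three peptides end up under A;B;C, which is the misleading impression described above.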
As the whole idea of protein groups (generated using the maximum parsimony algorithm) is very heuristic in nature, I would be very cautious about using protein groups in any kind of statistical inference anyway. We ourselves rely only on proteins identified and quantified using proteotypic peptides, i.e. peptides which are not shared between multiple proteins (by default DIA-NN uses the "protein = gene product" definition to infer proteotypicity). So in most cases we rely on the Genes and Gene.Normalised.Unique columns (although the latter is often recalculated after filtering out inconsistently detected/quantified peptides based on the QC injections).
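The idea of keeping only proteotypic peptides can be sketched in a few lines of Python. Everything here is an assumption for illustration: the records are made up, the helper names are hypothetical, and the plain sum is only a placeholder rollup, not the quantification DIA-NN performs.

```python
# Hypothetical precursor records: (peptide, genes the peptide maps to, quantity).
records = [
    ("PEPTIDEA", ["GENE1"], 100.0),
    ("PEPTIDEB", ["GENE1", "GENE2"], 50.0),  # shared between two genes
    ("PEPTIDEC", ["GENE2"], 30.0),
    ("PEPTIDED", ["GENE1"], 20.0),
]


def proteotypic_only(rows):
    """Keep peptides that map to exactly one gene product."""
    return [row for row in rows if len(set(row[1])) == 1]


def sum_per_gene(rows):
    """Naive rollup: add up quantities per gene (placeholder, not MaxLFQ)."""
    totals = {}
    for _, genes, quantity in rows:
        gene = genes[0]
        totals[gene] = totals.get(gene, 0.0) + quantity
    return totals


unique_rows = proteotypic_only(records)
print([row[0] for row in unique_rows])
print(sum_per_gene(unique_rows))
```

The shared peptide PEPTIDEB is dropped before any gene-level number is computed, so neither gene's value depends on how a shared signal would be split.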
By the way, if one prefers the protein grouping of some other software tool, e.g. Spectronaut, one can disable protein inference in DIA-NN; it will then report exactly the same protein groups as specified in the spectral library.
Best wishes, Vadim
Hello Vadim,
A very belated but big thank you for your detailed answer!! We were thinking heavily about it and made some further investigations.
We found the same 3 UniProt IDs being assembled into 2 different protein groups in the same sample: example_shuffeled_protein_group_single_sample.txt
This table is a DiaNN report.tsv reduced to the "File.name" and "Protein.group" columns.
Is this behavior of DiaNN also expected?
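A check like this can be scripted. The following Python sketch lists, per file, every accession that occurs in more than one distinct protein group. The inline table is made-up stand-in data (it is not the attached file), the column names are the ones quoted above, and `shared_accessions` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# Made-up stand-in for a report reduced to two columns (tab-separated).
report_text = (
    "File.name\tProtein.group\n"
    "run1.raw\tP1;P2;P3\n"
    "run1.raw\tP1;P2;P3\n"
    "run1.raw\tP1;P2\n"
    "run1.raw\tP4\n"
    "run2.raw\tP1;P2;P3\n"
    "run2.raw\tP4\n"
)


def shared_accessions(tsv_text):
    """Map file -> accession -> groups, for accessions seen in >1 group."""
    groups_by_key = defaultdict(set)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        group = row["Protein.group"]
        for accession in group.split(";"):
            groups_by_key[(row["File.name"], accession)].add(group)

    result = defaultdict(dict)
    for (file_name, accession), groups in groups_by_key.items():
        if len(groups) > 1:
            result[file_name][accession] = sorted(groups)
    return dict(result)


for file_name, hits in shared_accessions(report_text).items():
    for accession, groups in hits.items():
        print(file_name, accession, groups)
```

For a real report, the same function could be fed the file contents read from disk, provided the column headers match those assumed here.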
Kind regards, Daniel
Hi Daniel,
Many thanks for reporting this! No, it's not by design; apparently it is due to a bug in the protein grouping algorithm.
Actually, the next version of DIA-NN, which is currently in the making, has significantly improved protein grouping. I will look into the reason why the problem you've described happens and will make sure it does not manifest in this next version.
Best wishes,
Vadim
Great news! Looking forward to the next version! Could you mention this in the release notes, please?
Yes, I will. Protein grouping is one of the two main changes in the new version.
That is great to hear!!! Could you give a rough estimate for the next release? weeks or months? :-) Looking forward!
Maybe just several days :) Everything's basically ready, we just need to do some more benchmarks, update documentation etc. Currently we are also preparing to run some covid patients' plasma, so things are a bit hectic with other projects and the timeline is not very clear. Btw, the release will also feature a much better protein quantification strategy, based on MaxLFQ (method by Jürgen Cox and colleagues). It will be a 'development build', as usual recently.
Good luck!
Fyi, I also opened a general discussion in the R Bioconductor forum: https://support.bioconductor.org/p/129466/
Released DIA-NN 1.7.10. New protein grouping + MaxLFQ for protein quantification.
I looked in detail at what Spectronaut is doing, and how it makes sure UniProt IDs are not shared between protein groups. Basically, if protein A features peptides X, Y, Z, while protein B features peptides Z and U, what's going to happen is that X, Y and Z get assigned to A only (as A has more peptides identified in total), while B gets only U.
Advantage of such an approach: fewer protein IDs in each group, a slightly lower number of groups in general. Disadvantage: protein groups are less reliable. In the example above, peptide Z will be used to quantify protein A, while it might actually have originated mostly from protein B. Note that both A and B are also identified with proteotypic peptides, so both are present in the sample.
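The assignment rule described above can be written down as a short Python sketch. It follows only the description in this thread, not Spectronaut's actual implementation; `assign_to_largest` is a hypothetical name and the alphabetical tie-break is an arbitrary assumption.

```python
from collections import defaultdict

# Proteins and their identified peptides, as in the example above.
protein_to_peptides = {
    "A": {"X", "Y", "Z"},
    "B": {"Z", "U"},
}


def assign_to_largest(mapping):
    """Give each peptide to the candidate protein with the most peptides."""
    candidates = defaultdict(list)
    for protein, peptides in mapping.items():
        for peptide in peptides:
            candidates[peptide].append(protein)

    assignment = defaultdict(set)
    for peptide, proteins in candidates.items():
        # Most peptides wins; ties fall back to alphabetical order
        # (an arbitrary choice made for this sketch).
        winner = min(proteins, key=lambda p: (-len(mapping[p]), p))
        assignment[winner].add(peptide)
    return {protein: sorted(peps) for protein, peps in assignment.items()}


print(assign_to_largest(protein_to_peptides))
```

Here the shared peptide Z ends up with A only, so it contributes to A's quantity even if most of its signal came from B.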
I will think about implementing this as an option in future DIA-NN versions.
With DIA-NN's current algorithm, there would be groups A (quantified with X and Y), B (quantified with U) and A;B (quantified with Z). So DIA-NN would quantify both A and B using proteotypic peptides only.
Thanks for the clarification! I interpret Spectronaut the same way, good to have it confirmed.
Your current algorithm also sounds very logical to me. "Using peptides only once and most specifically", if I may say so.
My release alert is on...
Hello,
Thank you very much for the great work! On both the algorithmic side (runtime, variance, number of IDs) and the software management side (open source, CC license, issues on GitHub), you put academic proteomics software on a new level.
Problem description
We noticed that in the DiaNN output a UniProt accession can be shared between different protein groups. This is different from what we would typically expect from other tools (like Spectronaut or MaxQuant). To our knowledge, such overlapping protein groups would be merged further there, e.g. retaining the minimum set of accessions required to explain all observed peptides/precursors.
Questions
We are currently using DiaNN "v1.7.6".
Best, Daniel