vdemichev / DiaNN

DIA-NN - a universal automated software suite for DIA proteomics data analysis.

Uniprot IDs appear in multiple protein groups per RAW file #22

Closed baderd closed 4 years ago

baderd commented 4 years ago

Hello,

Thank you very much for the great work! On both the algorithmic side (runtime, variance, number of IDs) and the software management side (open source, CC license, issues on GitHub), you have put academic proteomics software on a new level.

Problem description

We noticed that in the DiaNN output a UniProt accession can be shared between different protein groups. This differs from what we would typically expect from other tools (like Spectronaut or MaxQuant). To our knowledge, such overlapping protein groups would be merged further there, e.g. retaining the minimum set of accessions required to explain all observed peptides/precursors.

Questions

  1. Is there a specific reason why you decided against such further aggregation? At least in our data, the number of distinct protein groups would then tend to overestimate the actual number of proteins, which might negatively affect downstream analyses such as significance tests.
  2. How do you handle the relationship between UniProt accessions and protein groups in your own data? Or do you perhaps use further aggregated data in your downstream analyses instead of the "Protein.Group" column from the "report.tsv" file?

We are currently using DiaNN "v1.7.6".

Best, Daniel

vdemichev commented 4 years ago

Hi Daniel,

Thank you! Yes, indeed, protein IDs can be shared between different protein groups in DIA-NN. I have always assumed that Spectronaut's approach is the same, not sure about MaxQuant though. As far as I know, MaxQuant would typically report a lot more protein groups than Spectronaut for the same number of precursors, probably because of the respective protein grouping algorithms being very different. I think both DIA-NN and Spectronaut aim to reduce the number of proteins in each protein group and the number of groups using the maximum parsimony principle, while explaining all the peptides observed in the data.

There are several reasons why protein groups are not merged this way in DIA-NN. Suppose peptide X can originate from proteins A & B, peptide Y from B & C, and peptide Z from A & C. DIA-NN will then have three groups: (1) A;B (2) B;C (3) A;C. If these proteins were all grouped together as A;B;C, it would give the impression that each of the peptides belongs to each of the three proteins, which is not the case. I think it would be very confusing for the user, even if explained in the manual. Another reason is that such merging can theoretically lead to protein groups getting arbitrarily large and thus losing any biological meaning they might have had.
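To make the X/Y/Z example concrete, here is a minimal Python sketch. It is not DIA-NN's actual grouping code; the function name `group_by_protein_set` is made up for illustration. It only shows that keying peptides by their exact set of candidate proteins yields the three groups above, whereas a merged A;B;C group would no longer say which peptide maps to which proteins.

```python
from collections import defaultdict


def group_by_protein_set(peptide_to_proteins):
    """Group peptides by the exact set of proteins they can originate from.

    Hypothetical helper for illustration only, not DIA-NN's implementation.
    """
    groups = defaultdict(list)
    for peptide, proteins in peptide_to_proteins.items():
        key = ";".join(sorted(proteins))  # e.g. {"B", "A"} -> "A;B"
        groups[key].append(peptide)
    return dict(groups)


# The example from the comment above: three peptides, each shared by two proteins.
example = {"X": {"A", "B"}, "Y": {"B", "C"}, "Z": {"A", "C"}}
groups = group_by_protein_set(example)

# Merging everything into one group loses the peptide-to-protein mapping:
merged = {"A;B;C": sorted(example)}

print(groups)
print(merged)
```

In the merged form, nothing indicates that peptide X cannot come from protein C, which is the confusion described above.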

As the whole idea of protein groups (generated using the maximum parsimony algorithm) is very heuristic in nature, I would be very cautious about using protein groups in any kind of statistical inference anyway. We ourselves rely only on proteins identified and quantified using proteotypic peptides, i.e. peptides which are not shared between multiple proteins (by default DIA-NN uses the "protein = gene product" definition to infer proteotypicity). So in most cases we rely on the Genes and Gene.Normalised.Unique columns (although the latter is often recalculated after filtering out inconsistently detected/quantified peptides based on the QC injections).

By the way, if one prefers the protein grouping of some other software tool, e.g. Spectronaut: when protein inference is disabled in DIA-NN, it reports exactly the protein groups specified in the spectral library.

Best wishes, Vadim

baderd commented 4 years ago

Hello Vadim,

A very belated but big thank you for your detailed answer! We thought about it a lot and investigated further.

We found the same 3 UniProt IDs being assembled into 2 different protein groups in the same sample: example_shuffeled_protein_group_single_sample.txt

This table is a DiaNN report.tsv subset to only the "File.Name" and "Protein.Group" columns.

Is this behavior of DiaNN also expected?
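For anyone who wants to run a similar check on their own output, a rough Python sketch follows. The column names `File.Name` and `Protein.Group`, the semicolon separator inside a group, and the demo rows are all assumptions made for illustration; the attached file itself is not reproduced here. The helper `shared_accessions` is hypothetical and simply lists, per file, every accession that occurs in more than one protein group.

```python
import csv
import io
from collections import defaultdict


def shared_accessions(rows, file_col="File.Name", group_col="Protein.Group"):
    """Return {(file, accession): [groups]} for accessions seen in >1 group.

    Column names are assumptions; adjust them to match your report.
    """
    seen = defaultdict(set)
    for row in rows:
        for accession in row[group_col].split(";"):
            seen[(row[file_col], accession)].add(row[group_col])
    return {key: sorted(groups) for key, groups in seen.items() if len(groups) > 1}


# Made-up rows standing in for a subsetted report.tsv.
demo_tsv = (
    "File.Name\tProtein.Group\n"
    "run1.raw\tP1;P2\n"
    "run1.raw\tP2;P3\n"
    "run1.raw\tP4\n"
    "run2.raw\tP1;P2\n"
)
rows = csv.DictReader(io.StringIO(demo_tsv), delimiter="\t")
overlaps = shared_accessions(rows)
print(overlaps)
```

For a real report, the same function could be fed `csv.DictReader(open(path, newline=""), delimiter="\t")` in place of the in-memory demo.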

Kind regards, Daniel

vdemichev commented 4 years ago

Hi Daniel,

Many thanks for reporting this! No, it's not by design; apparently this is due to a bug in the protein grouping algorithm.

Actually, the next version of DIA-NN, which is currently in the making, has significantly improved protein grouping. I will look into why the problem you've described happens and will make sure it does not manifest in this next version.

Best wishes,

Vadim

baderd commented 4 years ago

Great news! Looking forward to the next version! Could you mention this in the release notes, please?

vdemichev commented 4 years ago

Yes, I will. Protein grouping is one of the two main changes in the new version.

baderd commented 4 years ago

That is great to hear!!! Could you give a rough estimate for the next release? weeks or months? :-) Looking forward!

vdemichev commented 4 years ago

Maybe just several days :) Everything's basically ready, I just need to do some more benchmarks, update the documentation etc. Currently we are also preparing to run some COVID patients' plasma, so things are a bit hectic with other projects and the timeline is not very clear. Btw, the release will also feature a much better protein quantification strategy, based on MaxLFQ (method by Jürgen Cox and colleagues). It will be a 'development build', as has been usual recently.

baderd commented 4 years ago

Good luck!

FYI, I also opened a general discussion in the R Bioconductor forum: https://support.bioconductor.org/p/129466/

vdemichev commented 4 years ago

Released DIA-NN 1.7.10. New protein grouping + MaxLFQ for protein quantification.

vdemichev commented 4 years ago

I looked in detail at what Spectronaut is doing, and how it makes sure UniProt IDs are not shared between protein groups. Basically, if protein A features peptides X, Y, Z, while protein B features peptides Z and U, what's going to happen is that X, Y and Z get assigned to A only (as A has more peptides identified in total), while B gets only U.

Advantage of such an approach: fewer protein IDs in each group, and a slightly lower number of groups in general. Disadvantage: protein groups are less reliable. In the example above, peptide Z will be used to quantify protein A, while it might actually have originated mostly from protein B. Note that both A and B are also identified with proteotypic peptides, so both are present in the sample.
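The assignment rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this comment (each shared peptide goes to the candidate protein with the most identified peptides); it is not Spectronaut's implementation, and the function name and the tie-breaking behaviour (first protein seen wins) are assumptions.

```python
def assign_to_largest_protein(protein_to_peptides):
    """Assign every peptide to the one candidate protein with the most peptides.

    Hypothetical sketch of the rule described above; ties keep the protein
    that was seen first, which is an arbitrary choice made for illustration.
    """
    owner = {}
    for protein, peptides in protein_to_peptides.items():
        for peptide in peptides:
            current = owner.get(peptide)
            if current is None or len(peptides) > len(protein_to_peptides[current]):
                owner[peptide] = protein

    result = {protein: [] for protein in protein_to_peptides}
    for peptide, protein in owner.items():
        result[protein].append(peptide)
    return {protein: sorted(peptides) for protein, peptides in result.items()}


# Protein A has peptides X, Y, Z; protein B has peptides Z and U.
assigned = assign_to_largest_protein({"A": {"X", "Y", "Z"}, "B": {"Z", "U"}})
print(assigned)
```

Here the shared peptide Z ends up quantifying A only, which is exactly the reliability trade-off noted above.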

I will think about implementing this as an option in future DIA-NN versions.

With DIA-NN's current algorithm, there would be groups A (quantified with X and Y), B (quantified with U) and A;B (quantified with Z). So DIA-NN would quantify both A and B using proteotypic peptides only.
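A small Python sketch of this outcome for the A/B example, under the assumption that a group is simply the set of proteins a peptide can originate from. It is not DIA-NN's actual code, and `groups_from_proteins` is a made-up name. Single-protein groups are treated as the ones quantified with proteotypic peptides.

```python
from collections import defaultdict


def groups_from_proteins(protein_to_peptides):
    """Invert protein -> peptides and group peptides by their protein set.

    Hypothetical helper for illustration only.
    """
    peptide_to_proteins = defaultdict(set)
    for protein, peptides in protein_to_peptides.items():
        for peptide in peptides:
            peptide_to_proteins[peptide].add(protein)

    groups = defaultdict(list)
    for peptide, proteins in peptide_to_proteins.items():
        groups[";".join(sorted(proteins))].append(peptide)
    return {group: sorted(peptides) for group, peptides in groups.items()}


# Protein A has peptides X, Y, Z; protein B has peptides Z and U.
groups = groups_from_proteins({"A": {"X", "Y", "Z"}, "B": {"Z", "U"}})

# Groups containing a single protein hold only proteotypic peptides.
proteotypic = {g: p for g, p in groups.items() if ";" not in g}

print(groups)
print(proteotypic)
```

The shared peptide Z stays in its own A;B group, so neither A nor B is quantified with it.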

baderd commented 4 years ago

Thanks for the clarification! I interpret Spectronaut the same way, good to have it confirmed.

Your current algorithm also sounds very logical to me. "Using peptides only once and most specifically", if I may say so.

My release alert is on...