vdemichev / DiaNN

DIA-NN - a universal automated software suite for DIA proteomics data analysis.

Creating full reports in DIA-NN #24

Closed msmedus closed 4 years ago

msmedus commented 4 years ago

Hi Vadim,

First of all, congrats on creating this software, it really works very well! I have two questions for you (I am using version 1.7.6).

First, I would like to create full reports (no 0s or NaNs), and for this I set both the precursor q-value and the protein q-value to 1. The idea is to get a full matrix in which all missing values are 'imputed' (similar to the q-value filtering option in Spectronaut). However, the output still contains some 0s and I cannot really explain why. I also tried disabling protein inference, but it did not help either. Any idea why this is the case? Is there any solution to this problem?

The second question is related to the reported number of protein groups. I did some tests in which I analysed DIA raw files either with a spectral library (created with SN) or using your library-free search option (with deep learning enabled, human SwissProt FASTA with isoforms). In both cases, I enabled protein inference, and for the library-based search I used the command '--library-headers ,,,,,,ProteinGroups' to get gene name information in the output report. For the protein grouping, I kept the default 'genes' option. While the number of reported protein groups with the library-based search appears credible, in the library-free search it is heavily inflated: there are 12,000 or more reported protein groups, and it is unclear to me why this is the case. I would assume that the protein grouping algorithm should be the same for both library-based and library-free searches if protein inference is enabled? Thanks for clarifying!

best, Martin

vdemichev commented 4 years ago

Hi Martin,

Thank you for your interest in DIA-NN!

DIA-NN indeed does not report identifications for precursors that are very low confidence (and yes, Spectronaut does). This is by design and a consequence of how the algorithms of DIA-NN work. It would of course be possible to assign each such precursor to a random retention time at which there is at least some signal. However, that would result in quantities so unreliable that they are effectively no better than replacing NAs with random numbers (and a lot worse than using minimal value imputation).

Zeroes as quantities are a consequence of the fragment selection algorithm. DIA-NN selects the top 3 fragments to be used for quantification in a cross-run manner, which helps to deal with interferences. However, it can happen that in some runs none of these 3 fragments is detected in the data, leading to 0 as the quantity. This 0 should be treated as the best guess for the quantity (i.e. it might be that the precursor is indeed present at a very low level). However, it is often a sign of misidentification (and it's very rare to see this at FDR < 1%), so you can also replace it with NA.
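Replacing zero quantities with NA could be sketched as below. This is only an illustration on a small in-memory table, not part of DIA-NN: the function name `zeros_to_na`, the precursor names, and the layout (one list of per-run quantities per precursor, with `None` standing in for NA) are all assumptions made for the example.

```python
from typing import Optional

# Assumed layout: precursor name -> quantities, one entry per run.
Matrix = dict[str, list[Optional[float]]]


def zeros_to_na(quantities: Matrix) -> Matrix:
    """Return a copy in which exact-zero quantities become None (NA)."""
    return {
        precursor: [None if q == 0 else q for q in runs]
        for precursor, runs in quantities.items()
    }


# Hypothetical precursors and values, for illustration only.
example = {
    "PEPTIDEK2": [1200.0, 0.0, 950.0],
    "ANOTHERR3": [0.0, 0.0, 310.5],
}
cleaned = zeros_to_na(example)
print(cleaned)
```

The input table is left unchanged, so the zeroes remain available if you prefer to keep them as the best-guess quantity.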

Protein inference does not affect precursor IDs/quantities. To get a full table, I would suggest using minimal value imputation (works at any FDR threshold). That's what we've been using for most of our experiments (on protein level, but one can also use it on precursor level).
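A minimal sketch of minimal value imputation, under stated assumptions: each missing value is filled with the lowest quantity observed for the same protein (or precursor) across runs, falling back to the lowest value in the whole table when a row has no observations at all. Whether to use the per-row or the global minimum is a choice made for this example, not something specified above, and `impute_min` is a hypothetical helper name.

```python
from typing import Optional

# Assumed layout: protein (or precursor) name -> quantities, one per run.
Matrix = dict[str, list[Optional[float]]]


def impute_min(quantities: Matrix) -> dict[str, list[float]]:
    """Fill None entries with the row minimum, or the global minimum
    if the row has no observed values (assumption for this sketch)."""
    observed = [q for runs in quantities.values() for q in runs if q is not None]
    if not observed:
        raise ValueError("no observed quantities to impute from")
    global_min = min(observed)

    imputed = {}
    for name, runs in quantities.items():
        present = [q for q in runs if q is not None]
        fill = min(present) if present else global_min
        imputed[name] = [fill if q is None else q for q in runs]
    return imputed


# Hypothetical protein-level table with missing values.
proteins = {
    "P1": [5.0, None, 3.0],
    "P2": [None, None, None],
    "P3": [8.0, 2.0, None],
}
full = impute_min(proteins)
print(full)
```

Because the fill value is derived from the filtered data itself, this works the same way whichever FDR threshold was used to produce the table.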

Protein groups are formed based on UniProt isoform IDs. So if one protein group contains, say, 5 isoforms and another contains these 5 plus an additional 6th one, then these protein groups are considered different. As there are over 70k human isoforms in total (and just over 20k genes), it's no surprise that library-free search can result in tens of thousands of protein groups. The number of unique genes reported is significantly lower, though. Spectronaut in general seems to form protein groups in a different way (a different implementation of the maximum parsimony algorithm), and usually reports fewer distinct protein groups.
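To illustrate why the group count can exceed the gene count, here is a toy sketch. It does not reproduce DIA-NN's actual grouping algorithm; it only treats each distinct set of isoform IDs as its own group, as described above. The isoform IDs, gene names, and the helper `count_groups_and_genes` are made up for the example.

```python
def count_groups_and_genes(
    precursor_isoforms: dict[str, set[str]],
    isoform_to_gene: dict[str, str],
) -> tuple[int, int]:
    """Count distinct isoform sets (groups) and the genes they map to."""
    # Two groups are the same only if their isoform sets are identical.
    groups = {frozenset(isoforms) for isoforms in precursor_isoforms.values()}
    genes = {isoform_to_gene[iso] for group in groups for iso in group}
    return len(groups), len(genes)


# Hypothetical isoform IDs: six isoforms of one gene, one isoform of another.
isoform_to_gene = {f"P00001-{i}": "GENEA" for i in range(1, 7)}
isoform_to_gene["P00002-1"] = "GENEB"

five = {f"P00001-{i}" for i in range(1, 6)}
precursor_isoforms = {
    "pep1": five,                  # matches 5 isoforms
    "pep2": five | {"P00001-6"},   # the same 5 plus a 6th isoform
    "pep3": {"P00002-1"},
}

n_groups, n_genes = count_groups_and_genes(precursor_isoforms, isoform_to_gene)
print(n_groups, n_genes)  # 3 groups, 2 genes
```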

When using a library from Spectronaut, please use the "Reannotate" option (below the FASTA panel) in the latest version of DIA-NN, which fully enables DIA-NN's protein grouping algorithm. (Alternatively, disable "protein inference" in DIA-NN and thus rely entirely on the protein groups in the library.) The reason is that without "Reannotate", DIA-NN assumes that the library lists the complete set of isoform IDs, which is not the case with Spectronaut's libraries.

As far as I know, there's no need to use --library-headers when using spectral libraries exported from Spectronaut.

Hope this helps!

Best wishes,

Vadim