vdemichev / DiaNN

DIA-NN - a universal automated software suite for DIA proteomics data analysis.

Question about quantification #1162

Open Gambrian opened 2 months ago

Gambrian commented 2 months ago

Dear Demichev, We want to reproduce the results of another laboratory. We have the raw data from their experiment and all of their DIA-NN parameters, but not their spectral library. They used DIA-NN version 1.8.1, and so did we (I have a separate question about version 1.9.1, which I will raise in another issue). We built a spectral library from the FASTA and used the same parameters as they did, apart from the database. In the end we identified about 10% fewer proteins than they did, which seems normal given their greater experience.

But what I can't understand is that among the proteins that are identified, only 50% of our proteins have a Pearson correlation coefficient greater than 0.9 in the two search results. The correlation between the two searches of the same sample is 0.92-0.99 (similar to biological replicates, even though both results come from the same raw file). As I understand it, the main difficulty in DIA-NN is deciding which mass spectrometry peaks belong to a protein. Once those peaks have been assigned, the quantities should ideally be almost identical. So although the number of identified proteins will change when a different database is used, I would expect most of the proteins identified in both searches to have similar expression levels. Is there something wrong with my understanding, or where can I find an introduction to how quantification works? Thank you very much.

This is an Astral protein project, and these are my search parameters (version 1.8.1):

diann.exe --f * (6 files) --lib --threads 16 --verbose 1 --out outpath\report.tsv --qvalue 0.01 --matrices --out-lib outpath\mus.tsv --gen-spec-lib --predictor --reannotate --fasta mus_musculus.fasta --fasta-search --min-fr-mz 150 --max-fr-mz 2000 --met-excision --cut K,R --missed-cleavages 2 --min-pep-len 5 --max-pep-len 52 --min-pr-mz 380 --max-pr-mz 980 --min-pr-charge 1 --max-pr-charge 6 --unimod4 --var-mods 1 --var-mod UniMod:35,15.994915,M --var-mod UniMod:1,42.010565,n --monitor-mod UniMod:1 --reanalyse --relaxed-prot-inf --smart-profiling --pg-level 1 --peak-center --no-ifs-removal
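
The comparison described in this post (one Pearson coefficient per protein, computed across the same runs in two search results, then the share of proteins above 0.9) could be sketched roughly as follows. This is only an illustration: the protein IDs and quantities are made-up toy data, `pearson` and `fraction_above` are hypothetical helper names, and the real input would be the protein matrices from the two searches.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences; None if undefined."""
    n = len(x)
    if n != len(y) or n < 2:
        return None
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # a constant profile has no defined correlation
    return sxy / math.sqrt(sxx * syy)

def fraction_above(search_a, search_b, threshold=0.9):
    """Share of proteins present in both searches with correlation above threshold."""
    shared = sorted(set(search_a) & set(search_b))
    rs = [pearson(search_a[p], search_b[p]) for p in shared]
    rs = [r for r in rs if r is not None]
    if not rs:
        return 0.0
    return sum(r > threshold for r in rs) / len(rs)

# Hypothetical toy data: protein -> quantities in the same 6 runs, one dict per search
ours = {
    "P1": [10.0, 12.0, 14.0, 16.0, 18.0, 20.0],
    "P2": [5.0, 5.1, 4.9, 5.0, 5.2, 4.8],
    "P3": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
}
theirs = {
    "P1": [10.5, 12.1, 13.8, 16.2, 17.9, 20.3],
    "P2": [5.1, 4.9, 5.0, 5.2, 4.8, 5.0],
    "P4": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
}

share = fraction_above(ours, theirs)
print(share)
```

Only proteins found in both searches (here P1 and P2) enter the calculation. With six runs per protein, each coefficient rests on just six points, so a nearly flat profile such as P2 can score low even when both searches report almost the same quantities.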

vdemichev commented 2 months ago

Hi,

As indicated in another thread, please never combine --fasta-search (FASTA digest) with analysis of raw files in one run/pipeline step.
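
One way to follow this advice is to split the work into two separate DIA-NN invocations: a first run that only digests the FASTA and predicts a library, and a second run that analyses the raw files against that library. The sketch below merely assembles the two argument lists and does not execute anything. It uses only flags that appear in the command above; the file names and the helper names `library_command` and `search_command` are hypothetical, and the library path given to the second step should be whatever file the first step reports writing.

```python
def library_command(fasta, out_lib, threads=16):
    """Step 1: in-silico FASTA digest and predicted library, with no raw files."""
    return [
        "diann.exe",
        "--fasta", fasta,
        "--fasta-search",
        "--gen-spec-lib",
        "--predictor",
        "--out-lib", out_lib,
        "--threads", str(threads),
    ]

def search_command(raw_files, library, out_report, threads=16):
    """Step 2: analyse raw files against the library, without --fasta-search."""
    cmd = ["diann.exe"]
    for path in raw_files:
        cmd += ["--f", path]  # one --f per raw file
    cmd += [
        "--lib", library,
        "--out", out_report,
        "--qvalue", "0.01",
        "--matrices",
        "--reanalyse",
        "--threads", str(threads),
    ]
    return cmd

# Hypothetical file names, for illustration only
step1 = library_command("mus_musculus.fasta", "mus_lib.tsv")
step2 = search_command(["run1.raw", "run2.raw"], "mus_lib.predicted.speclib", "report.tsv")
print(" ".join(step1))
print(" ".join(step2))
```

The remaining digest and precursor options from the original command (`--cut`, `--missed-cleavages`, the m/z and charge ranges, the modification flags) would presumably go into the first step, since they shape the predicted library.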

> But what I can't understand is that among the proteins that are identified, only 50% of our proteins have a Pearson correlation coefficient greater than 0.9 in the two search results.

If you are running library-free and they used a spectral library, then you quantify different peptides per protein, so this is very much expected.

Considering that in some experiments up to 90% of proteins can have no detectable biological variation - only technical noise - the numbers you get are OK.
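
A small toy calculation may help show why this happens. The same made-up technical noise is added to a protein with no variation across runs and to a protein with strong variation, once for each of two hypothetical quantifications. All numbers are invented for illustration and are not taken from the experiment in this thread.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def observed(true_profile, noise):
    """Measured quantities = true quantities + technical noise."""
    return [t + e for t, e in zip(true_profile, noise)]

# Fixed, made-up technical noise for two quantifications of the same 6 runs
noise_a = [0.10, -0.20, 0.05, 0.15, -0.10, 0.00]
noise_b = [-0.15, 0.10, 0.20, -0.05, 0.00, -0.10]

flat = [10.0] * 6                             # no biological variation
varying = [8.0, 9.0, 10.0, 11.0, 12.0, 13.0]  # strong biological variation

r_flat = pearson(observed(flat, noise_a), observed(flat, noise_b))
r_varying = pearson(observed(varying, noise_a), observed(varying, noise_b))
print(r_flat, r_varying)
```

Both quantifications stay within about 2% of the true value for either protein, yet only the varying protein clears the 0.9 threshold. For the flat protein the coefficient reflects nothing but the two independent noise patterns.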

Best, Vadim

Gambrian commented 2 months ago

Hi,

Thank you very much for your answer. After looking into it further, I found that they also predicted their spectral library from FASTA files (different files, but with essentially the same content). However, our quantitative results still showed that only 56% of the proteins had a Pearson correlation above 0.9. Is this expected? We searched with the same software version, the same parameters, the same raw files, and very similar FASTA files. By the way, the expression intensities were standardized with the diann R package, using the parameters from the sample code of your diann R package, so the process is exactly the same.

Best

vdemichev commented 2 months ago

> our quantitative results still showed that only 56% of the proteins had a Pearson correlation above 0.9

Seems fine to me. 0.9 is a very high number for most experiments.