vdemichev / DiaNN

DIA-NN - a universal automated software suite for DIA proteomics data analysis.

How to handle when protein groups id contains semicolons #1267

Closed Gambrian closed 3 days ago

Gambrian commented 4 days ago

Dear Vadim

I spent some time testing the performance of 1.9.2 and the impact on protein inference (on 1.9.1), and I wanted to report back and ask for some advice.

  1. I tested 1.9.2 and 1.9.1 on 6 studies (5 Astral and 1 dia-4d). In most studies, 1.9.2 identified slightly more proteins and took slightly less time. Incidentally, the spectral library was much smaller, half the size of the 1.9.1 library.
  2. I compared the quantification of 1.9.2 and 1.9.1. For most proteins the correlation was higher than 0.8, but it can be lower in samples with relatively low protein content, such as paraffin sections and swab samples, and in studies without good reference proteome data. To check whether low-quality protein quantification goes along with lower correlation, I picked proteins with CV in the top 25%, and they did have higher correlation. But ① I still want to know: how can I be sure that my proteins are quantified well?
  3. Regarding the problem I mentioned before (https://github.com/vdemichev/DiaNN/issues/1224#issuecomment-2434174497): in some studies, about 1/3 of the protein group IDs contain ";", i.e. they list several accession IDs. This is what I get with protein inference set to "Protein Name (from fasta)", "Isoform ID", and "Gene":
library(diann)     # diann_load(), diann_maxlfq()
library(stringr)   # str_detect()
library(magrittr)  # %>%

# Protein inference set to "Gene"
diann_result_gene <- diann_load(file.path(work_dir_gene,"report.tsv"))
abundance_df_gene <- diann_maxlfq(diann_result_gene, group.header="Protein.Group", id.header = "Precursor.Id", quantity.header = "Precursor.Normalised")

# Protein inference set to "Isoform ID"
diann_result_isoform <- diann_load(file.path(work_dir_isoform,"report.tsv"))
abundance_df_isoform <- diann_maxlfq(diann_result_isoform, group.header="Protein.Group", id.header = "Precursor.Id", quantity.header = "Precursor.Normalised")

# Protein inference set to "Protein Name (from fasta)"
diann_result_protein <- diann_load(file.path(work_dir_protein,"report.tsv"))
abundance_df_protein <- diann_maxlfq(diann_result_protein, group.header="Protein.Group", id.header = "Precursor.Id", quantity.header = "Precursor.Normalised")

> ## gene
> str_detect(diann_result_gene$Protein.Group,";") %>% sum() 
[1] 408760
> str_detect(diann_result_gene$Protein.Group,"cRAP-") %>% sum() 
[1] 1546
> nrow(diann_result_gene)
[1] 1297780
> 
> str_detect(abundance_df_gene %>% rownames(),";") %>% sum() 
[1] 3130
> str_detect(rownames(abundance_df_gene),"cRAP-") %>% sum()
[1] 25
> nrow(abundance_df_gene)
[1] 8062
> 
> str_detect(diann_result_gene$Genes,";") %>% sum()
[1] 24190
> 
> ## isoform
> str_detect(diann_result_isoform$Protein.Group,";") %>% sum() 
[1] 376122
> str_detect(diann_result_isoform$Protein.Group,"cRAP-") %>% sum() 
[1] 3345
> nrow(diann_result_isoform)
[1] 1297780
> 
> str_detect(abundance_df_isoform %>% rownames(),";") %>% sum() 
[1] 3091
> str_detect(rownames(abundance_df_isoform),"cRAP-") %>% sum()
[1] 29
> nrow(abundance_df_isoform)
[1] 8170
> 
> str_detect(diann_result_isoform$Genes,";") %>% sum()
[1] 24574
> 
> ## protein name
> str_detect(diann_result_protein$Protein.Group,";") %>% sum() 
[1] 376122
> str_detect(diann_result_protein$Protein.Group,"cRAP-") %>% sum() 
[1] 3345
> nrow(diann_result_protein)
[1] 1297780
> 
> str_detect(abundance_df_protein %>% rownames(),";") %>% sum() 
[1] 3091
> str_detect(rownames(abundance_df_protein),"cRAP-") %>% sum()
[1] 29
> nrow(abundance_df_protein)
[1] 8170
> 
> str_detect(diann_result_protein$Genes,";") %>% sum()
[1] 24574
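
To put the counts above in proportion, here is a small base R sketch that turns them into percentages. The numbers are copied from the console output; the data frame and its column names are only illustrative.

```r
# Counts of Protein.Group entries containing ";", copied from the output above.
counts <- data.frame(
  setting      = c("gene", "isoform", "protein name"),
  rows_semi    = c(408760, 376122, 376122),     # report rows with ";"
  rows_total   = c(1297780, 1297780, 1297780),  # all report rows
  groups_semi  = c(3130, 3091, 3091),           # matrix row names with ";"
  groups_total = c(8062, 8170, 8170),           # all matrix rows
  stringsAsFactors = FALSE
)

# Share of entries with a multi-ID protein group, in percent.
counts$pct_rows   <- round(100 * counts$rows_semi / counts$rows_total, 1)
counts$pct_groups <- round(100 * counts$groups_semi / counts$groups_total, 1)

print(counts[, c("setting", "pct_rows", "pct_groups")])
```

At the protein group level this gives roughly 38.8% for "Gene" against 37.8% for the other two settings, a difference of about one percentage point.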

So setting protein inference to "Isoform ID" does reduce this, but only slightly. As you predicted, most rows do not have ";" in the Genes column. So when a row has no ";" in the Genes column I want to keep only the first accession ID, and when it does have ";" I want to delete the row, so that as many results as possible can be annotated with the UniProt functional annotation. ② Do you think this is the right approach? ③ One more small question: since DIA-NN has already normalized the results, should I skip diann_maxlfq and use diann_matrix to obtain the expression matrix directly?
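
A minimal base R sketch of the rule described above, on a toy table rather than a real report. The accession IDs and gene names are made up, and `collapse_protein_groups` is a hypothetical helper name, not part of the diann package.

```r
# Toy stand-in for a DIA-NN report; only the two columns the rule needs.
report <- data.frame(
  Protein.Group = c("P12345;Q67890", "P11111", "A0A000;B1B111", "Q22222"),
  Genes         = c("GENE1", "GENE2", "GENE3;GENE4", "GENE5"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop rows whose Genes entry lists several genes,
# then keep only the first accession ID of each remaining protein group.
collapse_protein_groups <- function(df) {
  ambiguous <- grepl(";", df$Genes, fixed = TRUE)
  kept <- df[!ambiguous, , drop = FALSE]
  kept$Protein.Group <- sub(";.*$", "", kept$Protein.Group)
  kept
}

filtered <- collapse_protein_groups(report)
print(filtered)
```

One thing to watch: after truncation, a group such as "P12345;Q67890" and a separate group "P12345" would share the same label, so it may be worth checking for duplicated IDs before building a matrix from the result.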

Best

Originally posted by @Gambrian in https://github.com/vdemichev/DiaNN/issues/1224#issuecomment-2470266290

vdemichev commented 4 days ago

Hi,

> and the correlation of most proteins was higher than 0.8

Important point: DIA-NN quantifies proteins for a single purpose: comparison of the levels of each protein between samples in the same analysis. Absolute protein quantification is not the goal, hence absolute protein quantities are not meant to be comparable between different DIA-NN versions or settings.
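
The point can be illustrated with a toy example in base R. This is only a sketch with made-up numbers, not a description of how DIA-NN computes quantities: two hypothetical analyses report different absolute values for each protein but the same ratios between samples, which is the comparison that matters.

```r
# Hypothetical protein-by-sample quantities from one analysis.
a <- matrix(c(100, 200, 400, 800,
               50,  40,  30,  20),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("P1", "P2"), paste0("S", 1:4)))

# A second hypothetical analysis: each protein is on a different absolute
# scale (row 1 times 3, row 2 times 0.5), between-sample ratios unchanged.
b <- sweep(a, 1, c(3, 0.5), `*`)

# Log2 fold change of sample S2 against S1, per protein, in both analyses.
lfc_a <- log2(a[, "S2"] / a[, "S1"])
lfc_b <- log2(b[, "S2"] / b[, "S1"])

# Per-protein correlation of the sample profiles across the two analyses.
profile_cor <- sapply(rownames(a), function(p) cor(log2(a[p, ]), log2(b[p, ])))

print(rbind(lfc_a, lfc_b))
print(profile_cor)
```

Here the absolute values disagree between `a` and `b`, yet the fold changes and the per-protein sample profiles agree, so within-analysis comparisons lead to the same conclusions.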

> delete the row when there is ";" in the gene column

For most analyses - yes.

> use diann_matrix to obtain the results directly?

Yes, unless you want to either filter peptides somehow or adjust their quantities.
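
To make the idea of taking the results directly concrete, here is a base R sketch that reshapes protein-level quantities already present in a report into a protein-by-run matrix, without recomputing anything from precursors. It does not reproduce the internals of diann_matrix; the toy values are made up, and the column names `Run`, `Protein.Group` and `PG.MaxLFQ` are assumed to follow the main report layout.

```r
# Toy long-format report: one row per precursor per run, with the
# protein-level quantity repeated on every row of the same group and run.
report <- data.frame(
  Run           = c("run1", "run1", "run1", "run2", "run2"),
  Protein.Group = c("P1", "P1", "P2", "P1", "P2"),
  Precursor.Id  = c("AAK2", "CCR2", "DDK2", "AAK2", "DDK2"),
  PG.MaxLFQ     = c(1000, 1000, 500, 1200, 450),
  stringsAsFactors = FALSE
)

# One row per (run, protein group), then pivot to a wide matrix.
pg <- unique(report[, c("Run", "Protein.Group", "PG.MaxLFQ")])
mat <- tapply(pg$PG.MaxLFQ, list(pg$Protein.Group, pg$Run), function(v) v[1])

print(mat)
```

Protein groups missing from a run would show up as NA in the matrix. Recomputing with diann_maxlfq instead becomes relevant when the precursor rows are filtered or their quantities adjusted first, since the stored protein-level values would then no longer match the input.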

Best, Vadim

Gambrian commented 3 days ago

Thank you so much.

Best