vdemichev / DiaNN

DIA-NN - a universal automated software suite for DIA proteomics data analysis.

How to add protein IDs to "unique_genes_matrix.tsv" in DIANN1.8 #763

Closed clubmann closed 1 year ago

clubmann commented 1 year ago

Dear Demichev,

Unlike ver1.7, ver1.8 writes only gene names to "unique_genes_matrix" alongside the quantitative values. Therefore, to get protein ID information, I need to use something like the VLOOKUP function to pull protein IDs from "pg.matrix" or "pr.matrix" by gene name. In doing so, there may be rare cases where the protein ID is not assigned properly because two protein IDs share the same gene name. Please let me know how to solve this problem.

Thank you very much for your support. Best regards, clubmann

vdemichev commented 1 year ago

Hi clubmann,

Can you please describe the issue in more detail, with screenshots of how things look and how they should look? In general, if there's any problem with the matrices, please just use the main report.

Best, Vadim

clubmann commented 1 year ago

Hi Vadim,

Thank you for your quick reply. Here is an example of a pg.matrix with a redundant gene name. [Image: Ex_pg_matrix]

The VLOOKUP function in Excel picks up only one of the two rows that share the gene name. [Image: Ex_vlookup]

Should I choose one of the rows and delete the other one from the pg_matrix before performing the VLOOKUP? In this case, the main report was too big to open in Excel.

Best regards, clubmann
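For readers who run into the same size limit: a main report that is too big for Excel can still be read line by line in a script. The following is only a minimal sketch, not something from DIA-NN itself. It assumes the report is tab-separated and has columns named "Genes" and "Protein.Group"; those column names, the toy data, and the helper name gene_to_protein_pairs are assumptions for illustration.

```python
import csv
import io

# Tiny stand-in for a main report. The column names are assumptions;
# check the header of your own report before relying on them.
REPORT_TSV = (
    "Run\tProtein.Group\tGenes\tPrecursor.Id\n"
    "run1\tP11111\tGENEA\tPEPTIDEA2\n"
    "run2\tP11111\tGENEA\tPEPTIDEA2\n"
    "run1\tP33333\tGENEB\tPEPTIDEC3\n"
)


def gene_to_protein_pairs(handle, gene_col="Genes", protein_col="Protein.Group"):
    """Stream a tab-separated report and collect distinct (gene, protein) pairs.

    Rows are read one at a time, so only the set of distinct pairs is kept
    in memory, not the whole report.
    """
    pairs = set()
    for row in csv.DictReader(handle, delimiter="\t"):
        pairs.add((row[gene_col], row[protein_col]))
    return sorted(pairs)


pairs = gene_to_protein_pairs(io.StringIO(REPORT_TSV))
for gene, protein in pairs:
    print(gene, protein)
```

With a real file, the same function can be given a handle from open(path, newline=""), where path is wherever the report was saved.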

vdemichev commented 1 year ago

'Protein.Names' and 'Genes' refer to different things: the first is not a gene name but rather a 'protein name' as encoded in UniProt. For example, there are over 70k unique protein names in the human proteome but just over 20k genes. Does this clarify the question?

clubmann commented 1 year ago

The UniProt accession number is what suits our downstream analysis. However, the first output file, "unique_genes_matrix", doesn't provide the UniProt accession number and gives us only the gene name. So my question is: what is the best way to connect the two?

vdemichev commented 1 year ago

If you are only interested in gene names, you can use the unique_genes matrix. Otherwise, use the pg_matrix, which has both?
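If the unique_genes quantities are still needed alongside protein IDs, one option is to join the two matrices in a script instead of a spreadsheet lookup, keeping every protein group found for a gene and not just the first match. This is only a sketch under assumptions: it assumes the pg_matrix has "Protein.Group" and "Genes" columns, that the unique_genes matrix has a "Genes" column, and that the gene strings match exactly between the two files. The toy data and the added columns "Protein.Groups" and "N.Protein.Groups" are made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Toy stand-ins for the two matrices; column names are assumptions.
PG_MATRIX_TSV = (
    "Protein.Group\tGenes\trun1\n"
    "P11111\tGENEA\t100.0\n"
    "P22222\tGENEA\t50.0\n"
    "P33333\tGENEB\t70.0\n"
)
UNIQUE_GENES_TSV = (
    "Genes\trun1\n"
    "GENEA\t120.0\n"
    "GENEB\t70.0\n"
    "GENEC\t10.0\n"
)


def read_tsv(handle):
    """Read a tab-separated table into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))


def annotate_unique_genes(unique_rows, pg_rows):
    """Attach all protein groups seen for each gene, plus how many there are."""
    groups_by_gene = defaultdict(set)
    for row in pg_rows:
        groups_by_gene[row["Genes"]].add(row["Protein.Group"])

    annotated = []
    for row in unique_rows:
        groups = sorted(groups_by_gene.get(row["Genes"], ()))
        annotated.append({
            **row,
            # " | " is used as the separator so it cannot be confused with
            # any ";" that may already occur inside a protein group ID.
            "Protein.Groups": " | ".join(groups),
            "N.Protein.Groups": len(groups),
        })
    return annotated


pg_rows = read_tsv(io.StringIO(PG_MATRIX_TSV))
unique_rows = read_tsv(io.StringIO(UNIQUE_GENES_TSV))
annotated = annotate_unique_genes(unique_rows, pg_rows)
for row in annotated:
    print(row["Genes"], row["Protein.Groups"], row["N.Protein.Groups"])
```

Rows with N.Protein.Groups greater than 1 are the ambiguous cases that a VLOOKUP would silently resolve to a single row; rows with 0 had no exact gene match in the pg_matrix.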

clubmann commented 1 year ago

That would be an alternative, but the quantification values differ between the pg file and the unique genes file. Does the pg file have MaxLFQ values?

vdemichev commented 1 year ago

Yes, all quantities are MaxLFQ

clubmann commented 1 year ago

Thank you for the clarification! Let me ask one more question. What causes the difference in quantification values between the pg file and the unique genes file?

vdemichev commented 1 year ago

Unique genes are quantified using precursors that match only these genes and no other genes. Protein groups are quantified using precursors allocated to these groups by the protein inference algorithm.
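A toy illustration of why the two quantities can differ: the sets of precursors that feed each one are not the same. This does not reproduce DIA-NN's protein inference or MaxLFQ; the precursor IDs, gene sets, and group assignments below are invented, and the two helper functions are hypothetical.

```python
# Hypothetical precursors: the genes each one matches, and the protein group
# that an inference step is assumed to have allocated it to.
PRECURSORS = [
    {"id": "PEPTIDEA2", "genes": {"GENEA"}, "group": "P11111"},
    {"id": "PEPTIDEB2", "genes": {"GENEA", "GENEB"}, "group": "P11111"},
    {"id": "PEPTIDEC3", "genes": {"GENEB"}, "group": "P33333"},
]


def precursors_for_unique_gene(precursors, gene):
    """Precursors that match this gene and no other gene."""
    return [p["id"] for p in precursors if p["genes"] == {gene}]


def precursors_for_protein_group(precursors, group):
    """Precursors allocated to this protein group, shared or not."""
    return [p["id"] for p in precursors if p["group"] == group]


unique_gene_a = precursors_for_unique_gene(PRECURSORS, "GENEA")
group_p11111 = precursors_for_protein_group(PRECURSORS, "P11111")
print("unique GENEA:", unique_gene_a)
print("group P11111:", group_p11111)
```

In this made-up case the shared precursor PEPTIDEB2 contributes to the protein group but not to the unique gene, so the two values would be computed from different inputs.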

clubmann commented 1 year ago

I appreciate your clear answer. Thank you so much for sparing your time for me!