ventolab / CellphoneDB

CellPhoneDB can be used to search for a particular ligand/receptor, or interrogate your own HUMAN single-cell transcriptomics data.
https://www.cellphonedb.org/
MIT License

CPDB database_v4 errors #67

Closed Leonhard2000 closed 10 months ago

Leonhard2000 commented 1 year ago

Hi,

I found a few inconsistent entries within "gene_table" in the v4 database:

  1. Ensembl_ID of NPBWR1 is outdated (gene_id 1209): old ENSG00000183729, new ENSG00000288611

  2. Some gene_name, ensembl & protein_id values do not fit together (for all other entries, a gene_name has a single protein_id but can have several ensembl_ids):

The gene CCL3L1 = ENSG00000276085 (id_gene 182 & 186) has 2 different protein entries (415 & 5). It should be 415 for all (even CCL3L3) as they have the same UniProt-ID P16619. -> interaction_table must be updated too because 415 & 5 are now treated separately -> duplicate in protein_table is unnecessary

The gene IFNA1 = ENSG00000197919 (id_gene 730 & 732) has 2 different protein entries (183 & 14). It should be 183 for all (even IFNA13) as they have the same UniProt-ID P01562. -> interaction_table must be updated too because 14 & 183 both have the same interaction with CR2 (protein_id 449) and thus are counted twice -> duplicate in protein_table is unnecessary

  3. The following genes have the same Ensembl_ID but different synonyms are used (use 1 consistent gene_name): FAM19A1 = TAFA1 = ENSG00000183662 (id_gene 489 & 490); FAM19A4 = TAFA4 = ENSG00000163377 (id_gene 491 & 492); FAM19A5 = TAFA5 = ENSG00000219438 (id_gene 493 & 494); FAM2213B = PRXL2B = ENSG00000157870 (id_gene 495 & 497)

  4. to_be_reviewed Ensembl_IDs: ENSG00000233056 for ERVH48-1 (id_gene 481); ENSG00000211451 for GNRHR2 (id_gene 604); ENSG00000210082 for MT-RNR2 (id_gene 1166)
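
For anyone who wants to repeat the duplicate-protein and synonym checks described above, here is a minimal pandas sketch. It assumes gene_table has been loaded as a DataFrame with the columns named in this issue (id_gene, gene_name, ensembl, protein_id). The function name and the toy rows are hypothetical: the rows are loosely modelled on the entries above, and the pairing of id_gene with protein_id as well as the protein_id values 100 and 200 are made up for illustration.

```python
import pandas as pd


def find_inconsistencies(gene_table: pd.DataFrame) -> dict:
    """Report Ensembl IDs linked to several protein_ids or several gene_names."""
    per_ensembl = gene_table.groupby("ensembl")
    # Ensembl IDs that point at more than one protein_id (possible duplicate proteins)
    n_proteins = per_ensembl["protein_id"].nunique()
    # Ensembl IDs that appear under more than one gene_name (synonyms in use)
    n_names = per_ensembl["gene_name"].nunique()
    return {
        "multiple_protein_ids": sorted(n_proteins[n_proteins > 1].index),
        "multiple_gene_names": sorted(n_names[n_names > 1].index),
    }


# Toy table for illustration only; not a copy of the real gene_table.
toy = pd.DataFrame(
    {
        "id_gene": [182, 186, 489, 490, 1],
        "gene_name": ["CCL3L1", "CCL3L1", "FAM19A1", "TAFA1", "GENE_X"],
        "ensembl": [
            "ENSG00000276085",
            "ENSG00000276085",
            "ENSG00000183662",
            "ENSG00000183662",
            "ENSG_X",
        ],
        "protein_id": [415, 5, 100, 100, 200],  # 100 and 200 are placeholders
    }
)

report = find_inconsistencies(toy)
print(report)
```

On a real table, the first list flags cases like CCL3L1 and IFNA1, and the second flags synonym pairs like FAM19A1/TAFA1.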

Hope this helps. Leo

luzgaral commented 1 year ago

Dear Leonhard2000,

You can translate IDs using your preferred Ensembl version and provide gene symbols to CPDB. This way you can decide how to handle synonyms and 1:many matches. It also keeps your gene symbols consistent across all of your analyses whose outputs contain gene symbols.
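
As a rough illustration of that translation step, the sketch below converts an Ensembl-indexed count matrix to gene symbols using a user-supplied mapping table. Everything here is an assumption rather than part of CPDB: the function name, the mapping columns (ensembl, symbol), and the chosen policy (drop Ensembl IDs that map to more than one symbol, sum rows that end up with the same symbol). The mapping itself would come from whichever Ensembl release you prefer.

```python
import pandas as pd


def ensembl_to_symbols(counts: pd.DataFrame, mapping: pd.DataFrame) -> pd.DataFrame:
    """Re-index a genes x cells matrix from Ensembl IDs to gene symbols.

    Policy (one possible choice): Ensembl IDs with more than one symbol are
    dropped as ambiguous, unmapped IDs are dropped, and rows that share a
    symbol are summed.
    """
    n_symbols = mapping.groupby("ensembl")["symbol"].nunique()
    unambiguous = n_symbols[n_symbols == 1].index
    lookup = (
        mapping[mapping["ensembl"].isin(unambiguous)]
        .drop_duplicates("ensembl")
        .set_index("ensembl")["symbol"]
    )
    kept = counts.loc[counts.index.isin(lookup.index)]
    symbols = lookup.loc[kept.index].to_numpy()
    return kept.groupby(symbols).sum()


# Toy mapping: both NPBWR1 IDs from this issue map to one symbol;
# ENSG_AMBIG is a placeholder ID with two symbols.
mapping = pd.DataFrame(
    {
        "ensembl": ["ENSG00000183729", "ENSG00000288611", "ENSG_AMBIG", "ENSG_AMBIG"],
        "symbol": ["NPBWR1", "NPBWR1", "SYM_A", "SYM_B"],
    }
)
# Toy counts (genes x cells); ENSG_UNMAPPED has no entry in the mapping.
counts = pd.DataFrame(
    {"cell1": [1, 3, 5, 7], "cell2": [2, 4, 6, 8]},
    index=["ENSG00000183729", "ENSG00000288611", "ENSG_AMBIG", "ENSG_UNMAPPED"],
)

by_symbol = ensembl_to_symbols(counts, mapping)
print(by_symbol)
```

Other policies (for example keeping the first symbol, or taking the maximum instead of the sum) are equally defensible; the point is that the choice stays in your hands.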

Thank you for raising the Ensembl updates; we will definitely look into them.

Best,

Luz

datasome commented 10 months ago

Hi Leo,

Many thanks for your help in maintaining the mapping of CellphoneDB data to Ensembl. I know a long time has passed since this issue was first reported, but for the record I wanted to say that 2, 4 and 5 above were fixed in https://github.com/ventolab/cellphonedb-data/blob/v5-release/data/gene_input.csv and 1 in https://github.com/ventolab/cellphonedb-data/blob/v5.1-release/data/gene_input.csv (v5.1 is the release we are only now beginning to prepare; no date has been set for it yet).

Best,

Robert.