stephenturner / annotables

R data package for annotating/converting Gene IDs
http://www.gettinggeneticsdone.blogspot.com/2015/11/annotables-convert-gene-ids.html

Entries with same ENSG #13

Closed pesho-ivanov closed 6 years ago

pesho-ivanov commented 6 years ago

I suspect mistakes in gene symbols.

I had wrongly assumed a one-to-one correspondence between the rows of grch38/grch37 and distinct ENSG IDs. It turns out that some ensgene values are repeated:

> sum(data.frame(table(grch38$ensgene))$Freq > 1)                                 
[1] 361

I checked several of these 361 duplicated genes, and the Entrez gene IDs appear to be the only difference:

> grch38[ grch38$ensgene == "ENSG00000198668",  ]
# A tibble: 3 x 9
  ensgene         entrez symbol chr     start    end strand biotype description
  <chr>            <int> <chr>  <chr>   <int>  <int>  <int> <chr>   <chr>      
1 ENSG00000198668    801 CALM1  14     9.04e7 9.04e7      1 protei… calmodulin…
2 ENSG00000198668    805 CALM1  14     9.04e7 9.04e7      1 protei… calmodulin…
3 ENSG00000198668    808 CALM1  14     9.04e7 9.04e7      1 protei… calmodulin…
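To list every affected ID at once instead of checking them one by one, something like the following might work. It is only a sketch in base R: `toy` is a made-up stand-in for `grch38` built from the rows shown above, and `ENSG_TOY_1` / `TOY1` are hypothetical placeholders for an unambiguous gene.

```r
# Toy stand-in for annotables::grch38: the three rows shown above plus one
# hypothetical gene (ENSG_TOY_1) that maps to a single Entrez ID.
toy <- data.frame(
  ensgene = c("ENSG00000198668", "ENSG00000198668", "ENSG00000198668", "ENSG_TOY_1"),
  entrez  = c(801L, 805L, 808L, 1L),
  symbol  = c("CALM1", "CALM1", "CALM1", "TOY1"),
  stringsAsFactors = FALSE
)

# Number of distinct, non-missing Entrez IDs per Ensembl gene ID.
n_entrez <- tapply(toy$entrez, toy$ensgene,
                   function(x) length(unique(x[!is.na(x)])))

# Ensembl IDs that map to more than one Entrez ID.
multi <- names(n_entrez)[n_entrez > 1]

# All rows belonging to those IDs, for inspection.
dup_rows <- toy[toy$ensgene %in% multi, ]
print(dup_rows)
```

With the real package loaded, replacing `toy` with `grch38` should give the full set of one-to-many rows.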

I then looked up the different Entrez gene IDs on the NCBI website, and they point to different CALM genes (not only CALM1).

CALM1: https://www.ncbi.nlm.nih.gov/gene/?term=801
CALM2: https://www.ncbi.nlm.nih.gov/gene/?term=805
CALM3: https://www.ncbi.nlm.nih.gov/gene/?term=808

Version:

> packageVersion("annotables")
[1] ‘0.1.91’

stephenturner commented 6 years ago

Yes, interesting cases where annotations differ. I think this has been seen before.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4339237/

https://www.biostars.org/p/16505/

I'm pulling data directly from Ensembl. What might be interesting is to take all the cases where a single Ensembl ID maps to multiple Entrez IDs, use those Entrez IDs with the Bioconductor annotation packages to look up the gene symbols, and then look at the cases where the symbols differ: count them and try to explain why.
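The comparison proposed above could be sketched roughly as follows. This is an illustration under stated assumptions, not a tested workflow: `toy` stands in for `grch38`, and `entrez_symbol` is a hand-written lookup (taken from the three NCBI links earlier in the thread) standing in for a Bioconductor annotation package. With `org.Hs.eg.db` installed, the lookup would presumably come from `AnnotationDbi::mapIds()` instead, as noted in the comment.

```r
# Toy stand-in for annotables::grch38, using the rows from the report.
toy <- data.frame(
  ensgene = rep("ENSG00000198668", 3),
  entrez  = c(801L, 805L, 808L),
  symbol  = rep("CALM1", 3),
  stringsAsFactors = FALSE
)

# Hand-written Entrez -> symbol lookup (an assumption for illustration).
# With Bioconductor available, something like this could replace it:
#   AnnotationDbi::mapIds(org.Hs.eg.db, keys = as.character(toy$entrez),
#                         column = "SYMBOL", keytype = "ENTREZID")
entrez_symbol <- c("801" = "CALM1", "805" = "CALM2", "808" = "CALM3")

# Attach the symbol implied by each Entrez ID.
toy$entrez_symbol <- unname(entrez_symbol[as.character(toy$entrez)])

# Keep only Ensembl IDs that occur on more than one row.
multi_ids <- names(which(table(toy$ensgene) > 1))
multi <- toy[toy$ensgene %in% multi_ids, ]

# Rows where the two symbols disagree; which() drops comparisons that are NA.
mismatch <- multi[which(multi$symbol != multi$entrez_symbol), ]
print(mismatch)
cat("Ensembl IDs with a differing symbol:",
    length(unique(mismatch$ensgene)), "\n")
```

On the toy rows this flags the entries for Entrez IDs 805 and 808, whose NCBI symbols are CALM2 and CALM3 rather than CALM1.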