grimbough / biomaRt

R package providing query functionality to BioMart instances like Ensembl
https://bioconductor.org/packages/biomaRt/

hugo -> entrez incomplete #7

Open smoe opened 6 years ago

smoe commented 6 years ago

Hello,

I presume this is an issue with reassigned IDs that is beyond your control, but you may still want to look into it. The background is that I wanted to map the LINCS L1000 landmark genes to a dataset of mine. These landmark genes are given as HUGO gene names in the file GSE70138_Broad_LINCS_gene_info_2017-03-06.txt on GEO.

Surprisingly, that list contains only 978 genes rather than 1000. When running biomaRt over it as in:

results <- getBM(attributes=c('hgnc_symbol','entrezgene'),
                 filters = 'hgnc_symbol',
                 values = lincs.lm.genes,
                 mart = ensembl)

it misses the following genes, which I have now assigned manually:

missing.table <- matrix(c(
 "TOMM70A",  9868,
 "KIAA0196", 9897,
 "KIAA0907", 22889,
 "PAPD7",    11044,
 "IKBKAP",   8518,
 "TMEM5",    10329,
 "HDGFRP3",  50810,
 "PRUNE",    58497,
 "HN1L",     90861,    # this one is ambiguous unless looking at the full description by LINCS
 "KIAA1033", 23325,
 "TMEM110",  375346,
 "SQRDL",    58472,
 "TMEM2",    23670,
 "ADCK3",    56997,
 "LRRC16A",  55604,
 "FAM63A",   55793
),ncol=2,byrow=TRUE,dimnames=list(NULL,c("hgnc_symbol","entrez")))

Only 1.6% of the IDs are affected, but these are prominent genes, since LINCS selected them. I presume that many users will run into this kind of issue, so maybe you have an idea about it.
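For anyone wanting to reproduce the list of unmapped symbols, a minimal sketch in base R follows. It assumes the result of getBM() is a data frame with an hgnc_symbol column, as in the call above; the helper name find_unmapped and the toy data are hypothetical and only stand in for lincs.lm.genes and a real query result.

```r
# Hypothetical helper (not part of biomaRt): return the queried symbols
# that do not appear in the result data frame.
find_unmapped <- function(queried, results, column = "hgnc_symbol") {
  setdiff(unique(queried), unique(results[[column]]))
}

# Toy data standing in for lincs.lm.genes and a getBM() result.
queried <- c("TP53", "TOMM70A", "IKBKAP", "EGFR")
results <- data.frame(
  hgnc_symbol = c("TP53", "EGFR"),
  entrezgene  = c(7157, 1956),
  stringsAsFactors = FALSE
)

unmapped <- find_unmapped(queried, results)
print(unmapped)
```

With the toy data this leaves "TOMM70A" and "IKBKAP", the two symbols that have no row in the result.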

Cheers,

Steffen

whiteorchid commented 5 years ago

Dear author, I used the getBM function to retrieve the gene names for my Ensembl IDs (results from RSEM), but the number of rows returned does not equal the number of IDs. Is it still possible to find these names, or do these genes not have a name at present? Many thanks!

> genes <- getBM( attributes= "hgnc_symbol", values=mydata$gene_id, uniqueRows = FALSE,mart= mart)
> nrow(genes)

[1] 41766

> mydata$gene_id
 [1] ENSG00000000005 ENSG00000000419 ENSG00000000457 ENSG00000000460 ENSG00000000938
 [6] ENSG00000000971 ENSG00000001036 ENSG00000001084 ENSG00000001167 ENSG00000001460
[11] ENSG00000001461 ENSG00000001497 ENSG00000001561 ENSG00000001617 ENSG00000001626
...
 [ reached getOption("max.print") -- omitted 59553 entries ]
60553 Levels: ENSG00000000005 ENSG00000000419 ENSG00000000457 ENSG00000000460 ... ENSGR0000281849
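One possible reason for the mismatch: the call above passes values but no filters. As far as I understand getBM(), values are only applied through a filter, so without one the query probably returns hgnc_symbol for the whole dataset, and the 41766 rows then have no direct relation to the 60553 input IDs. The sketch below is an assumption-laden illustration, not a confirmed fix: the query is shown only as a comment because it needs the biomaRt package and network access, and align_to_input is a hypothetical base-R helper, shown on toy data, that keeps one or more rows per input ID and marks IDs without a symbol as NA.

```r
# Sketch of the query (not run here; requires biomaRt and network access):
# mart  <- biomaRt::useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# annot <- biomaRt::getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"),
#                         filters    = "ensembl_gene_id",
#                         values     = as.character(mydata$gene_id),
#                         mart       = mart)

# Hypothetical helper: left-join the annotation onto the input IDs so that
# every input ID is kept; IDs without a symbol get NA.
align_to_input <- function(ids, annot) {
  input  <- data.frame(ensembl_gene_id = as.character(ids),
                       stringsAsFactors = FALSE)
  merged <- merge(input, annot, by = "ensembl_gene_id", all.x = TRUE)
  # Treat empty strings in the symbol column as missing as well.
  empty <- !is.na(merged$hgnc_symbol) & merged$hgnc_symbol == ""
  merged$hgnc_symbol[empty] <- NA
  merged
}

# Toy data standing in for mydata$gene_id and a getBM() result;
# the third ID is deliberately left out of the annotation.
ids   <- factor(c("ENSG00000000005", "ENSG00000000419", "ENSG00000000457"))
annot <- data.frame(
  ensembl_gene_id = c("ENSG00000000005", "ENSG00000000419"),
  hgnc_symbol     = c("TNMD", "DPM1"),
  stringsAsFactors = FALSE
)

aligned <- align_to_input(ids, annot)
print(aligned)
```

Including ensembl_gene_id among the attributes is what makes it possible to match the returned symbols back to the input, since the returned rows are not guaranteed to follow the input order.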