grimbough / biomaRt

R package providing query functionality to BioMart instances like Ensembl
https://bioconductor.org/packages/biomaRt/

biomaRt not finding all genes #37

Open danielcgingerich opened 3 years ago

danielcgingerich commented 3 years ago

I need to align a dataset mapped to GRCh38.p2 (Ensembl 79) with a dataset mapped to GRCh38.p13 (Ensembl 98).
The first dataset (Ensembl 79) has gene names and Entrez IDs. The second dataset (Ensembl 98) has gene names and ENSG IDs. I want to convert the Ensembl 79 Entrez IDs to ENSG IDs. When I query with biomaRt, almost half of the genes are not found. I have tried both "external_gene_name" and "entrezgene" as filters, and both the most recent mart and archived marts (Ensembl 77-80).

FYI: approximately 25000 genes were not found, and about 10000 of these are pseudogenes.

Code below:

library(biomaRt)
listEnsemblArchives()
biomart <- useMart("ensembl", host = "https://oct2014.archive.ensembl.org", dataset = "hsapiens_gene_ensembl")
filters <- listFilters(biomart)
attributes <- listAttributes(biomart)

m1.biomart <- getBM(filters = "entrezgene",
                    attributes = c("ensembl_gene_id", "entrezgene", "external_gene_name", "hgnc_symbol"),
                    values = m1.entrez.ids$entrez_id, mart = biomart)

length(unique(m1.entrez.ids$entrez_id))
[1] 50281

length(unique(m1.biomart$entrezgene))
[1] 25987

length(unique(m1.biomart$ensembl_gene_id))
[1] 28701
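
For reference, this is roughly how the unmatched IDs can be listed, as a minimal base-R sketch. The two data frames here are toy stand-ins with made-up values (the real `m1.entrez.ids` and `m1.biomart` come from the code above), and `queried`, `returned` and `missing.ids` are names introduced only for this illustration. With the counts above, and assuming every returned Entrez ID was among those queried, the number of unmatched IDs would be 50281 - 25987 = 24294.

```r
# Toy stand-ins (hypothetical values) for the real objects used above
m1.entrez.ids <- data.frame(entrez_id = c(1, 2, 9, 10, 10, 12))
m1.biomart <- data.frame(
  ensembl_gene_id = c("ENSG_A", "ENSG_B", "ENSG_C"),  # placeholder IDs, not real genes
  entrezgene = c(1, 2, 10)
)

# Distinct IDs sent as filter values, and distinct IDs present in the result
queried <- unique(m1.entrez.ids$entrez_id)
returned <- unique(m1.biomart$entrezgene)

# IDs that were queried but never came back from getBM()
missing.ids <- setdiff(queried, returned)
length(missing.ids)
missing.ids
```

Having the unmatched IDs in a vector makes it possible to inspect them separately, for example to check how many are pseudogenes.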