grimbough / biomaRt

R package providing query functionality to BioMart instances like Ensembl
https://bioconductor.org/packages/biomaRt/

Wrong results returned for queries of multiple genes #96

Open Mitmischer opened 6 months ago

Mitmischer commented 6 months ago

Hello,

I have a handful of genes whose sequences I want to retrieve. I am using the following code:

library(biomaRt)

genelist=c(
"ENSG00000002549",
"ENSG00000004468",
"ENSG00000005059",
"ENSG00000006062",
"ENSG00000006210",
"ENSG00000007001",
"ENSG00000007171")

ensembl <- useEnsembl(biomart = "genes")
ensembl <- useDataset(dataset = "hsapiens_gene_ensembl", mart = ensembl)

cds_seq = getSequence(id = genelist,
                      type = "ensembl_gene_id", 
                      seqType = "gene_exon_intron",
                      mart = ensembl)[["gene_exon_intron"]]
df  = data.frame(gene=genelist, sequence=cds_seq)

Xfasta <- character(nrow(df) * 2)
Xfasta[c(TRUE, FALSE)] <- paste0(">", df$gene)
Xfasta[c(FALSE, TRUE)] <- df$sequence
writeLines(Xfasta, "genes_gene_exon_intron.fasta")

However, the retrieved sequences are sometimes mapped to the wrong genes. The issue is reproducible for me with a query of just 7 genes (the code given above). Querying one gene at a time works, but it takes about an hour for a moderately sized set of genes (~300) and risks getting your IP banned because of the frequency of requests.

Feel free to compare the results that you get to the correct sequences, which are provided here: genes_reference.txt
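
One possible explanation for the mismatch described above, offered as an assumption rather than a confirmed diagnosis: the rows returned by getSequence() may not come back in the same order as the IDs passed in, while the script pairs genelist with the sequences purely by position. The sketch below shows how rows could be re-aligned using the ID column returned alongside each sequence. It runs offline: `res` is a hypothetical stand-in for the getSequence() result, and the sequence strings are placeholders, not real data.

```r
# Subset of the gene IDs from the script above.
genelist <- c("ENSG00000002549", "ENSG00000004468", "ENSG00000005059")

# Hypothetical stand-in for the getSequence() result: a sequence column
# plus an ID column, with rows deliberately out of input order.
# The sequence values are placeholders, not real sequences.
res <- data.frame(
  gene_exon_intron = c("seq_C", "seq_A", "seq_B"),
  ensembl_gene_id  = c("ENSG00000005059", "ENSG00000002549", "ENSG00000004468"),
  stringsAsFactors = FALSE
)

# Reorder the rows to follow genelist, matching on the returned IDs
# instead of relying on row position.
res <- res[match(genelist, res$ensembl_gene_id), ]

# Build the gene/sequence table from the returned columns only.
df <- data.frame(gene = res$ensembl_gene_id,
                 sequence = res$gene_exon_intron,
                 stringsAsFactors = FALSE)

# Interleave header and sequence lines, as in the original script.
Xfasta <- character(nrow(df) * 2)
Xfasta[c(TRUE, FALSE)] <- paste0(">", df$gene)
Xfasta[c(FALSE, TRUE)] <- df$sequence
```

If this assumption holds, applying the same match() step to the real getSequence() result (keeping the whole data frame instead of extracting only the sequence column) would keep a single batched query while labelling each sequence with the ID it was returned for.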