grimbough / biomaRt

R package providing query functionality to BioMart instances like Ensembl
https://bioconductor.org/packages/biomaRt/
34 stars 13 forks source link

Check & fix all examples #6

Closed lukeholman closed 5 years ago

lukeholman commented 6 years ago

Hey there, The example for biomaRt:: listDatasets does not work. The example reads:

if(interactive()){

#marts <- listMarts()
#index<-grep("ensembl",marts)

#mart <- useMart(marts[index])

#listDatasets(mart = mart)

#martDisconnect(mart = mart)
}

Here's what I get:

> marts <- listMarts()
> marts
               biomart               version
1 ENSEMBL_MART_ENSEMBL      Ensembl Genes 93
2   ENSEMBL_MART_MOUSE      Mouse strains 93
3     ENSEMBL_MART_SNP  Ensembl Variation 93
4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 93
> index<-grep("ensembl",marts)
> index
integer(0)
> mart <- useMart(marts[index])
Error in useMart(marts[index]) : 
  biomart argument is not a string. The biomart argument should be a single character string

I'm not sure why you suggest calling grep on a dataframe, and then indexing a dataframe with a single integer (not e.g. marts[,index]) - perhaps listMarts() used to return a character vector, and now it has changed?

Also, maybe don't use commenting in the examples - it makes it harder to copy-paste them. I guess you wanted to pass the CRAN checks, but you don't need the #s for this.

I'd also really love better worked example. Basically, I would like to get a list of all the gene-GO associations for the species Apis mellifera, and there seems to be no instructions on how to do this. I'm not sure if the pacakges biomaRt and biomartr are related, but the following, where x is a list of >13,000 Entrez gene IDs for this species, yields only ~300 GO terms, which is mysteriously wrong:

GO_tbl <- biomartr::getGO(organism = "Apis mellifera", 
                          genes    = x, filters  = "entrez_gene_id")

Maybe I am using the wrong filter, but it doesn't seem possible to get a list of filters (and then, why does it work a little bit?). Perhaps listing the possible filters in the help file would be a good idea?

grimbough commented 6 years ago

A few different points to address:


Thanks for pointing out the erroneous example. biomaRt is an now an old package and the underlying resources it accesses have changed a bit over the years. I expect these examples worked and made sense at some point, but then started failing and a previous package maintainer commented them out.

I have updated listDatasets, listFilters and listAttributes with examples that now work, and I'll keep this issue open to remind me to work through the other help pages too.


The biomartr package is completely unaffiliated from biomaRt. Despite the name, I believe biomartr was developed to provide a unified interface to a number of different online data stores, not just those running BioMart. If you have questions regarding that package you're better off contacting the package maintainer directly at https://github.com/ropensci/biomartr


If I was trying to list the GO IDs associated with a set of Entrez IDs in biomaRt I would do something like:

bee_mart = useMart("metazoa_mart", 
                   host="metazoa.ensembl.org", 
                   dataset="amellifera_eg_gene")

## substitute for the vector of 13,000 ids
entrez_gene_ids <- c("100576508", "107965534", "726750", "725548", "100576425")

getBM(mart = bee_mart, 
      attributes = c("entrezgene", "go_id"),
      filters = "entrezgene",
      values = entrez_gene_ids)

You can omit the the filters and values arguments to get everything available in the Ensembl mart.

However, looking at the results, I wonder if part of the reason behind seeing only 300 GO terms is that there is not a very comprehensive mapping between Entrez Gene IDs and Ensembl Gene IDs. The following queries will return the total number of records in Ensembl with an Entrez ID, and then the number of records that don't have an Entrez ID.

> nrow(getBM(mart = mart, attributes = "ensembl_gene_id", 
             filters = "with_entrezgene", values = TRUE))
[1] 6028
> nrow(getBM(mart = mart, attributes = "ensembl_gene_id", 
           filters = "with_entrezgene", values = FALSE))
[1] 9825

Clearly the total number of records is in a similar ball park to your 13,000 Entrez IDs, but only ~40% have a mapping between the two ID sets. As for why this is, you'd need to ask Ensembl, but you might have more luck either getting GO terms from a source that used Entrez naively, or annotating your list of genes with Ensembl IDs directly.


In the above example, if I didn't know the specific names of the filters I needed, I could use either listFilters() or searchFilters() to try and find them e.g.

> searchFilters(mart = bee_mart, pattern = "entrez")
              name                      description
14 with_entrezgene             With NCBI gene ID(s)
41      entrezgene NCBI gene ID(s) [e.g. 100049551]

Analogous functions are also present for datasets and attributes.