grimbough / biomaRt

R package providing query functionality to BioMart instances like Ensembl
https://bioconductor.org/packages/biomaRt/

request: find* methods in addition to list*? #5

Open biocyberman opened 6 years ago

biocyberman commented 6 years ago

In many cases, I "list" something just to find the correct name of a dataset or attribute I want to use. The list* methods are useful for exploring what is available, but their output is inconvenient for finding a particular name: it is lengthy, and I have to fall back on grep, quite often with unsatisfactory results.

Here is my current workflow:

ensembl <- useMart("ENSEMBL_MART_ENSEMBL", dataset = "rnorvegicus_gene_ensembl")
listAttributes(ensembl) # and look through the output line by line.
# The RStudio search function helps a bit, but in some cases the output is truncated, so searching in RStudio does not always help.
grep("gene_name", listAttributes(ensembl), value = TRUE) # Try to grep it,
# but in some cases the output is unreadable:
> grep("norvegicus", listDatasets(useEnsembl("ensembl")), value = T)
... <truncated>
"c(\"sboliviensis_gene_ensembl\", \"xtropicalis_gene_ensembl\", \"saraneus_gene_ensembl\", \"nleucogenys_gene_ensembl\", \"vpacos_gene_ensembl\", \"pcoquereli_gene_ensembl\", \"aplatyrhynchos_gene_ensembl\", \"lchalumnae_gene_ensembl\", \"gmorhua_gene_ensembl\", \"panubis_gene_ensembl\", \"pformosa_gene_ensembl\", \"dordii_gene_ensembl\", \"dnovemcinctus_gene_ensembl\", \"mfascicularis_gene_ensembl\", \"clanigera_gene_ensembl\", \"scerevisiae_gene_ensembl\", \"oprinceps_gene_ensembl\", \"acarolinensis_gene_ensembl\", \"mlucifugus_gene_ensembl\", \n\"rroxellana_gene_ensembl\", \"mnemestrina_gene_ensembl\", \"rnorvegicus_gene_ensembl\", \"xmaculatus_gene_ensembl\", \"ggallus_gene_ensembl\", \"csavignyi_gene_ensembl\", \"ngalili_gene_ensembl\", \"oanatinus_gene_ensembl\", \"cintestinalis_gene_ensembl\", \"jjaculus_gene_ensembl\", \"ppaniscus_gene_ensembl\", \"oniloticus_gene_ensembl\", \"hmale_gene_ensembl\", \"psinensis_gene_ensembl\", \"tguttata_gene_ensembl\", \"tnigroviridis_gene_ense... <truncated>

So I want to request find* methods, with wildcards, regex and fuzzy match support:

findDataset("*norvegicus", useEnsembl("ensembl"))
findAttribute(".*gene$", useEnsembl("ensembl"))

This would simplify the workflow and save time.
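As an illustration of the wildcard part of this request, here is a minimal sketch using only base R. It assumes the list* output is an ordinary data frame with a dataset column; the toy data frame below is a stand-in for listDatasets() output, not real query results. glob2rx() from the utils package converts a shell-style wildcard into a regex, and grepl() is applied to a single column instead of the whole data frame.

```r
# Toy stand-in for the data frame returned by listDatasets(). The dataset
# names appear in this thread; the descriptions are illustrative only.
datasets <- data.frame(
  dataset = c("rnorvegicus_gene_ensembl", "ggallus_gene_ensembl",
              "tguttata_gene_ensembl"),
  description = c("Rat genes", "Chicken genes", "Zebra finch genes"),
  stringsAsFactors = FALSE
)

# glob2rx() turns a shell-style wildcard into an equivalent regex.
pattern <- glob2rx("*norvegicus*")

# grepl() on one column returns one logical per row, so the matching rows
# can be subset directly and printed as a readable data frame.
hits <- datasets[grepl(pattern, datasets$dataset), ]
print(hits)
```

Note that the wildcard needs a trailing * here: as a glob, "*norvegicus" alone would only match names that end in "norvegicus".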

grimbough commented 6 years ago

Thanks for the suggestion. I've added the functions searchDatasets(), searchAttributes(), and searchFilters(). These take a mart argument and a pattern, which is a regex string that matches against all columns returned by the appropriate listX function, e.g.

ensemblMart <- useEnsembl("ensembl")
searchDatasets(pattern = "norvegicus", mart = ensemblMart)
                    dataset          description  version
87 rnorvegicus_gene_ensembl Rat genes (Rnor_6.0) Rnor_6.0
ensemblMart <- useDataset(dataset = "rnorvegicus_gene_ensembl", 
                          mart = ensemblMart)
searchFilters(mart = ensemblMart, pattern = "ensembl.*id$")
                    name                                       description
51       ensembl_gene_id       Gene stable ID(s) [e.g. ENSRNOG00000000001]
53 ensembl_transcript_id Transcript stable ID(s) [e.g. ENSRNOT00000000008]
55    ensembl_peptide_id    Protein stable ID(s) [e.g. ENSRNOP00000000008]
57       ensembl_exon_id              Exon ID(s) [e.g. ENSRNOE00000000009]

Let me know if that fits what you're looking for.
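For readers curious how "match against all columns" can work, here is a rough sketch in base R. It is not biomaRt's actual implementation: searchAllColumns is a hypothetical helper, and the filters data frame is a toy stand-in for listFilters() output.

```r
# Hypothetical helper, NOT biomaRt's implementation: keep every row of a
# data frame in which the regex `pattern` matches at least one column.
searchAllColumns <- function(df, pattern) {
  # One logical vector per column, TRUE where that column matches.
  perColumn <- lapply(df, function(col) grepl(pattern, as.character(col)))
  # Elementwise OR across columns: a match anywhere keeps the row.
  keep <- Reduce(`|`, perColumn)
  df[keep, , drop = FALSE]
}

# Toy stand-in for listFilters() output (illustrative rows only).
filters <- data.frame(
  name = c("ensembl_gene_id", "ensembl_transcript_id", "chromosome_name"),
  description = c("Gene stable ID(s) [e.g. ENSRNOG00000000001]",
                  "Transcript stable ID(s) [e.g. ENSRNOT00000000008]",
                  "Chromosome/scaffold name"),
  stringsAsFactors = FALSE
)

result <- searchAllColumns(filters, "ensembl.*id$")
print(result)
```

With this toy data the pattern matches only through the name column, so the two ensembl_*_id rows are returned and chromosome_name is dropped.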

biocyberman commented 6 years ago

Fantastic! I will try and see.

biocyberman commented 6 years ago

@grimbough The functions generally work much better than their list* counterparts. It would be even better if you could also implement a what argument to choose which columns to search. They currently search through all columns; with what we would have finer-grained control over the search.
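A sketch of what such column-restricted searching could look like, in base R. The function name searchColumns and its what parameter are assumptions made for illustration, not part of biomaRt, and the attrs data frame is a toy stand-in for listAttributes() output.

```r
# Hypothetical helper, not a biomaRt function: search only the columns
# named in `what` (default: all columns) for the regex `pattern`.
searchColumns <- function(df, pattern, what = names(df)) {
  # Fail early if a requested column does not exist.
  stopifnot(all(what %in% names(df)))
  # One logical vector per selected column, TRUE where it matches.
  perColumn <- lapply(df[what], function(col) grepl(pattern, as.character(col)))
  # Keep rows that match in at least one of the selected columns.
  df[Reduce(`|`, perColumn), , drop = FALSE]
}

# Toy stand-in for listAttributes() output (illustrative rows only).
attrs <- data.frame(
  name = c("ensembl_gene_id", "external_gene_name", "chromosome_name"),
  description = c("Gene stable ID", "Gene name", "Chromosome/scaffold name"),
  stringsAsFactors = FALSE
)

anyColumn <- searchColumns(attrs, "stable")                # hit via description
nameOnly  <- searchColumns(attrs, "stable", what = "name") # no hit in names
print(anyColumn)
print(nameOnly)
```

Restricting the search to name avoids rows that match only because of their description text, which is the kind of control requested above.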