grimbough / biomaRt

R package providing query functionality to BioMart instances like Ensembl
https://bioconductor.org/packages/biomaRt/

Internal merge after query results in loss of information #47

Closed mcanouil closed 2 years ago

mcanouil commented 3 years ago

Hi,

I did not look at the code, but most likely a merge is performed internally to aggregate the information from different databases, and apparently that merge keeps only records available everywhere (i.e., all = FALSE). This is not a bug per se, but it is definitely not what one expects when requesting multiple attributes, so I think it should be changed. Below is a reproducible example.

library(biomaRt) #  2.46.3
mart <- useEnsembl(biomart = "ensembl", dataset = "mmusculus_gene_ensembl", version =  98)
getBM(
  attributes = c(
    "ensembl_gene_id", 
    # "entrezgene_id",
    # "uniprotswissprot",
    # "chromosome_name", 
    # "start_position", 
    # "end_position", 
    "external_gene_name"
  ),
  filters = "ensembl_gene_id",
  values = list("ENSMUSG00000000031"),
  mart = mart
)
#>      ensembl_gene_id external_gene_name
#> 1 ENSMUSG00000000031                H19

getBM(
  attributes = c(
    "ensembl_gene_id", 
    "entrezgene_id",
    "uniprotswissprot",
    "chromosome_name",
    "start_position",
    "end_position",
    "external_gene_name"
  ),
  filters = "ensembl_gene_id",
  values = list("ENSMUSG00000000031"),
  mart = mart
)
#> [1] ensembl_gene_id    entrezgene_id      uniprotswissprot   chromosome_name    start_position     end_position      
#> [7] external_gene_name
#> <0 rows> (or 0-length row.names)
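
To illustrate the suspected behaviour, here is a minimal sketch using base R's merge() on toy data frames. The two tables and the second gene ID are made up for illustration; this is not biomaRt's or Ensembl's actual code, only the inner versus left join distinction referred to above.

```r
# Toy tables standing in for two annotation sources (hypothetical data).
genes <- data.frame(
  ensembl_gene_id    = c("ENSMUSG00000000031", "ENSMUSG_HYPOTHETICAL"),
  external_gene_name = c("H19", "SomeGene"),
  stringsAsFactors   = FALSE
)
proteins <- data.frame(
  ensembl_gene_id  = "ENSMUSG_HYPOTHETICAL",
  uniprotswissprot = "X00000",  # placeholder accession, not a real one
  stringsAsFactors = FALSE
)

# Default all = FALSE: only keys present in BOTH tables survive (inner join).
inner <- merge(genes, proteins, by = "ensembl_gene_id")

# all.x = TRUE: every gene is kept, missing annotations become NA (left join).
outer <- merge(genes, proteins, by = "ensembl_gene_id", all.x = TRUE)

print(inner)
print(outer)
```

With the default, the H19 row disappears because it has no entry in the second table; with all.x = TRUE it is kept with an NA in uniprotswissprot.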

grimbough commented 3 years ago

Thanks for the report. I'm afraid this behaviour isn't something I can alter in the biomaRt package. It's just an alternative interface to https://www.ensembl.org/biomart/martview which is where the data processing happens.

This link shows the same query with the attributes limited to ensembl_gene_id and uniprotswissprot. It looks like that's sufficient to replicate the problem, and I don't think there's anything client side that can be done about this.

It does seem like odd behaviour, and I'd suggest highlighting the issue with Ensembl directly.
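
This does not change what the server returns, but a user who needs every requested gene in the output could query one attribute at a time and left-join the pieces locally. The following is only a sketch under that assumption: left_join_attributes() and its fetch argument are hypothetical names, not part of biomaRt, and it has not been checked against a live Ensembl mart here.

```r
# Hypothetical helper: fetch each attribute separately and left-join on the key,
# so that an attribute with no match yields NA instead of dropping the row.
# `fetch` is any function(attributes, ids) returning a data.frame.
left_join_attributes <- function(ids, key, attributes, fetch) {
  result <- data.frame(ids, stringsAsFactors = FALSE)
  names(result) <- key
  for (attr in setdiff(attributes, key)) {
    part <- fetch(c(key, attr), ids)
    # all.x = TRUE keeps every requested ID even when `part` has zero rows.
    result <- merge(result, part, by = key, all.x = TRUE)
  }
  result
}

# Possible use with biomaRt (not run here; needs network access and a mart):
# fetch <- function(attrs, ids) {
#   getBM(attributes = attrs, filters = "ensembl_gene_id", values = ids, mart = mart)
# }
# left_join_attributes("ENSMUSG00000000031", "ensembl_gene_id",
#                      c("external_gene_name", "uniprotswissprot"), fetch)
```

The trade-off is one request per attribute, and attributes with several values per gene will multiply rows after the merge.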

mcanouil commented 3 years ago

Thanks for the reply. Maybe it would be worth adding a warning note to the documentation about getBM queries that span multiple tables in Ensembl (even if the issue is not in biomaRt)?

grimbough commented 3 years ago

It's pretty hard to formulate something specific to say about it; the architecture of the underlying database and the query engine are opaque to me. For example, the following asks for three ID types and is quite happy to give back an NA cell (it's empty in the browser) for UCSC when it doesn't find one.

library(biomaRt) 
mart <- useEnsembl(biomart = "ensembl", dataset = "mmusculus_gene_ensembl")
getBM(
  attributes = c(
    "ensembl_gene_id", 
    "entrezgene_id",
    "ucsc"
  ),
  filters = "ensembl_gene_id",
  values = list("ENSMUSG00000000031"),
  mart = mart
)
#>      ensembl_gene_id entrezgene_id ucsc
#> 1 ENSMUSG00000000031         14955   NA

I don't want to write something that discourages useful queries, but it's also impossible to know how many user queries might have been discarded as returning nothing, when a change in attributes might have yielded something.
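
For anyone who gets an unexpectedly empty result, one rough way to see whether a change in attributes would have yielded something is to count the rows returned for each attribute on its own. This is a sketch only: rows_per_attribute() and its fetch argument are hypothetical names rather than biomaRt functions.

```r
# Hypothetical diagnostic: for each non-key attribute, query it together with
# the key and record how many rows come back. A count of 0 points at an
# attribute that empties the combined query.
# `fetch` is any function(attributes, ids) returning a data.frame.
rows_per_attribute <- function(ids, key, attributes, fetch) {
  extra <- setdiff(attributes, key)
  vapply(extra, function(attr) nrow(fetch(c(key, attr), ids)), integer(1))
}

# Possible use with biomaRt (not run here; needs network access and a mart):
# fetch <- function(attrs, ids) {
#   getBM(attributes = attrs, filters = "ensembl_gene_id", values = ids, mart = mart)
# }
# rows_per_attribute("ENSMUSG00000000031", "ensembl_gene_id",
#                    c("entrezgene_id", "uniprotswissprot", "ucsc"), fetch)
```

The result is a named integer vector, one entry per attribute, so the attributes returning zero rows can be dropped or queried separately.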

mcanouil commented 3 years ago

I understand. It's also opaque to me, especially since I don't often use this API to get annotations. Users have to experiment to discover how the API behaves for their particular queries ^^ I guess we can close this issue ;)