grimbough / biomaRt

R package providing query functionality to BioMart instances like Ensembl
https://bioconductor.org/packages/biomaRt/

Chromosome name filter doesn't recognize letters and numbers #103

Closed rbutleriii closed 2 months ago

rbutleriii commented 3 months ago

If I filter my getBM results by chromosome_name, it works for numeric names only or for letter names only, but not for both together (e.g. 1-19 combined with X, Y and MT):

library(data.table)
library(biomaRt)

mouse <- useEnsembl(biomart = "genes", dataset = "mmusculus_gene_ensembl")
chrs = c(as.character(1:19), "X", "Y", "MT")

b <- data.table(
  getBM(mart = mouse, 
    filter = "chromosome_name", 
    values = chrs, 
    attributes = c(
      "ensembl_gene_id", 
      "external_gene_name", 
      "chromosome_name", 
      "hsapiens_homolog_ensembl_gene", 
      "hsapiens_homolog_associated_gene_name"
    )
  )
)

table(b$chromosome_name)
#    1    2    3    4    5    6    7    9   10   11   12   13   14   15   16   17   18   19
# 3740 4073 3035 3937 3987 3413 5752  597 2923 3326 3048 2670 2696 2011 1746 2448 1477 1486

b <- data.table(
  getBM(mart = mouse, 
    filter = "chromosome_name", 
    values = chrs[20:22], 
    attributes = c(
      "ensembl_gene_id", 
      "external_gene_name", 
      "chromosome_name", 
      "hsapiens_homolog_ensembl_gene", 
      "hsapiens_homolog_associated_gene_name"
    )
  )
)
table(b$chromosome_name)
#   MT    X    Y
#   37 2954 1782

> sessionInfo()
R version 4.3.3 (2024-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] data.table_1.15.0 biomaRt_2.56.1

loaded via a namespace (and not attached):
 [1] rappdirs_0.3.3          utf8_1.2.4              generics_0.1.3          bitops_1.0-7            xml2_1.3.4
 [6] RSQLite_2.3.1           stringi_1.8.3           hms_1.1.3               digest_0.6.34           magrittr_2.0.3
[11] fastmap_1.1.1           blob_1.2.4              progress_1.2.2          AnnotationDbi_1.62.1    GenomeInfoDb_1.36.1
[16] DBI_1.1.3               httr_1.4.7              purrr_1.0.2             fansi_1.0.6             XML_3.99-0.14
[21] Biostrings_2.68.1       cli_3.6.2               rlang_1.1.3             crayon_1.5.2            dbplyr_2.3.2
[26] XVector_0.42.0          Biobase_2.60.0          bit64_4.0.5             withr_3.0.0             cachem_1.0.8
[31] tools_4.3.3             memoise_2.0.1           dplyr_1.1.4             GenomeInfoDbData_1.2.10 filelock_1.0.2
[36] BiocGenerics_0.48.1     curl_5.2.0              vctrs_0.6.5             R6_2.5.1                png_0.1-8
[41] stats4_4.3.3            lifecycle_1.0.4         BiocFileCache_2.8.0     zlibbioc_1.48.0         KEGGREST_1.40.0
[46] stringr_1.5.1           S4Vectors_0.40.2        IRanges_2.36.0          bit_4.0.5               pkgconfig_2.0.3
[51] pillar_1.9.0            glue_1.7.0              tibble_3.2.1            tidyselect_1.2.0        compiler_4.3.3
[56] prettyunits_1.1.1       RCurl_1.98-1.12

grimbough commented 2 months ago

I'm not sure why this is happening, but it isn't due to the mix of numeric and character names for the chromosomes. If you combine just a subset of possible options it seems to work and returns numbers consistent with what you've shown:

library(biomaRt)

mouse <- useEnsembl(biomart = "genes", dataset = "mmusculus_gene_ensembl")

b <- getBM(mart = mouse, 
        filter = "chromosome_name", 
        values = c("19", "MT"), 
        attributes = c(
          "ensembl_gene_id", 
          "external_gene_name", 
          "chromosome_name", 
          "hsapiens_homolog_ensembl_gene", 
          "hsapiens_homolog_associated_gene_name"
        ), useCache = FALSE
  )
table(b$chromosome_name)
#> 
#>   19   MT 
#> 1486   37

It seems like it may be related to the number of values you're using to filter on, although I'll agree it's suspicious that it breaks right at the divide between numbers and letters.

If I run the same query with all chromosome names in the Ensembl web interface I also find the "X", "Y", "MT" results missing. This suggests it's an issue with the Ensembl BioMart itself, rather than the biomaRt package. There's very little I can do if the server doesn't send a complete set of results back.

As a workaround you can always try running this as two separate queries and combining the results. It's unsatisfactory, but it looks like it works. I'd also suggest contacting the Ensembl helpdesk (https://www.ensembl.org/Help/Contact) and reporting the problem. Feel free to link to this GitHub issue to demonstrate the problem.
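
For anyone landing here, the split-and-combine workaround could be sketched roughly as below. This is only a sketch: `query_in_batches()`, `missing_values()` and `get_by_chromosome()` are hypothetical helpers, not part of biomaRt, and the default batch size of 1 (one query per chromosome) is a conservative guess, since the cause of the truncated results is not known.

```r
# Hypothetical helper (not part of biomaRt): split the filter values into
# batches, run one query per batch, and stack the resulting data frames.
query_in_batches <- function(values, batch_size, query_fun) {
  # Batch index for each value: 1, 1, ..., 2, 2, ...
  batch_id <- ceiling(seq_along(values) / batch_size)
  batches <- split(values, batch_id)
  results <- lapply(batches, query_fun)
  combined <- do.call(rbind, results)
  rownames(combined) <- NULL
  combined
}

# Hypothetical helper: which requested values never appear in the result?
missing_values <- function(requested, returned) {
  setdiff(requested, unique(returned))
}

# Wrapper around the query from this issue. Calling it needs network access
# and the biomaRt package; `mart` is the object returned by useEnsembl().
# batch_size = 1 is an assumption, not a documented server limit.
get_by_chromosome <- function(mart, chrs, batch_size = 1) {
  query_in_batches(chrs, batch_size, function(batch) {
    biomaRt::getBM(
      mart = mart,
      filters = "chromosome_name",
      values = batch,
      attributes = c(
        "ensembl_gene_id",
        "external_gene_name",
        "chromosome_name",
        "hsapiens_homolog_ensembl_gene",
        "hsapiens_homolog_associated_gene_name"
      )
    )
  })
}
```

With the `mouse` mart and `chrs` vector from the original report, `b <- get_by_chromosome(mouse, chrs)` followed by `missing_values(chrs, b$chromosome_name)` should return an empty vector if every requested chromosome came back. A non-empty result means the server still dropped something.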