grimbough / biomaRt

R package providing query functionality to BioMart instances like Ensembl
https://bioconductor.org/packages/biomaRt/

`getBM()` attribute header/value mismatch #108

Open karlmakepeace opened 1 month ago


I am trying to retrieve various Ensembl "sequences" page attributes and noticed that the returned values do not line up with their column headers. For example, in gene_sequences_attributes_subset the "3utr" column is filled with "TP53" (which belongs in the "external_gene_name" column). Likewise, the "external_gene_name" column is filled with "ensembl_gene_id" values (an attribute that was not even requested in the attributes argument of biomaRt::getBM()).

Similarly, the attribute headers and values in gene_sequences_attributes_all appear scrambled.

# {biomaRt} bug attributes header/value mismatch examples #---------------------
# install.packages("tibble")

mart <- biomaRt::useEnsembl(
  biomart = "genes",
  version = "112", # latest as of 2024-08-13
  dataset = "hsapiens_gene_ensembl")

attributes <- biomaRt::listAttributes(
  mart = mart,
  page = "sequences",
  what = "name")

gene_sequences_attributes_all <- biomaRt::getBM(
  mart = mart,
  attributes = attributes,
  filters = c("external_gene_name"),
  values = list(c("TP53")))

gene_sequences_attributes_subset <- biomaRt::getBM(
  mart = mart,
  attributes = c("external_gene_name", "5utr","3utr"),
  filters = c("external_gene_name"),
  values = list(c("TP53")))

# Inspect in console #----------------------------------------------------------
gene_sequences_attributes_all |> tibble::as_tibble()
# # A tibble: 30 × 60
#    transcript_exon_intron gene_exon_intron transcript_flank   gene_flank      
#    <chr>                  <chr>            <chr>              <chr>           
#  1 Sequence unavailable   ENSG00000141510  ENSG00000141510.19 tumor protein p…
#  2 Sequence unavailable   ENSG00000141510  ENSG00000141510.19 tumor protein p…
#  3 Sequence unavailable   ENSG00000141510  ENSG00000141510.19 tumor protein p…
#  4 Sequence unavailable   ENSG00000141510  ENSG00000141510.19 tumor protein p…
#  5 Sequence unavailable   ENSG00000141510  ENSG00000141510.19 tumor protein p…
#  6 Sequence unavailable   ENSG00000141510  ENSG00000141510.19 tumor protein p…
#  7 Sequence unavailable   ENSG00000141510  ENSG00000141510.19 tumor protein p…
#  8 Sequence unavailable   ENSG00000141510  ENSG00000141510.19 tumor protein p…
#  9 Sequence unavailable   ENSG00000141510  ENSG00000141510.19 tumor protein p…
# 10 Sequence unavailable   ENSG00000141510  ENSG00000141510.19 tumor protein p…
# # ℹ 20 more rows
# # ℹ 56 more variables: coding_transcript_flank <chr>,
# #   coding_gene_flank <chr>, `5utr` <chr>, `3utr` <chr>, gene_exon <chr>,
# #   cdna <chr>, coding <chr>, peptide <chr>, upstream_flank <chr>,
# #   downstream_flank <chr>, ensembl_gene_id <chr>,
# #   ensembl_gene_id_version <chr>, description <chr>,
# #   external_gene_name <chr>, external_gene_source <chr>, …
# # ℹ Use `print(n = ...)` to see more rows

gene_sequences_attributes_subset |> tibble::as_tibble()
# # A tibble: 20 × 3
#    `3utr` external_gene_name `5utr`                                           
#    <chr>  <chr>              <chr>                                            
#  1 TP53   ENSG00000141510    CCCCATGTTCCTGGCTAGCCAAGGAACCACCAGTTGATTAGCAGAGAA…
#  2 TP53   ENSG00000141510    GGCGCTAAAAGTTTTGAGCTTCTCAAAAGTCTAGAGCCACCGTCCAGG…
#  3 TP53   ENSG00000141510    CTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACAC…
#  4 TP53   ENSG00000141510    TGAGGCCAGGAGATGGAGGCTGCAGTGAGCTGTGATCACACCACTGTG…
#  5 TP53   ENSG00000141510    Sequence unavailable                             
#  6 TP53   ENSG00000141510    AAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGG…
#  7 TP53   ENSG00000141510    TGAGGCCAGGAGATGGAGGCTGCAGTGAGCTGTGATCACACCACTGTG…
#  8 TP53   ENSG00000141510    CTCAAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTC…
#  9 TP53   ENSG00000141510    AAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGG…
# 10 TP53   ENSG00000141510    AAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGG…
# 11 TP53   ENSG00000141510    CTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACAC…
# 12 TP53   ENSG00000141510    CTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACAC…
# 13 TP53   ENSG00000141510    TTCGGGCTGGGAGCGTGCTTTCCACGACGGTGACACGCTTCCCTGGAT…
# 14 TP53   ENSG00000141510    TTCGGGCTGGGAGCGTGCTTTCCACGACGGTGACACGCTTCCCTGGAT…
# 15 TP53   ENSG00000141510    TCTCAAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCT…
# 16 TP53   ENSG00000141510    GTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAGTTTTGAGCTT…
# 17 TP53   ENSG00000141510    TTTGTAATGCAGGGCTGAGGAGTGTCCGAAGAGAATGGGCAGCAGCCA…
# 18 TP53   ENSG00000141510    GGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAGTT…
# 19 TP53   ENSG00000141510    AAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGG…
# 20 TP53   ENSG00000141510    CTAGAGCTTTTGGGGAAGAGGGAGTGGTTGTTAAGAGATGAGATTAAA…
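
The mismatch in the output above can also be checked programmatically instead of by eye. The following is a minimal sketch, not part of biomaRt: `columns_filled_with()` is a hypothetical helper, and `mock_subset` is a hand-built stand-in for `gene_sequences_attributes_subset` (two rows, sequences shortened), so it runs without querying Ensembl.

```r
# Hypothetical helper (not a biomaRt function): return the names of the
# columns of `df` in which every non-missing value equals `value`.
columns_filled_with <- function(df, value) {
  is_match <- vapply(
    df,
    function(column) {
      values <- as.character(column)
      values <- values[!is.na(values)]
      length(values) > 0 && all(values == value)
    },
    logical(1)
  )
  names(df)[is_match]
}

# Assumption: a small stand-in mirroring the header/value layout printed
# above for `gene_sequences_attributes_subset`; sequences are shortened.
mock_subset <- data.frame(
  `3utr` = c("TP53", "TP53"),
  external_gene_name = c("ENSG00000141510", "ENSG00000141510"),
  `5utr` = c("CCCCATG", "Sequence unavailable"),
  check.names = FALSE
)

# The queried gene symbol should sit under "external_gene_name";
# here it is found under "3utr" instead.
columns_filled_with(mock_subset, "TP53")
```

Running the same helper on the real `gene_sequences_attributes_subset` data frame should show which header the "TP53" values ended up under.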