grimbough / biomaRt

R package providing query functionality to BioMart instances like Ensembl
https://bioconductor.org/packages/biomaRt/

`getBM()` error "... more columns than column names" #109

Open karlmakepeace opened 1 month ago

karlmakepeace commented 1 month ago

I am trying to retrieve the 5' and 3' UTR attributes ("5utr" and "3utr") from the Ensembl sequences attribute page using getBM(), but I encounter the following error:

Error in read.table(text = postRes, sep = "\t", header = TRUE, quote = quote, : more columns than column names

If I include an additional attribute (e.g. "external_gene_name"), then getBM() returns results without error, although there is still a header/value mismatch as described in issue #108:

```r
# {biomaRt} error "... more columns than column names" #------------------------
# install.packages("tibble")

mart <- biomaRt::useEnsembl(
  biomart = "genes",
  version = "112", # latest as of 2024-08-13
  dataset = "hsapiens_gene_ensembl")

gene_sequences_attributes_subset_1 <- biomaRt::getBM(
  mart = mart,
  attributes = c("external_gene_name", "5utr", "3utr"),
  filters = c("external_gene_name"),
  values = list(c("TP53")))

gene_sequences_attributes_subset_1 |> tibble::as_tibble()
# # A tibble: 20 × 3
#    `3utr` external_gene_name `5utr`
#    <chr>  <chr>              <chr>
#  1 TP53   ENSG00000141510    CCCCATGTTCCTGGCTAGCCAAGGAACCACCAGTTGATTAGCAGAG…
#  2 TP53   ENSG00000141510    GGCGCTAAAAGTTTTGAGCTTCTCAAAAGTCTAGAGCCACCGTCCA…
#  3 TP53   ENSG00000141510    CTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGAC…
#  4 TP53   ENSG00000141510    TGAGGCCAGGAGATGGAGGCTGCAGTGAGCTGTGATCACACCACTG…
#  5 TP53   ENSG00000141510    Sequence unavailable
#  6 TP53   ENSG00000141510    AAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCC…
#  7 TP53   ENSG00000141510    TGAGGCCAGGAGATGGAGGCTGCAGTGAGCTGTGATCACACCACTG…
#  8 TP53   ENSG00000141510    CTCAAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGC…
#  9 TP53   ENSG00000141510    AAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCC…
# 10 TP53   ENSG00000141510    AAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCC…
# 11 TP53   ENSG00000141510    CTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGAC…
# 12 TP53   ENSG00000141510    CTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGAC…
# 13 TP53   ENSG00000141510    TTCGGGCTGGGAGCGTGCTTTCCACGACGGTGACACGCTTCCCTGG…
# 14 TP53   ENSG00000141510    TTCGGGCTGGGAGCGTGCTTTCCACGACGGTGACACGCTTCCCTGG…
# 15 TP53   ENSG00000141510    TCTCAAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGG…
# 16 TP53   ENSG00000141510    GTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAGTTTTGAGC…
# 17 TP53   ENSG00000141510    TTTGTAATGCAGGGCTGAGGAGTGTCCGAAGAGAATGGGCAGCAGC…
# 18 TP53   ENSG00000141510    GGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAG…
# 19 TP53   ENSG00000141510    AAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCC…
# 20 TP53   ENSG00000141510    CTAGAGCTTTTGGGGAAGAGGGAGTGGTTGTTAAGAGATGAGATTA…

gene_sequences_attributes_subset_2 <- biomaRt::getBM(
  mart = mart,
  attributes = c("5utr", "3utr"),
  filters = c("external_gene_name"),
  values = list(c("TP53")))
# Error in read.table(text = postRes, sep = "\t", header = TRUE, quote = quote,  :
#                       more columns than column names
```

grimbough commented 4 weeks ago

I think both this and #108 are because you're asking for both 5utr and 3utr. If you look at the web interface for the sequence attributes, you'll see that those options are a radio button and you can only select one of them:

[image: sequence attribute options shown as radio buttons in the web interface]

Unfortunately, the BioMart API doesn't provide any way to detect which attributes are mutually exclusive like this, so I can't detect and filter them in biomaRt. It seems the server is also happy to run a query even if what comes back doesn't reflect exactly what was asked for.
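Since the API cannot report this exclusivity, one option is a check on the caller's side before the query is sent. The following is only a minimal sketch: `sequence_attributes` and `check_single_sequence_attribute()` are hypothetical names, not part of biomaRt, and the attribute list is hand-maintained, assumed from the `seqType` values documented for `getSequence()`.

```r
# Hand-maintained list of sequence attributes. This is an assumption based on
# the seqType values documented for getSequence(); BioMart does not report it.
sequence_attributes <- c(
  "5utr", "3utr", "cdna", "coding", "peptide",
  "gene_exon", "transcript_exon",
  "gene_exon_intron", "transcript_exon_intron",
  "gene_flank", "transcript_flank",
  "coding_gene_flank", "coding_transcript_flank")

# Hypothetical helper: stop early if more than one sequence attribute is
# requested, otherwise return the attributes unchanged (invisibly).
check_single_sequence_attribute <- function(attributes,
                                            known = sequence_attributes) {
  requested <- intersect(attributes, known)
  if (length(requested) > 1) {
    stop("Only one sequence attribute can be requested per query; got: ",
         paste(requested, collapse = ", "), call. = FALSE)
  }
  invisible(attributes)
}
```

Calling `check_single_sequence_attribute(c("5utr", "3utr"))` raises an error, while a vector containing at most one of the listed attributes passes through unchanged. The check is only as good as the list, which would need updating by hand if a mart offers other sequence attributes.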

The getSequence() function does a similar job to what you're looking for and will fail if you ask for more than one sequence type, but I'm not sure there's any way I can catch this in generic calls to getBM(). I think it's a server-side issue that a query like this is allowed to run at all if it isn't viable.
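Given that only one sequence type can be requested at a time, a possible workaround is to run one query per sequence attribute and join the results afterwards. The sketch below assumes that a query with a single sequence attribute returns correctly labelled columns, which this thread does not confirm. `get_sequences_separately()` and its `query_fun` argument are hypothetical names introduced for illustration; `query_fun` defaults to `biomaRt::getBM` and exists only so that the joining logic can be tried without a server.

```r
# Hypothetical helper (not part of biomaRt): fetch each sequence attribute in
# its own query, then join the results on a shared ID attribute.
get_sequences_separately <- function(mart, values,
                                     seq_attributes = c("5utr", "3utr"),
                                     id_attribute = "ensembl_transcript_id",
                                     filters = "external_gene_name",
                                     query_fun = biomaRt::getBM) {
  results <- lapply(seq_attributes, function(seq_attribute) {
    res <- query_fun(
      mart = mart,
      attributes = c(id_attribute, seq_attribute),
      filters = filters,
      values = values)
    # Select columns by name, not position, so the order in which the
    # columns come back does not matter.
    res[, c(id_attribute, seq_attribute), drop = FALSE]
  })
  # Full outer join, so IDs missing one sequence type are kept with NA.
  Reduce(function(x, y) merge(x, y, by = id_attribute, all = TRUE), results)
}

# Usage (requires network access and the biomaRt package):
# mart <- biomaRt::useEnsembl(
#   biomart = "genes",
#   dataset = "hsapiens_gene_ensembl")
# utrs <- get_sequences_separately(mart, values = list(c("TP53")))
```

Joining on the transcript ID rather than the gene name keeps each 5' UTR paired with the 3' UTR of the same transcript, since a gene such as TP53 has many transcripts.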