grimbough / biomaRt

R package providing query functionality to BioMart instances like Ensembl
https://bioconductor.org/packages/biomaRt/

getLDS() errors in scan and 'both datasets must be located on the same host' #30

Closed yuewangpanda closed 3 years ago

yuewangpanda commented 3 years ago

Hi there,

I am trying to build a mapping table between rat and human genes. My original code is:

human <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
rat <- useMart("ensembl", dataset = "rnorvegicus_gene_ensembl")
map <- biomaRt::getLDS(attributes = c("ensembl_gene_id", "rgd_symbol"),
                       mart = rat,
                       attributesL = c("ensembl_gene_id", "hgnc_symbol"), 
                       martL = human, 
                       uniqueRows = TRUE)

This chunk suddenly stopped working today. I found a similar issue here. Following the suggestion in that issue, I first updated the biomaRt package to 2.45.9 and then revised my code to:

human <- useEnsembl("ensembl", dataset = "hsapiens_gene_ensembl")
rat <- useEnsembl("ensembl", dataset = "rnorvegicus_gene_ensembl")
map <- getLDS(attributes = c("ensembl_gene_id", "rgd_symbol"),
              mart = rat,
              attributesL = c("ensembl_gene_id", "hgnc_symbol"),
              martL = human,
              uniqueRows = TRUE)

This time, the error is "both datasets must be located on the same host". Then I tried changing my code to:

human <- useEnsembl("ensembl", dataset = "hsapiens_gene_ensembl", host = "www.ensembl.org")
rat <- useEnsembl("ensembl", dataset = "rnorvegicus_gene_ensembl", host = "www.ensembl.org")
map <- getLDS(attributes = c("ensembl_gene_id", "rgd_symbol"),
              mart = rat,
              attributesL = c("ensembl_gene_id", "hgnc_symbol"),
              martL = human,
              uniqueRows = TRUE)

The scan error came up again: "Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 1 did not have 3 elements".

My working environment:

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] grid      parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] rnaseqToolbox_0.0.86 limma_3.40.6         here_0.1             msigdbr_7.2.1        biomaRt_2.45.9      
 [6] forcats_0.5.0        stringr_1.4.0        purrr_0.3.4          ggbeeswarm_0.6.0     pheatmap_1.0.12     
[11] haven_2.3.1          VennDiagram_1.6.20   futile.logger_1.4.3  foreach_1.5.1        DT_0.16             
[16] readxl_1.3.1         rmarkdown_2.4        glue_1.4.2           ggpubr_0.4.0         knitr_1.30          
[21] magrittr_1.5         readr_1.4.0          tibble_3.0.4         tidyr_1.1.2          rlang_0.4.8         
[26] dplyr_1.0.2          plotly_4.9.2.1       ggplot2_3.3.2        shiny_1.5.0          Biobase_2.44.0      
[31] BiocGenerics_0.30.0 

loaded via a namespace (and not attached):
  [1] utf8_1.1.4             tidyselect_1.1.0       RSQLite_2.2.1          AnnotationDbi_1.46.1   htmlwidgets_1.5.2     
  [6] BiocParallel_1.18.1    Rtsne_0.15             munsell_0.5.0          codetools_0.2-16       withr_2.3.0           
 [11] colorspace_1.4-1       rstudioapi_0.11        stats4_3.6.1           ggsignif_0.6.0         TTR_0.24.2            
 [16] NMF_0.23.0             labeling_0.3           GenomeInfoDbData_1.2.1 KMsurv_0.1-5           farver_2.0.3          
 [21] bit64_4.0.5            rprojroot_1.3-2        vctrs_0.3.4            generics_0.0.2         lambda.r_1.2.4        
 [26] xfun_0.18              R6_2.4.1               doParallel_1.0.15      GenomeInfoDb_1.20.0    clue_0.3-57           
 [31] locfit_1.5-9.4         bitops_1.0-6           fgsea_1.10.1           assertthat_0.2.1       promises_1.1.1        
 [36] scales_1.1.1           nnet_7.3-14            forecast_8.13          beeswarm_0.2.3         gtable_0.3.0          
 [41] Cairo_1.5-12.2         processx_3.4.4         timeDate_3043.102      GlobalOptions_0.1.2    splines_3.6.1         
 [46] rstatix_0.6.0          lazyeval_0.2.2         broom_0.7.1            checkmate_2.0.0        BiocManager_1.30.10   
 [51] yaml_2.2.1             reshape2_1.4.4         abind_1.4-5            crosstalk_1.1.0.1      backports_1.1.10      
 [56] httpuv_1.5.4           quantmod_0.4.17        tools_3.6.1            gridBase_0.4-7         ellipsis_0.3.1        
 [61] RColorBrewer_1.1-2     Rcpp_1.0.5             plyr_1.8.6             progress_1.2.2         zlibbioc_1.30.0       
 [66] RCurl_1.98-1.2         ps_1.4.0               prettyunits_1.1.1      viridis_0.5.1          GetoptLong_1.0.3      
 [71] fracdiff_1.5-1         cowplot_1.1.0          S4Vectors_0.22.1       zoo_1.8-8              ggrepel_0.8.2         
 [76] cluster_2.1.0          data.table_1.13.0      futile.options_1.0.1   openxlsx_4.2.2         circlize_0.4.10       
 [81] lmtest_0.9-38          survminer_0.4.8        hms_0.5.3              mime_0.9               evaluate_0.14         
 [86] xtable_1.8-4           XML_3.99-0.3           rio_0.5.16             IRanges_2.18.3         gridExtra_2.3         
 [91] shape_1.4.5            compiler_3.6.1         crayon_1.3.4           htmltools_0.5.0        later_1.1.0.1         
 [96] DBI_1.1.0              formatR_1.7            dbplyr_1.4.4           ComplexHeatmap_2.0.0   Matrix_1.2-18         
[101] car_3.0-10             cli_2.1.0              quadprog_1.5-8         GenomicRanges_1.36.1   pkgconfig_2.0.3       
[106] km.ci_0.5-2            registry_0.5-1         foreign_0.8-71         vipor_0.4.5            rngtools_1.5          
[111] webshot_0.5.2          pkgmaker_0.31.1        XVector_0.24.0         bibtex_0.4.2.3         callr_3.5.1           
[116] digest_0.6.25          cellranger_1.1.0       fastmatch_1.1-0        survMisc_0.5.5         edgeR_3.26.8          
[121] curl_4.3               urca_1.3-0             rjson_0.2.20           lifecycle_0.2.0        nlme_3.1-149          
[126] jsonlite_1.7.1         tseries_0.10-47        carData_3.0-4          viridisLite_0.3.0      fansi_0.4.1           
[131] pillar_1.4.6           lattice_0.20-41        pkgbuild_1.1.0         fastmap_1.0.1          httr_1.4.2            
[136] survival_3.2-7         remotes_2.2.0          xts_0.12.1             zip_2.1.1              png_0.1-7             
[141] iterators_1.0.13       bit_4.0.4              stringi_1.5.3          blob_1.2.1             memoise_1.1.0   

Not sure how to fix this problem. Any help will be appreciated.

Thanks, Frank

grimbough commented 3 years ago

I'm afraid I can't reproduce this issue with the latest version of biomaRt.

Your original code works fine for me.

library(biomaRt)
packageVersion('biomaRt')
#> [1] ‘2.45.9’

human <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
rat <- useMart("ensembl", dataset = "rnorvegicus_gene_ensembl")
map_1 <- biomaRt::getLDS(attributes = c("ensembl_gene_id", "rgd_symbol"),
                       mart = rat,
                       attributesL = c("ensembl_gene_id", "hgnc_symbol"), 
                       martL = human, 
                       uniqueRows = TRUE)

head(map_1)
#>       Gene.stable.ID RGD.symbol Gene.stable.ID.1 HGNC.symbol
#> 1 ENSRNOG00000018338       Vwa1  ENSG00000179403        VWA1
#> 2 ENSRNOG00000023627     Cfap74  ENSG00000142609      CFAP74
#> 3 ENSRNOG00000036869    Tmem88b  ENSG00000205116     TMEM88B
#> 4 ENSRNOG00000031766     Mt-cyb  ENSG00000198727      MT-CYB
#> 5 ENSRNOG00000021802      Isg15  ENSG00000187608       ISG15
#> 6 ENSRNOG00000035650    Mir3548  ENSG00000207607     MIR200A

The version using useEnsembl() also seems to work.

human <- useEnsembl("ensembl", dataset = "hsapiens_gene_ensembl")
rat <- useEnsembl("ensembl", dataset = "rnorvegicus_gene_ensembl")
map_2 <- getLDS(attributes = c("ensembl_gene_id", "rgd_symbol"),
              mart = rat,
              attributesL = c("ensembl_gene_id", "hgnc_symbol"), 
              martL = human, 
              uniqueRows = TRUE)

We can't use identical() to compare the results directly, as BioMart doesn't return rows in a predictable order, but the returned objects have the same dimensions and all gene IDs from one are present in the other.

dim(map_1)
#> [1] 25677     4
dim(map_2)
#> [1] 25677     4
all(map_1$Gene.stable.ID %in% map_2$Gene.stable.ID)
#> [1] TRUE
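
As a slightly stricter check than the one above, here is a minimal sketch that sorts both results by every column before comparing them, so row order no longer matters. The helper names sort_rows() and same_rows() are hypothetical (they are not part of biomaRt), and the toy data frames stand in for map_1 and map_2.

```r
# Hypothetical helpers, not part of biomaRt.

# Sort a data frame by all of its columns and reset the row names, so that
# two results differing only in row order become directly comparable.
sort_rows <- function(df) {
  df <- df[do.call(order, unname(as.list(df))), , drop = FALSE]
  rownames(df) <- NULL
  df
}

# TRUE if both data frames hold the same rows, regardless of order.
same_rows <- function(a, b) {
  identical(dim(a), dim(b)) && isTRUE(all.equal(sort_rows(a), sort_rows(b)))
}

# Toy data standing in for two query results returned in different orders.
a <- data.frame(id = c("g1", "g2", "g3"),
                sym = c("A", "B", "C"),
                stringsAsFactors = FALSE)
b <- a[c(3, 1, 2), ]

identical(a, b)
#> [1] FALSE
same_rows(a, b)
#> [1] TRUE
```

With the real results this would be same_rows(map_1, map_2), assuming both were retrieved with the same attributes and therefore share column names.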

Did you get messages like "Ensembl site unresponsive, trying uswest mirror" when creating the marts? It's possible that the automatic mirror selection picked two different mirrors for your human and rat marts, which would lead to the "both datasets must be located on the same host" error. If that's the case, I would try creating one of the mart objects again and pay attention to which mirror it finally ends up on. You can check by looking in the host slot of your two mart objects, where the URL needs to be the same.

human@host
#> [1] "https://www.ensembl.org:443/biomart/martservice"
rat@host
#> [1] "https://www.ensembl.org:443/biomart/martservice"
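
The host comparison can also be automated. Below is a minimal sketch, assuming you want both marts on one named mirror: it passes the same value to the mirror argument of useEnsembl() and stops early if the two host slots still differ (for example, if one request fell back to another mirror). check_same_host() and paired_marts() are hypothetical helper names, not biomaRt functions, and nothing contacts Ensembl until paired_marts() is actually called.

```r
# Hypothetical helper, not part of biomaRt: fail early with a readable
# message if two mart host URLs differ.
check_same_host <- function(host_a, host_b) {
  if (!identical(host_a, host_b)) {
    stop("marts are on different hosts: ", host_a, " vs ", host_b)
  }
  invisible(TRUE)
}

# Hypothetical wrapper: request both marts from the same mirror, then verify
# that they really ended up on the same host before returning them.
# Valid mirror values are "www", "uswest", "useast" and "asia".
paired_marts <- function(mirror = "www") {
  human <- biomaRt::useEnsembl("ensembl",
                               dataset = "hsapiens_gene_ensembl",
                               mirror = mirror)
  rat <- biomaRt::useEnsembl("ensembl",
                             dataset = "rnorvegicus_gene_ensembl",
                             mirror = mirror)
  check_same_host(human@host, rat@host)
  list(human = human, rat = rat)
}
```

The returned list could then be used as mart = marts$rat and martL = marts$human in the getLDS() call.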

The version using host = "www.ensembl.org" will fail, as Ensembl seem to have switched to requiring https to access their server. Currently you have to set this explicitly (host = "https://www.ensembl.org") if you provide the host argument to useMart(). If you don't, your query gets redirected and returns the Ensembl homepage, which biomaRt can't parse, so it throws the scan error you're seeing.
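
For anyone who does need to pass host explicitly, here is a minimal sketch of the https form. with_https() and https_marts() are hypothetical helpers, not biomaRt functions; the only biomaRt call used is useMart() with its host argument, and no request is made until https_marts() is called.

```r
# Hypothetical helper, not part of biomaRt: make sure a host string carries
# an explicit https:// scheme.
with_https <- function(host) {
  host <- sub("^http://", "", host)   # drop a plain-http scheme if present
  if (!grepl("^https://", host)) {
    host <- paste0("https://", host)
  }
  host
}

# Hypothetical wrapper: create both marts against the same https host.
https_marts <- function(host = "www.ensembl.org") {
  url <- with_https(host)
  list(
    human = biomaRt::useMart("ensembl",
                             dataset = "hsapiens_gene_ensembl",
                             host = url),
    rat = biomaRt::useMart("ensembl",
                           dataset = "rnorvegicus_gene_ensembl",
                           host = url)
  )
}

with_https("www.ensembl.org")
#> [1] "https://www.ensembl.org"
```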

yuewangpanda commented 3 years ago

Thank you for your response. Very helpful. I tried my code today and it worked. So I guess it was a connection issue.