grimbough / biomaRt

R package providing query functionality to BioMart instances like Ensembl
https://bioconductor.org/packages/biomaRt/
34 stars 13 forks source link

Error in getLDS function when applied to mmc57bl6nj_gene_ensembl database #21

Closed Sophia409 closed 3 years ago

Sophia409 commented 4 years ago

Hello,

Thank you for your work in this package.

I wanted to find a gene ortholog table between mouse and Macaca fascicularis. Unfortunately, I met some difficulties and could't find the solution.

First, I'm wondering why getLDS function failed when I changed biomart database from mmusculus_gene_ensembl to mmc57bl6nj_gene_ensembl. I searched the attributes list of mmc57bl6nj_gene_ensembl and confirmed the existence of mgi_symbol.

Second, what is the difference between database mmusculus_gene_ensembl and mmc57bl6nj_gene_ensembl? My sc-RNAseq data derived from C57BL6 mice. So in my view, mmc57bl6nj_gene_ensembl should be suitable for my analysis though most of code I searched on the internet chose mmc57bl6nj_gene_ensembl for analysis. What's your advice on it?

Third, mouse gene Fabp7 could't be translated to macaque ortholog genes. This really troubled me because I'm sure this gene (FABP7) was expressed in my macaque sc-RNAseq data. What's the reason?

I'm looking forward to your reply. Sophia

>   mouse = useMart("ensembl", dataset = "mmusculus_gene_ensembl")
>   macaque = useMart("ensembl", dataset = "mfascicularis_gene_ensembl")
>   x <- c("Fabp7", "Tlx3", "Gzmb")
>   genesV2 = getLDS(attributes = c("mgi_symbol"), filters = "mgi_symbol", values = x , mart = mouse, attributesL = c("hgnc_symbol"), martL = macaque, uniqueRows=T)
>   macaquex <- unique(genesV2[, 2])
>   head(macaquex)
[1] "TLX3" "GZMB"
>   # macaque = useMart("ensembl")
>   # datasets <- listDatasets(macaque)
>   mouse = useMart("ENSEMBL_MART_MOUSE", dataset = "mmc57bl6nj_gene_ensembl")
>   genesV2 = getLDS(attributes = c("mgi_symbol"), filters = "mgi_symbol", values = x , mart = mouse, attributesL = c("hgnc_symbol"), martL = macaque, uniqueRows=T)
Error in getLDS(attributes = c("mgi_symbol"), filters = "mgi_symbol",  : 
  Query ERROR: caught BioMart::Exception::Query: returning undef ... missing attributes for your exportable? 
>   mouse.attributes <- listAttributes(mouse)
>   grep("mgi_symbol",mouse.attributes$name,value = TRUE)
[1] "mgi_symbol"

> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=Chinese (Simplified)_China.936  LC_CTYPE=Chinese (Simplified)_China.936   
[3] LC_MONETARY=Chinese (Simplified)_China.936 LC_NUMERIC=C                              
[5] LC_TIME=Chinese (Simplified)_China.936    

attached base packages:
 [1] splines   stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] biomaRt_2.42.1              loomR_0.2.1.9000            hdf5r_1.3.1                 R6_2.4.1                   
 [5] doRNG_1.8.2                 rngtools_1.5                doParallel_1.0.15           iterators_1.0.12           
 [9] foreach_1.4.8               SCopeLoomR_0.5.0            RcisTarget_1.6.0            AUCell_1.8.0               
[13] SCENIC_1.1.2-2              set_1.1                     openxlsx_4.1.4              data.table_1.12.8          
[17] enrichplot_1.6.1            gridExtra_2.3               gtools_3.8.1                DoubletFinder_2.0.2        
[21] org.Mm.eg.db_3.10.0         AnnotationDbi_1.48.0        clusterProfiler_3.14.3      scales_1.1.0               
[25] ggThemeAssist_0.1.5         scater_1.14.6               ggsci_2.9                   ape_5.3                    
[29] pheatmap_1.0.12             plyr_1.8.5                  cowplot_1.0.0               SingleCellExperiment_1.8.0 
[33] SummarizedExperiment_1.16.1 DelayedArray_0.12.2         BiocParallel_1.20.1         matrixStats_0.55.0         
[37] GenomicRanges_1.38.0        GenomeInfoDb_1.22.0         IRanges_2.20.2              S4Vectors_0.24.3           
[41] monocle_2.14.0              DDRTree_0.1.5               irlba_2.3.3                 VGAM_1.1-2                 
[45] Biobase_2.46.0              BiocGenerics_0.32.0         Matrix_1.2-18               RColorBrewer_1.1-2         
[49] corrplot_0.84               ggcorrplot_0.1.3            forcats_0.5.0               stringr_1.4.0              
[53] dplyr_0.8.4                 purrr_0.3.3                 readr_1.3.1                 tidyr_1.0.2                
[57] tibble_2.1.3                ggplot2_3.2.1               tidyverse_1.3.0             leiden_0.3.3               
[61] ggthemes_4.2.0              Seurat_3.1.5                reticulate_1.14            

loaded via a namespace (and not attached):
  [1] rappdirs_0.3.1           R.methodsS3_1.8.0        bit64_0.9-7              R.utils_2.9.2           
  [5] RCurl_1.98-1.1           generics_0.0.2           RSQLite_2.2.0            RANN_2.6.1              
  [9] europepmc_0.3            combinat_0.0-8           future_1.16.0            bit_1.1-15.2            
 [13] xml2_1.2.2               lubridate_1.7.4          httpuv_1.5.2             assertthat_0.2.1        
 [17] viridis_0.5.1            hms_0.5.3                promises_1.1.0           fansi_0.4.1             
 [21] progress_1.2.2           caTools_1.18.0           dbplyr_1.4.2             readxl_1.3.1            
 [25] igraph_1.2.4.2           DBI_1.1.0                htmlwidgets_1.5.1        sparsesvd_0.2           
 [29] feather_0.3.5            backports_1.1.5          annotate_1.64.0          RcppParallel_4.4.4      
 [33] vctrs_0.2.3              ROCR_1.0-7               withr_2.1.2              ggforce_0.3.1           
 [37] triebeard_0.3.0          sctransform_0.2.1        prettyunits_1.1.1        cluster_2.1.0           
 [41] DOSE_3.12.0              lazyeval_0.2.2           crayon_1.3.4             labeling_0.3            
 [45] pkgconfig_2.0.3          slam_0.1-47              tweenr_1.0.1             nlme_3.1-144            
 [49] vipor_0.4.5              rlang_0.4.4              globals_0.12.5           lifecycle_0.2.0         
 [53] miniUI_0.1.1.1           BiocFileCache_1.10.2     modelr_0.1.6             rsvd_1.0.3              
 [57] cellranger_1.1.0         polyclip_1.10-0          lmtest_0.9-37            graph_1.64.0            
 [61] urltools_1.7.3           zoo_1.8-7                reprex_0.3.0             beeswarm_0.2.3          
 [65] ggridges_0.5.2           png_0.1-7                viridisLite_0.3.0        bitops_1.0-6            
 [69] R.oo_1.23.0              KernSmooth_2.23-16       blob_1.2.1               DelayedMatrixStats_1.8.0
 [73] qvalue_2.18.0            gridGraphics_0.5-0       memoise_1.1.0            GSEABase_1.48.0         
 [77] magrittr_1.5             ica_1.0-2                gplots_3.0.3             gdata_2.18.0            
 [81] zlibbioc_1.32.0          compiler_3.6.3           HSMMSingleCell_1.6.0     lsei_1.2-0              
 [85] fitdistrplus_1.0-14      cli_2.0.2                XVector_0.26.0           listenv_0.8.0           
 [89] patchwork_1.0.1          pbapply_1.4-2            formatR_1.7              MASS_7.3-51.5           
 [93] tidyselect_1.0.0         stringi_1.4.6            densityClust_0.3         GOSemSim_2.12.0         
 [97] askpass_1.1              BiocSingular_1.2.2       ggrepel_0.8.1            grid_3.6.3              
[101] fastmatch_1.1-0          tools_3.6.3              future.apply_1.4.0       rstudioapi_0.11         
[105] farver_2.0.3             Rtsne_0.15               ggraph_2.0.1             digest_0.6.25           
[109] rvcheck_0.1.8            BiocManager_1.30.10      FNN_1.1.3                shiny_1.4.0.1           
[113] monocle3_0.2.1           qlcMatrix_0.9.7          Rcpp_1.0.3               broom_0.5.5             
[117] later_1.0.0              RcppAnnoy_0.0.14         httr_1.4.1               npsurv_0.4-0            
[121] colorspace_1.4-1         rvest_0.3.5              XML_3.99-0.3             fs_1.3.1                
[125] uwot_0.1.5               graphlayouts_0.5.0       ggplotify_0.0.5          plotly_4.9.2            
[129] xtable_1.8-4             jsonlite_1.6.1           tidygraph_1.1.2          pillar_1.4.3            
[133] htmltools_0.4.0          mime_0.9                 glue_1.3.1               fastmap_1.0.1           
[137] BiocNeighbors_1.4.2      codetools_0.2-16         fgsea_1.12.0             tsne_0.1-3              
[141] lattice_0.20-38          curl_4.3                 ggbeeswarm_0.6.0         openssl_1.4.1           
[145] zip_2.0.4                GO.db_3.10.0             survival_3.1-8           limma_3.42.2            
[149] docopt_0.6.1             fastICA_1.2-2            munsell_0.5.0            DO.db_2.9               
[153] GenomeInfoDbData_1.2.2   haven_2.2.0              reshape2_1.4.3           gtable_0.3.0   
grimbough commented 4 years ago

Hi Sophia,

1) The reason that getLDS() fails with mmc57bl6nj_gene_ensembl is because you're trying to use two different Marts (think of each Mart as a collection of datasets) and you can only use getLDS() to run joint queries on datasets that are in the same Mart.

Notice that in the query that works you use useMart("ensembl", ...) both times, but for the C57BL6 strain you have to use useMart("ENSEMBL_MART_MOUSE", ...). This first argument is the mart, and it needs to be the same if you want to run getLDS(). You can check the available datasets in a Mart with listDatasets(). Here we can see the "ENSEMBL_MART_MOUSE" only contains information on mice, and so it's not possible to query between macaque and a specific mouse strain:

mouse = useMart("ENSEMBL_MART_MOUSE", dataset = "mmc57bl6nj_gene_ensembl")
listDatasets(mouse)
#>                      dataset                              description
#> 1  mm129s1svimj_gene_ensembl Mouse 129S1/SvImJ genes (129S1_SvImJ_v1)
#> 2          mmaj_gene_ensembl                 Mouse A/J genes (A_J_v1)
#> 3        mmakrj_gene_ensembl             Mouse AKR/J genes (AKR_J_v1)
#> 4      mmbalbcj_gene_ensembl         Mouse BALB/cJ genes (BALB_cJ_v1)
#> 5      mmc3hhej_gene_ensembl         Mouse C3H/HeJ genes (C3H_HeJ_v1)
#> 6    mmc57bl6nj_gene_ensembl     Mouse C57BL/6NJ genes (C57BL_6NJ_v1)
#> 7     mmcasteij_gene_ensembl       Mouse CAST/EiJ genes (CAST_EiJ_v1)
#> 8        mmcbaj_gene_ensembl             Mouse CBA/J genes (CBA_J_v1)
#> 9       mmdba2j_gene_ensembl           Mouse DBA/2J genes (DBA_2J_v1)
#> 10      mmfvbnj_gene_ensembl           Mouse FVB/NJ genes (FVB_NJ_v1)
#> 11        mmlpj_gene_ensembl               Mouse LP/J genes (LP_J_v1)
#> 12  mmnodshiltj_gene_ensembl   Mouse NOD/ShiLtJ genes (NOD_ShiLtJ_v1)
#> 13   mmnzohlltj_gene_ensembl     Mouse NZO/HlLtJ genes (NZO_HlLtJ_v1)
#> 14     mmpwkphj_gene_ensembl         Mouse PWK/PhJ genes (PWK_PhJ_v1)
#> 15     mmwsbeij_gene_ensembl         Mouse WSB/EiJ genes (WSB_EiJ_v1)

2) The mmusculus_gene_ensembl dataset is a generic 'mouse' genome, where as the mmc57bl6nj_gene_ensembl is specifically for that strain of mice. If you were working with and comparing data from multiple mice strains you might want to use the Ensembl dataset specific to those strains, as the genetic differences between strains is presumably quite small and you may miss things if relying on the 'average' mouse dataset. However, it looks like you're comparing mouse with macaque and I would have thought that the standard mouse genome would be sufficient to find a mapping between those two.

3) I think the issue with Fabp7 being dropped is probably related to using the hgnc_symbol attribute for Macaque. Typically HGNC symbols related to the Human Genome in the same way that MGI symbols are Mouse Genome. I don't know how/why Ensembl assigned some HGNC symbols to Macque genes, but it looks like it's not a comprehensive assignment. You might have more luck if you use the external_gene_name attribute instead e.g.

library(biomaRt)
mouse = useMart("ensembl", dataset = "mmusculus_gene_ensembl")
macaque = useMart("ensembl", dataset = "mfascicularis_gene_ensembl")
x <- c("Fabp7", "Tlx3", "Gzmb")
genesV2 = getLDS(attributes = c("mgi_symbol"), filters = "mgi_symbol", values = x , mart = mouse, 
                 attributesL = c("external_gene_name"), martL = macaque, uniqueRows=T)
genesV2
#>   MGI.symbol Gene.name
#> 1       Tlx3      TLX3
#> 2      Fabp7     FABP7
#> 3       Gzmb      GZMB