lawremi / rtracklayer

R interface to genome annotation files and the UCSC genome browser

Can't create a browser session for "hs1" or "mpxvRivers" genomes #90

Closed hpages closed 8 months ago

hpages commented 10 months ago

Setting the session genome to hs1 or mpxvRivers fails, even though both are valid UCSC genomes:

library(rtracklayer)

## hs1 and mpxvRivers are valid genomes:
c("hs1", "mpxvRivers") %in% ucscGenomes()$db
# [1] TRUE TRUE

session <- browserSession()

genome(session) <- "hs1"
# Error in `genome<-`(`*tmp*`, value = "hs1") : 
#    Failed to set session genome to 'hs1'

genome(session) <- "mpxvRivers"
# Error in `genome<-`(`*tmp*`, value = "mpxvRivers") : 
#   Failed to set session genome to 'mpxvRivers'

I'm actually not sure that this has ever worked.

Works fine for other UCSC genomes:

genome(session) <- "hg38"     # OK
genome(session) <- "xenTro9"  # OK
genome(session) <- "wuhCor1"  # OK
genome(session) <- "ochPri3"  # OK
etc...

Note that hs1 and mpxvRivers are the latest additions to the UCSC Genome Browser and it could be that the UCSC folks decided to do things a little bit differently with these 2 genomes.

Thanks, H.

sessionInfo():

R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 23.04

Matrix products: default
BLAS:   /home/hpages/R/R-4.3.0/lib/libRblas.so 
LAPACK: /home/hpages/R/R-4.3.0/lib/libRlapack.so;  LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] rtracklayer_1.61.1     GenomicFeatures_1.53.2 AnnotationDbi_1.63.2  
[4] Biobase_2.61.0         GenomicRanges_1.53.1   GenomeInfoDb_1.37.4   
[7] IRanges_2.35.2         S4Vectors_0.39.1       BiocGenerics_0.47.0   

loaded via a namespace (and not attached):
 [1] KEGGREST_1.41.0             SummarizedExperiment_1.31.1
 [3] rjson_0.2.21                lattice_0.21-8             
 [5] RMariaDB_1.2.2              vctrs_0.6.3                
 [7] tools_4.3.0                 bitops_1.0-7               
 [9] generics_0.1.3              curl_5.0.2                 
[11] parallel_4.3.0              tibble_3.2.1               
[13] fansi_1.0.4                 RSQLite_2.3.1              
[15] blob_1.2.4                  pkgconfig_2.0.3            
[17] Matrix_1.6-1                dbplyr_2.3.3               
[19] lifecycle_1.0.3             GenomeInfoDbData_1.2.10    
[21] compiler_4.3.0              stringr_1.5.0              
[23] Rsamtools_2.17.0            Biostrings_2.69.2          
[25] progress_1.2.2              codetools_0.2-19           
[27] RCurl_1.98-1.12             yaml_2.3.7                 
[29] pillar_1.9.0                crayon_1.5.2               
[31] BiocParallel_1.35.4         DelayedArray_0.27.10       
[33] cachem_1.0.8                abind_1.4-5                
[35] tidyselect_1.2.0            digest_0.6.33              
[37] stringi_1.7.12              dplyr_1.1.3                
[39] restfulr_0.0.15             grid_4.3.0                 
[41] biomaRt_2.57.1              fastmap_1.1.1              
[43] SparseArray_1.1.12          cli_3.6.1                  
[45] magrittr_2.0.3              S4Arrays_1.1.6             
[47] XML_3.99-0.14               utf8_1.2.3                 
[49] prettyunits_1.1.1           filelock_1.0.2             
[51] rappdirs_0.3.3              bit64_4.0.5                
[53] XVector_0.41.1              httr_1.4.7                 
[55] matrixStats_1.0.0           bit_4.0.5                  
[57] png_0.1-8                   hms_1.1.3                  
[59] memoise_2.0.1               BiocIO_1.11.0              
[61] BiocFileCache_2.9.1         rlang_1.1.1                
[63] Rcpp_1.0.11                 glue_1.6.2                 
[65] DBI_1.1.3                   xml2_1.3.5                 
[67] R6_2.5.1                    MatrixGenerics_1.13.1      
[69] GenomicAlignments_1.37.0    zlibbioc_1.47.0            

lawremi commented 10 months ago

@sanchit-saini, is this something you could look at?

sanchit-saini commented 10 months ago

Usually the genome name that UCSC reports for a session matches the genome name provided by the user. However, that is not the case for these two genomes.

library(rtracklayer)

session <- browserSession()
# by default genome is set to hg38  
genome(session)
# [1] "hg38"

# explicitly setting up the genome to xenTro9
responseTxt <- rtracklayer:::ucscGet(session, "gateway", list(db = "xenTro9"))
genome(session)
# [1] "xenTro9"

# explicitly setting up the genome to mpxvRivers
responseTxt <- rtracklayer:::ucscGet(session, "gateway", list(db = "mpxvRivers"))
genome(session)
# [1] "hub_581817_mpxvRivers"

responseTxt <- rtracklayer:::ucscGet(session, "gateway", list(db = "hs1"))
genome(session)
# [1] "hub_567047_hs1"

To validate the assignment, rtracklayer compares the genome name returned by UCSC with the genome name provided by the user:

https://github.com/lawremi/rtracklayer/blob/ab6876b8d723947ec813385dbc4350e49521d916/R/ucsc.R#L128

To handle this, I think we could change the check: instead of testing the two values for equality, we could test case-insensitively whether the genome name returned by UCSC contains the user-provided genome name.

This should handle the majority of cases, assuming the UCSC genome name always contains the original genome name.

Possible changes:

value <- "mpxvRivers"
# genome(session) is hub_581817_mpxvRivers
if (!grepl(tolower(value), tolower(genome(session)), fixed = TRUE))
    stop("Failed to set session genome to '", value, "'")
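The same test can be tried without a live UCSC session by wrapping it in a small function. This is only a sketch: `genomeContains` is a hypothetical name, not an rtracklayer function, and "mm1" is a made-up identifier used to show a partial match.

```r
# Hypothetical helper (not part of rtracklayer): TRUE when `returned`
# contains `requested`, ignoring case.
genomeContains <- function(requested, returned) {
    grepl(tolower(requested), tolower(returned), fixed = TRUE)
}

genomeContains("mpxvRivers", "hub_581817_mpxvRivers")  # TRUE
genomeContains("hs1", "hub_567047_hs1")                # TRUE
genomeContains("hg38", "xenTro9")                      # FALSE
# A substring test also accepts partial matches:
genomeContains("mm1", "mm10")                          # TRUE
```

The last call shows the trade-off of a substring test: it accepts the hub-prefixed names, but it would also accept a requested name that is merely a prefix of a different genome name.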

We also need to convert the genome name so that other Bioconductor functions can understand it. In our case, this only needs to be handled in one place:

https://github.com/lawremi/rtracklayer/blob/ab6876b8d723947ec813385dbc4350e49521d916/R/ucsc.R#L322

Seqinfo() will treat the genome name hub_581817_mpxvRivers as invalid. However, the correct genome name can be extracted from the string, assuming the format of the string stays consistent for other cases too:

if (grepl("_", genome)) {
    genome <- unlist(strsplit(genome, "_"))
    genome <- genome[length(genome)]
}

After extracting the genome name with the above code, we can make the call to Seqinfo().

Now, I am not sure whether this naming convention issue is temporary or not. If it is not, I will create a PR with these changes to handle it.

lawremi commented 9 months ago

I would prefer to avoid any heuristics, like a contains check. Perhaps there is a way to extract the genome identifier other than the table name.

lawremi commented 9 months ago

The best approach for now is to strip off the hub_XXX_ prefix. You can do that with gsub(".*_", "", x). That's assuming there are no genomes that have "_" as part of their identifier.
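A minimal sketch of that prefix stripping, using base R only. Both function names are hypothetical, and the stricter `hub_<digits>_` pattern is an assumption based on the two names seen in this thread, not a documented UCSC format. "foo_bar" is a made-up identifier containing an underscore.

```r
# Hypothetical helpers, not rtracklayer functions.

# Variant suggested above: drop everything up to the last "_".
stripToLastUnderscore <- function(x) gsub(".*_", "", x)

# Stricter variant (assumed prefix format): drop only a leading
# "hub_<digits>_" and leave any other underscore alone.
stripHubPrefix <- function(x) sub("^hub_[0-9]+_", "", x)

ids <- c("hub_581817_mpxvRivers", "hub_567047_hs1", "hg38")
stripToLastUnderscore(ids)
# [1] "mpxvRivers" "hs1"        "hg38"
stripHubPrefix(ids)
# [1] "mpxvRivers" "hs1"        "hg38"

# The two differ only when the identifier itself contains "_":
stripToLastUnderscore("foo_bar")
# [1] "bar"
stripHubPrefix("foo_bar")
# [1] "foo_bar"
```

If no genome identifier ever contains "_", the two variants behave the same. The anchored pattern only matters if that assumption turns out not to hold.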