Closed hpages closed 8 months ago
@sanchit-saini , is this something you could look at?
Usually UCSC genome names matches with provided genome name, However it is not the case with these two genomes.
library(rtracklayer)
session <- browserSession()
# by default genome is set to hg38
genome(session)
# [1] "hg38"
# explicitly setting up the genome to xenTro9
responseTxt <- rtracklayer:::ucscGet(session, "gateway", list(db = "xenTro9"))
genome(session)
# [1] "xenTro9"
# explicitly setting up the genome to mpxvRivers
responseTxt <- rtracklayer:::ucscGet(session, "gateway", list(db = "mpxvRivers"))
genome(session)
# [1] "hub_581817_mpxvRivers"
responseTxt <- rtracklayer:::ucscGet(session, "gateway", list(db = "hs1"))
genome(session)
# [1] "hub_567047_hs1"
To validate rtracklayer is comparing the genome name returned from the UCSC and the genome name which is provided by the user.
https://github.com/lawremi/rtracklayer/blob/ab6876b8d723947ec813385dbc4350e49521d916/R/ucsc.R#L128
I think to improve and handle it, we could change the checking expression instead of simply comparing the values, we could case-insensitively check whether the returned UCSC genome name contains the user provided genome name in it.
By doing this, I think we will be able to handle the majority of the cases, assuming UCSC genome is going to contain the original genome name.
Possible changes:
value <- "mpxvRivers"
# genome(session) is hub_581817_mpxvRivers
if (!grepl(tolower(value), tolower(genome(session)), fixed = TRUE))
stop("Failed to set session genome to '", value, "'")
Also, we need to handle the genome name conversion, so it could be understood by other Bioconductor functions. In our case, we only need it to handle it in one place.
https://github.com/lawremi/rtracklayer/blob/ab6876b8d723947ec813385dbc4350e49521d916/R/ucsc.R#L322
According to SeqInfo
genome name hub_581817_mpxvRivers
is going to be invalid, However, correct genome name could extract from the string assuming the format of the string is going to be consistent for other cases too.
if (grepl("_", genome)) {
genome <- unlist(strsplit(genome, "_"))
genome <- genome[length(genome)]
}
After extracting the genome with the above code, we could make a call to the SeqInfo()
.
Now, I am not sure whether this naming convention issue is temporary or not. If it is not, I will create a PR with these changes to handle it.
I would prefer to avoid any heuristics, like a contains check. Perhaps there is a way to extract the genome identifier other than the table name.
The best approach for now is to strip off the hub_XXX_
prefix. You can do that with gsub(".*_", "", x)
. That's assuming there are no genomes that have "_" as part of their identifier.
Despite the fact that
hs1
andmpxvRivers
are valid UCSC genomes:I'm actually not sure that this has ever worked.
Works fine for other UCSC genomes:
Note that
hs1
andmpxvRivers
are the latest additions to the UCSC Genome Browser and it could be that the UCSC folks decided to do things a little bit differently with these 2 genomes.Thanks, H.
sessionInfo():