lauradmartens opened this issue 1 year ago (status: Open)
In bioframe, we started doing this by providing an alias dictionary that maps all name variants (including GenBank IDs) to a single canonical name. Keeping track of naming "styles" for each provider and each species gets unwieldy, especially once ancillary scaffolds (unlocalized, unplaced, alt) are considered.
https://bioframe.readthedocs.io/en/latest/guide-io.html#curated-genome-assembly-build-information
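To make the alias-dictionary idea concrete, here is a minimal sketch in plain Python. It does not use bioframe's actual API; the table `NAME_VARIANTS` and the helpers `build_alias_dict` and `to_canonical` are hypothetical names, and the three rows are illustrative entries that should be checked against the real assembly report before use.

```python
# Hypothetical, hand-written table (not bioframe's data): each row lists the
# names one sequence goes by. The first entry is treated as the canonical name.
NAME_VARIANTS = [
    ("chr1", "1", "CM000663.2", "NC_000001.11"),
    ("chrX", "X", "CM000685.2", "NC_000023.11"),
    ("chrM", "MT", "J01415.2", "NC_012920.1"),
]


def build_alias_dict(rows):
    """Map every name variant (including the canonical one) to the canonical name."""
    alias = {}
    for canonical, *others in rows:
        for name in (canonical, *others):
            # Refuse ambiguous tables where one name points at two sequences.
            if name in alias and alias[name] != canonical:
                raise ValueError(f"{name!r} maps to both {alias[name]!r} and {canonical!r}")
            alias[name] = canonical
    return alias


ALIAS = build_alias_dict(NAME_VARIANTS)


def to_canonical(name, alias=ALIAS):
    """Return the canonical name for any known variant, or raise KeyError."""
    try:
        return alias[name]
    except KeyError:
        raise KeyError(f"unknown sequence name: {name!r}") from None
```

With a single many-to-one mapping like this, callers never need to know which provider's "style" an input name came from; they only need the canonical name on the way out.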
@nvictus, you investigated this a bunch during the hackathon. It sounded like we ended up at:
GenomeInfoDb probably has the info we want, but doesn't really make it accessible
Right?
What did GenomeInfoDb provide that bioframe doesn't? I would imagine you've covered some of the most common cases already.
ensembldb lets the user set the seqlevels style like this: `seqlevelsStyle(edb) <- "UCSC"`. Maybe we could do something similar via bioframe's assembly info?
EnsemblDB(connection, seq_style=bioframe.assembly_info(...))
Description of feature
Add functionality that allows translation between different chromosome sequence naming conventions (e.g., "chr1" versus "1").
This could be similar to the `seqlevelsStyle` function in the R package GenomeInfoDb: `seqlevelsStyle(gr_obj) = "UCSC"`
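As a rough illustration of what such a translation could look like on the Python side, the sketch below converts a list of names to a target style. Everything in it is an assumption for the sake of the example: `STYLE_TABLE` and `convert_style` are hypothetical names, the table covers only four sequences, and it is not how GenomeInfoDb or any existing package implements this.

```python
# Hypothetical style table: position i in every list refers to the same sequence.
STYLE_TABLE = {
    "UCSC": ["chr1", "chr2", "chrX", "chrM"],
    "Ensembl": ["1", "2", "X", "MT"],
}


def convert_style(names, target, table=STYLE_TABLE):
    """Translate sequence names from any known style into the target style."""
    if target not in table:
        raise ValueError(f"unknown style: {target!r}")
    # Index every known name, whatever its style, by its row position.
    position = {}
    for style_names in table.values():
        for i, name in enumerate(style_names):
            position[name] = i
    # A name missing from the table raises KeyError rather than passing through silently.
    return [table[target][position[name]] for name in names]


# Usage: rename Ensembl-style names to UCSC style.
ucsc_names = convert_style(["1", "X", "MT"], "UCSC")
```

Failing loudly on unknown names is a deliberate choice in this sketch; ancillary scaffolds that exist in only one provider's naming would otherwise be mistranslated or dropped without notice.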