scverse / genomic-features

Genomic Features in Python from BioConductor's AnnotationHub
https://genomic-features.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Conversion of seqlevel styles #7

Open lauradmartens opened 1 year ago

lauradmartens commented 1 year ago

Description of feature

Add functionality that allows translation between different chromosome sequence naming conventions (e.g., "chr1" versus "1").

This could work like the seqlevelsStyle function in the R package GenomeInfoDb:

seqlevelsStyle(gr_obj) <- "UCSC"
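For illustration, here is a minimal Python sketch of what such a conversion might look like. The function name `to_style` and the style labels are hypothetical and not part of genomic-features or GenomeInfoDb. It assumes only primary chromosomes and two styles, "UCSC" ("chr1", "chrM") and "Ensembl" ("1", "MT"), so scaffold names would need a real lookup table.

```python
# Hypothetical helper: not an existing genomic-features or GenomeInfoDb API.
def to_style(names, style):
    """Convert primary chromosome names to "UCSC" (chr1) or "Ensembl" (1) style."""
    if style not in ("UCSC", "Ensembl"):
        raise ValueError(f"unknown style: {style!r}")
    converted = []
    for name in names:
        # Strip a leading "chr" so both styles share one bare form.
        bare = name[3:] if name.startswith("chr") else name
        # The mitochondrial genome is "chrM" in UCSC but "MT" in Ensembl.
        if bare in ("M", "MT"):
            bare = "M" if style == "UCSC" else "MT"
        converted.append("chr" + bare if style == "UCSC" else bare)
    return converted


print(to_style(["1", "X", "MT"], "UCSC"))
print(to_style(["chr1", "chrM"], "Ensembl"))
```

A prefix rule like this breaks on unlocalized or unplaced scaffolds, whose names differ between providers in ways that cannot be derived from each other.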

nvictus commented 3 months ago

In bioframe, we started doing this by providing an alias dictionary that maps all variants (including GenBank IDs) to a single canonical name. Keeping track of naming "styles" for each provider and each species gets unwieldy, especially when ancillary scaffolds (unlocalized, unplaced, alt) are considered.

https://bioframe.readthedocs.io/en/latest/guide-io.html#curated-genome-assembly-build-information
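The alias-dictionary idea can be sketched roughly as follows. This is not bioframe's actual data structure or API: `ALIASES` and `canonicalize` are hypothetical names, the table is a hand-written two-sequence subset for human GRCh38/hg38, and UCSC-style names are assumed to be the canonical form.

```python
# Illustrative subset only; a real table would cover every sequence in the assembly.
# Each known variant (UCSC, Ensembl, GenBank, RefSeq) points at one canonical name.
ALIASES = {
    "chr1": "chr1", "1": "chr1", "CM000663.2": "chr1", "NC_000001.11": "chr1",
    "chrM": "chrM", "MT": "chrM", "J01415.2": "chrM", "NC_012920.1": "chrM",
}


def canonicalize(names, aliases=ALIASES):
    """Map each sequence name to its canonical form, failing loudly on unknown names."""
    unknown = [name for name in names if name not in aliases]
    if unknown:
        raise KeyError(f"no alias known for: {unknown}")
    return [aliases[name] for name in names]


print(canonicalize(["1", "NC_012920.1", "chr1"]))
```

Because every variant maps directly to one target, the lookup needs no notion of a per-provider "style", and scaffolds are handled by adding entries, not rules.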

ivirshup commented 3 months ago

@nvictus, you investigated this a bunch during the hackathon. It sounded like we ended up at:

GenomeInfoDb probably has the info we want, but doesn't really make it accessible

Right?

What did GenomeInfoDb provide that bioframe doesn't? I would imagine you've covered some of the most common cases already.

ensembldb lets the user set the seqlevelsStyle like this: seqlevelsStyle(edb) <- "UCSC". Maybe we could do something similar via bioframe's assembly info?

EnsemblDB(connection, seq_style=bioframe.assembly_info(...))