jorainer / ensembldb

This is the ensembldb development repository.
https://jorainer.github.io/ensembldb
33 stars 10 forks source link

Compatibility between `EnsDb` objects and `BSgenome` objects #159

Open mschubert opened 3 months ago

mschubert commented 3 months ago

I'm trying to use the VariantAnnotation package to annotate a VCF file from nf-core/sarek (UCSC style) using AnnotationHub EnsDb objects (Ensembl style), where I encountered the following issues:

  1. The VariantAnnotation package does not accept EnsDb objects. I raised this (https://github.com/Bioconductor/VariantAnnotation/pull/74), but it will probably not get fixed; I worked around this by providing my own S4 method in my code
  2. The seqlevelsStyles() do not match

For point (2), I can change the EnsDb style to UCSC:

ens106 = AnnotationHub::AnnotationHub()[["AH100643"]]
seqlevelsStyle(ens106) = "UCSC"

and then either load UCSC-style genome or change the Ensembl-style genome to UCSC:

asm = BSgenome.Hsapiens.NCBI.GRCh38::BSgenome.Hsapiens.NCBI.GRCh38
seqlevelsStyle(asm) = "UCSC"

asm = BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38

However, this does not work, because the genome() of the EnsDb object is GRCh38, while the one of the assembly is hg38, raising an assertion error in VariantAnnotation. So this needs an additional line changing the internal state of the S4 genome object (we can't change the EnsDb object):

genome(ens106)[] = "hg38"
# Error: unable to find an inherited method for function 'seqinfo<-' for signature 'x = "EnsDb"'

asm@seqinfo@genome[] = "GRCh38" # works, but is messing with internals
# this was previously the default, but they explicitly changed it

I'm not sure what a good solution is here. It seems to be that a check if they genomes are identical (as performed by VariantAnnotation) is reasonable. I'm raising this issue more to document it rather than suggesting a change in ensembldb.

jorainer commented 2 months ago

Thanks for reporting. Is there no way to create or load a BSgenome or other supported genome sequence object from AnnotationHub? There should also be the twobit files from Ensembl accessible through AnnotationHub - maybe these can be used? The advantage would be that you could ensure to use matching EnsDb and genome sequence from the same Ensembl release. I'm just a bit worried that hg38 is not exactly identical to the GRCh38 version used by Ensembl...