jorainer / ensembldb

This is the ensembldb development repository.
https://jorainer.github.io/ensembldb

Build a human hg38 EnsDb of an older Ensembl release (v79) #99

Closed ccwang002 closed 5 years ago

ccwang002 commented 5 years ago

I was wondering if you can help build the human hg38 EnsDb of Ensembl release v79.

We need this specific version because the data of most cancer consortia are processed on the NCI Genomic Data Commons (GDC), which uses GENCODE v22 as its gene annotation. GENCODE v22 should be equivalent to Ensembl v79 based on the comments in its GTF.

I tried to convert GDC's GTF directly into a TxDb-compatible SQLite database but I couldn't get it working. GenomicFeatures::makeTxDbFromGFF() dropped a lot of valuable information, including gene symbol and biotype, making the resulting TxDb less useful. ensembldb::ensDbFromGtf() failed at the index creation step:

ensDbFromGtf(
    GDC_GTF_PTH, 'EnsDb_GDC_gencode_v22.sqlite',
    organism = "Homo_sapiens",
    genomeVersion = "GRCh38",
    version = 79
)
Importing GTF file ... OK
outfile specified, thus I will discard the path argument.Processing metadata ... OK
Processing genes ... 
 I'm missing column(s): 'entrezid','gene_biotype'. The corresponding database column(s) will be empty! Attribute availability:
  o gene_id ... OK
  o gene_name ... OK
  o entrezid ... Nope
  o gene_biotype ... Nope
OK
Processing transcripts ... 
 Attribute availability:
  o transcript_id ... OK
  o gene_id ... OK
  o source ... OK
OK
Processing exons ... OK
Processing chromosomes ... Fetch seqlengths from ensembl ... Could not determine length for all seqnames.FAIL
Unable to retrieve sequence lengths from Ensembl.OK
Generating index ... Error in result_create(conn@ptr, statement) : 
  UNIQUE constraint failed: exon.exon_id

Therefore, it might be easier to build a standard Ensembl v79 EnsDb from scratch. If you can build one and make it accessible via AnnotationHub, it will also benefit the broader community using data from GDC.
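
The `UNIQUE constraint failed: exon.exon_id` error above means that at least one exon_id ended up in the exon table more than once. One possible cause, not confirmed in this thread, is that the GTF lists the same exon_id with different coordinates. A minimal base R sketch of such a check follows. It uses a small invented data frame in place of the imported GTF, so the column names and all values are assumptions for illustration only.

```r
# Toy stand-in for the exon rows of an imported GTF; every value is invented.
exons <- data.frame(
    exon_id  = c("E1", "E1", "E2", "E2", "E3"),
    seqnames = c("chr1", "chr1", "chrX", "chrY", "chr2"),
    start    = c(100L, 100L, 500L, 500L, 900L),
    end      = c(200L, 200L, 600L, 600L, 950L),
    stringsAsFactors = FALSE
)

# Exons shared between transcripts repeat with identical coordinates;
# collapse those exact duplicates first.
distinct <- unique(exons)

# Any exon_id that still occurs more than once has conflicting definitions
# and would violate a UNIQUE index on exon_id.
conflicting <- unique(distinct$exon_id[duplicated(distinct$exon_id)])
conflicting
```

On a real file, the same check could be run on the exon rows after importing the GTF, for example with rtracklayer::import().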

jorainer commented 5 years ago

Good point. I will make them, I just have to finish Ensembl 97 first.

jorainer commented 5 years ago

Just added to AnnotationHub:

> library(AnnotationHub)
> ah <- AnnotationHub()
snapshotDate(): 2019-05-02
> query(ah, c("EnsDb", "v79"))
AnnotationHub with 1 record
# snapshotDate(): 2019-05-02 
# names(): AH73986
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# $rdatadateadded: 2019-05-02
# $title: Ensembl 79 EnsDb for Homo sapiens
# $description: Gene and protein annotations for Homo sapiens based on Ensem...
# $taxonomyid: 9606
# $genome: GRCh38
# $sourcetype: ensembl
# $sourceurl: http://www.ensembl.org
# $sourcesize: NA
# $tags: c("79", "AHEnsDbs", "Annotation", "EnsDb", "Ensembl", "Gene",
#   "Protein", "Transcript") 
# retrieve record with 'object[["AH73986"]]' 
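
A minimal sketch of loading and querying the record, assuming network access and that AH73986 is still served by AnnotationHub. The gene name used in the filter is only an illustrative choice and does not come from this thread.

```r
library(AnnotationHub)
library(ensembldb)

ah <- AnnotationHub()

# Downloads the SQLite file on first use and caches it locally.
edb <- ah[["AH73986"]]

# Check which Ensembl release the database was built from.
ensemblVersion(edb)

# Example query: gene-level annotation for one gene (illustrative choice).
genes(edb, filter = GeneNameFilter("TP53"))
```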

ccwang002 commented 5 years ago

Wow, thanks for making it available in such a short time! I just downloaded it and the EnsDb has everything I need. Really appreciate your help!