jorainer / ensembldb

This is the ensembldb development repository.
https://jorainer.github.io/ensembldb

Can't retrieve reference information #110

Open danielcgingerich opened 3 years ago

danielcgingerich commented 3 years ago

Could someone please explain how to get the annotation for GRCh38 2020-A and convert it to a GRanges object?

# Note: `GRCh38_2020-A` is not a valid R name (it parses as a subtraction), so an underscore is used here.
GRCh38_2020_A <- ensDbFromGtf(gtf = "http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.primary_assembly.annotation.gtf.gz",
                              path = 'C:/Users/danie/Desktop/Seurat Objects/snATAC seq preliminary analysis/ref.genome/',
                              organism = "Homo_sapiens",
                              genomeVersion = 'GRCh38',
                              version = 98)

Importing GTF file ... trying URL 'http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.primary_assembly.annotation.gtf.gz'
Content type 'application/octet-stream' length 43107903 bytes (41.1 MB)
downloaded 41.1 MB

OK
Processing metadata ... OK
Processing genes ... 
 Attribute availability:
  o gene_id ... OK
  o gene_name ... OK
  o entrezid ... Nope
  o gene_biotype ... Nope
OK
Processing transcripts ... 
 Attribute availability:
  o transcript_id ... OK
  o gene_id ... OK
  o source ... OK
OK
Processing exons ... OK
Processing chromosomes ... Fetch seqlengths from ensembl ... OK
Generating index ... Error: UNIQUE constraint failed: exon.exon_id
In addition: Warning messages:
1: In readLines(gtf, n = 10) : line 1 appears to contain an embedded nul
2: In readLines(gtf, n = 10) : line 2 appears to contain an embedded nul
3: In readLines(gtf, n = 10) : line 3 appears to contain an embedded nul
4: In readLines(gtf, n = 10) : line 6 appears to contain an embedded nul
5: In ensDbFromGRanges(GTF, outfile = outfile, path = path, organism = organism,  :
   I'm missing column(s): 'entrezid','gene_biotype'. The corresponding database column(s) will be empty!
6: In .getSeqlengthsFromMysqlFolder(organism = organism, ensembl = ensemblVersion,  :
  Could not determine length for all seqnames.

Why?

jorainer commented 3 years ago

Hi, sorry for the late reply!

According to the error message, the exon identifiers in the GTF file are not unique, and there is not much we can do about that. In general, creating EnsDb objects/databases from GTF files is tricky because the GTF format is not well standardized. Creating databases from Ensembl GTF files should work; for the ones from Gencode I don't know.
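To check whether a given GTF really carries the same exon ID at more than one location, something along these lines might help. This is only a sketch: `find_conflicting_exon_ids` is a hypothetical helper (not part of ensembldb), the column names are an assumption that matches what `as.data.frame()` typically yields for an imported GTF, and the toy table is invented for illustration.

```r
# Hypothetical helper (not part of ensembldb): report exon IDs that are
# attached to more than one distinct genomic location.
find_conflicting_exon_ids <- function(exons) {
    required <- c("seqnames", "start", "end", "strand", "exon_id")
    missing_cols <- setdiff(required, colnames(exons))
    if (length(missing_cols) > 0)
        stop("missing column(s): ", paste(missing_cols, collapse = ", "))
    # A GTF lists the same exon once per transcript, so collapse identical
    # (location, ID) rows before counting locations per ID.
    locs <- unique(exons[, required])
    n_locs <- table(as.character(locs$exon_id))
    names(n_locs)[n_locs > 1]
}

# Toy input, invented for illustration only.
toy <- data.frame(
    seqnames = c("chrX", "chrX", "chrY", "chr1"),
    start    = c(100L, 100L, 100L, 500L),
    end      = c(200L, 200L, 200L, 600L),
    strand   = c("+", "+", "+", "-"),
    exon_id  = c("E1", "E1", "E1", "E2"),
    stringsAsFactors = FALSE
)
conflicts <- find_conflicting_exon_ids(toy)
```

On a real file, the input could presumably be built by reading the GTF with `rtracklayer::import()`, keeping the rows whose `type` is `"exon"`, and calling `as.data.frame()` on the result. Whether the Gencode file actually contains such IDs has not been verified here.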

Note that there are pre-built annotation resources for all Ensembl releases:

> library(AnnotationHub)
> ah <- AnnotationHub()
snapshotDate(): 2020-11-02
> query(ah, "EnsDb.Hsapiens.v98")
AnnotationHub with 1 record
# snapshotDate(): 2020-11-02
# names(): AH75011
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# $rdatadateadded: 2019-05-02
# $title: Ensembl 98 EnsDb for Homo sapiens
# $description: Gene and protein annotations for Homo sapiens based on Ensem...
# $taxonomyid: 9606
# $genome: GRCh38
# $sourcetype: ensembl
# $sourceurl: http://www.ensembl.org
# $sourcesize: NA
# $tags: c("98", "AHEnsDbs", "Annotation", "EnsDb", "Ensembl", "Gene",
#   "Protein", "Transcript") 
# retrieve record with 'object[["AH75011"]]' 

Since Gencode 32 is based on Ensembl 98, would this work for you?
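If it does, getting from that record to a GRanges object could look roughly like this. It is a minimal sketch, assuming the AnnotationHub and ensembldb packages are installed and the hub is reachable over the network; `get_gene_ranges` is a hypothetical wrapper name, and `AH75011` is the record ID shown in the query output above.

```r
# Sketch: fetch a pre-built EnsDb from AnnotationHub and return its genes
# as a GRanges object. `get_gene_ranges` is a hypothetical wrapper.
get_gene_ranges <- function(ah_id = "AH75011") {
    ah <- AnnotationHub::AnnotationHub()
    edb <- ah[[ah_id]]        # downloads and caches the EnsDb on first use
    ensembldb::genes(edb)     # GRanges with one range per gene
}

# Needs network access, so it is left commented out here:
# gns <- get_gene_ranges()
```

Ensembl-based resources name chromosomes `1`, `2`, ... rather than `chr1`, `chr2`, ... as Gencode does, so the sequence names may need converting (for example with `seqlevelsStyle()`) before the ranges are combined with Gencode- or UCSC-style data.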