missing data in human EnsDb v112?

guidohooiveld commented 3 weeks ago

Hi Johannes, When working with the human EnsDb version 112 I obtained an unexpected result. Specifically, it seems a substantial number of genes (and transcripts) are not included in the current EnsDb. This seems not to be the case with EnsDb version 111 (the previous one). Moreover, when manually checking some of the 'missing' genes I found that they are still listed/included on the v112 Ensembl website. See below for details.

Question: did I oversee something and is this expected behavior, or is indeed something wrong?

Also note: | No. of genes differs a lot between EnsDb v111 and v112 (72035 and 64102). Idem for | No. of transcripts (278721 and 215647). Also note 2: EnsDb v112 has tag |genome_build: GRCh37, but is should be GRCh38 (like v111), right?

Thanks for your feedback! G

Background: For a recent sequencing experiment (human samples) I mapped the reads to the latest GENCODE release (= release 46) using the tool salmon. Next I imported all counts in R/Bioconductor using tximeta. So far, so good!

To add annotations I use the EnsDb. GENCODE release 46 corresponds to Ensembl-release 112. I thus downloaded human EnsDb v112 from the AnnotationHub. See: https://github.com/jorainer/ensembldb/issues/154#issuecomment-2165908661

Yet, after annotating all genes I noticed I 'lost' quite a lot of genes. This is unexpected, because this did not happen before. Moreover, this is not the case when usingEnsDb v111 for annotating.

Code to illustrate this:

> ## define not-in help function
> `%!in%` = Negate(`%in%`)
> 
> ##load required libraries
> library(tximeta)
> library(EnsDb.Hsapiens.v112)
> 
> ## check EnsDb; No. of genes: 64102.
> EnsDb.Hsapiens.v112
EnsDb for Ensembl:
|Backend: SQLite
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.3.10
|Creation time: Fri May 17 11:25:50 2024
|ensembl_version: 112
|ensembl_host: localhost
|Organism: Homo sapiens
|taxonomy_id: 9606
|genome_build: GRCh37
|DBSCHEMAVERSION: 2.2
|common_name: human
|species: homo_sapiens
| No. of genes: 64102.
| No. of transcripts: 215647.
|Protein data available.
> 
> ## check content tximeta object
> ## Number of genes is 63086
> gse.lengthscaled
class: RangedSummarizedExperiment 
dim: 63086 98 
metadata(7): tximetaInfo quantInfo ... txdbInfo assignRanges
assays(103): counts abundance ... infRep99 infRep100
rownames(63086): ENSG00000000003.16 ENSG00000000005.6 ...
  ENSG00000293599.1 ENSG00000293600.2
rowData names(6): gene_id tx_ids ... max_prop iso_prop
colnames(98): P02D1 P02D2 ... P51D1 P51D2
colData names(10): names SubjectID ... SampleName SalmonFile
> 
> ## retrieve all keys from EnsDb112, and check numbers
> ## everything makes sense!
> keys.EnsDb112 <- keys(EnsDb.Hsapiens.v112)
> length(!duplicated(keys.EnsDb112))
[1] 64102
> sum(!duplicated(keys.EnsDb112))
[1] 64102
> 
> ## do the same for tximeta object
> input.genes <- mcols(gse.lengthscaled)$gene_id
> length(!duplicated(input.genes))
[1] 63086
> sum(!duplicated(input.genes))
[1] 63086
> head(input.genes)
[1] "ENSG00000000003.16" "ENSG00000000005.6"  "ENSG00000000419.14"
[4] "ENSG00000000457.14" "ENSG00000000460.17" "ENSG00000000938.13"
> 
> ## remove version ids; nothing strange happened
> input.genes <- sub("\\..*", "", input.genes)
> length(!duplicated(input.genes))
[1] 63086
> sum(!duplicated(input.genes))
[1] 63086
> head(input.genes)
[1] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457"
[5] "ENSG00000000460" "ENSG00000000938"
> 
> ## how many input genes are in EnsDb112?
> ## only 52083? That is strange!
> present.in.db.112 <- input.genes %in% keys.EnsDb112 
> length(present.in.db.112)
[1] 63086
> sum(present.in.db.112)
[1] 52083
> 
> ## number of not annotated genes (11003!):
> length(present.in.db.112) - sum(present.in.db.112)
[1] 11003
> 
> 
> 
> ## list 12 genes that could not be annotated:
> NOT.present.in.db.112 <- input.genes %!in% keys.EnsDb112 
> length(NOT.present.in.db.112)
[1] 63086
> sum(NOT.present.in.db.112)
[1] 11003
> 
> head( input.genes[NOT.present.in.db.112] )
[1] "ENSG00000273497" "ENSG00000273499" "ENSG00000273500" "ENSG00000273507"
[5] "ENSG00000273509" "ENSG00000273512"
> tail( input.genes[NOT.present.in.db.112] )
[1] "ENSG00000293594" "ENSG00000293595" "ENSG00000293596" "ENSG00000293597"
[5] "ENSG00000293599" "ENSG00000293600"
> 
>

Yet, these not-annotated genes are listed on the ensembl website...!? (despite likely being pseudo-genes etc).

ENSG00000273497 link ENSG00000273512 link

ENSG00000293594 link ENSG00000293600 link

> ## repeat, but with previous EnsDb: version 111
> library(EnsDb.Hsapiens.v111)
> EnsDb.Hsapiens.v111
EnsDb for Ensembl:
|Backend: SQLite
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.3.10
|Creation time: Tue Jan 16 10:37:47 2024
|ensembl_version: 111
|ensembl_host: localhost
|Organism: Homo sapiens
|taxonomy_id: 9606
|genome_build: GRCh38
|DBSCHEMAVERSION: 2.2
|common_name: human
|species: homo_sapiens
| No. of genes: 72035.
| No. of transcripts: 278721.
|Protein data available.
> 
> 
> keys.EnsDb111 <- keys(EnsDb.Hsapiens.v111)
> length(!duplicated(keys.EnsDb111))
[1] 72035
> sum(!duplicated(keys.EnsDb111))
[1] 72035
> 
> ## how many input genes are in EnsDb111?
> ## Almost all!
> present.in.db.111 <- input.genes %in% keys.EnsDb111 
> length(present.in.db.111)
[1] 63086
> sum(present.in.db.111)
[1] 63060
> 
> ## number of not annotated genes (only 26!)
> length(present.in.db.111) - sum(present.in.db.111)
[1] 26
> 
>

zyh4482 commented 2 weeks ago

I found a terrorible mistake... genome_build: GRCh37 for v112 and genome_build: GRCh38 for v111.

I understand that liftover can take care of it. But I still prefer to use GRCh38 version...

Do you have any idea how to change the annotation database to GRCh38 version?

Update: Sorry for the redundant comment. I think I found how to make it. AnnotationHub::query(ah, pattern = c("EnsDb","Homo sapiens", "GRCh38"))

> AnnotationHub::query(ah, pattern = c("EnsDb","Homo sapiens","GRCh38"))
AnnotationHub with 26 records
# snapshotDate(): 2024-04-30
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass,
#   tags, rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH53211"]]' 

             title                             
  AH53211  | Ensembl 87 EnsDb for Homo Sapiens 
  AH53715  | Ensembl 88 EnsDb for Homo Sapiens 
  AH56681  | Ensembl 89 EnsDb for Homo Sapiens 
  AH57757  | Ensembl 90 EnsDb for Homo Sapiens 
  AH60773  | Ensembl 91 EnsDb for Homo Sapiens 
  ...        ...                               
  AH104864 | Ensembl 107 EnsDb for Homo sapiens
  AH109336 | Ensembl 108 EnsDb for Homo sapiens
  AH109606 | Ensembl 109 EnsDb for Homo sapiens
  AH113665 | Ensembl 110 EnsDb for Homo sapiens
  AH116291 | Ensembl 111 EnsDb for Homo sapiens
> AnnotationHub::query(ah, pattern = c("EnsDb","Homo sapiens"))
AnnotationHub with 27 records
# snapshotDate(): 2024-04-30
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass,
#   tags, rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH53211"]]' 

             title                             
  AH53211  | Ensembl 87 EnsDb for Homo Sapiens 
  AH53715  | Ensembl 88 EnsDb for Homo Sapiens 
  AH56681  | Ensembl 89 EnsDb for Homo Sapiens 
  AH57757  | Ensembl 90 EnsDb for Homo Sapiens 
  AH60773  | Ensembl 91 EnsDb for Homo Sapiens 
  ...        ...                               
  AH109336 | Ensembl 108 EnsDb for Homo sapiens
  AH109606 | Ensembl 109 EnsDb for Homo sapiens
  AH113665 | Ensembl 110 EnsDb for Homo sapiens
  AH116291 | Ensembl 111 EnsDb for Homo sapiens
  AH116860 | Ensembl 112 EnsDb for Homo sapiens

Currently, EnsemblDB v112 is not available for GRCh38.

Maybe this will affect the output?

zyh4482 commented 2 weeks ago

As a result, using the latest EnsemblDB v112, I could map fewer genes.

# 66214 found
# 65692/77379 OK

But using EnsemblDB v111, I could get most of my genes.

77373 found
77306/77379 OK

Is there anyway to fix it?