jorainer / ensembldb

This is the ensembldb development repository.
https://jorainer.github.io/ensembldb
33 stars 10 forks source link

missing data in human EnsDb v112? #155

Open guidohooiveld opened 3 weeks ago

guidohooiveld commented 3 weeks ago

Hi Johannes, When working with the human EnsDb version 112 I obtained an unexpected result. Specifically, it seems a substantial number of genes (and transcripts) are not included in the current EnsDb. This seems not to be the case with EnsDb version 111 (the previous one). Moreover, when manually checking some of the 'missing' genes I found that they are still listed/included on the v112 Ensembl website. See below for details.

Question: did I oversee something and is this expected behavior, or is indeed something wrong?

Also note: | No. of genes differs a lot between EnsDb v111 and v112 (72035 and 64102). Idem for | No. of transcripts (278721 and 215647). Also note 2: EnsDb v112 has tag |genome_build: GRCh37, but is should be GRCh38 (like v111), right?

Thanks for your feedback! G

Background: For a recent sequencing experiment (human samples) I mapped the reads to the latest GENCODE release (= release 46) using the tool salmon. Next I imported all counts in R/Bioconductor using tximeta. So far, so good!

To add annotations I use the EnsDb. GENCODE release 46 corresponds to Ensembl-release 112. I thus downloaded human EnsDb v112 from the AnnotationHub. See: https://github.com/jorainer/ensembldb/issues/154#issuecomment-2165908661

Yet, after annotating all genes I noticed I 'lost' quite a lot of genes. This is unexpected, because this did not happen before. Moreover, this is not the case when usingEnsDb v111 for annotating.

Code to illustrate this:

> ## define not-in help function
> `%!in%` = Negate(`%in%`)
> 
> ##load required libraries
> library(tximeta)
> library(EnsDb.Hsapiens.v112)
> 
> ## check EnsDb; No. of genes: 64102.
> EnsDb.Hsapiens.v112
EnsDb for Ensembl:
|Backend: SQLite
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.3.10
|Creation time: Fri May 17 11:25:50 2024
|ensembl_version: 112
|ensembl_host: localhost
|Organism: Homo sapiens
|taxonomy_id: 9606
|genome_build: GRCh37
|DBSCHEMAVERSION: 2.2
|common_name: human
|species: homo_sapiens
| No. of genes: 64102.
| No. of transcripts: 215647.
|Protein data available.
> 
> ## check content tximeta object
> ## Number of genes is 63086
> gse.lengthscaled
class: RangedSummarizedExperiment 
dim: 63086 98 
metadata(7): tximetaInfo quantInfo ... txdbInfo assignRanges
assays(103): counts abundance ... infRep99 infRep100
rownames(63086): ENSG00000000003.16 ENSG00000000005.6 ...
  ENSG00000293599.1 ENSG00000293600.2
rowData names(6): gene_id tx_ids ... max_prop iso_prop
colnames(98): P02D1 P02D2 ... P51D1 P51D2
colData names(10): names SubjectID ... SampleName SalmonFile
> 
> ## retrieve all keys from EnsDb112, and check numbers
> ## everything makes sense!
> keys.EnsDb112 <- keys(EnsDb.Hsapiens.v112)
> length(!duplicated(keys.EnsDb112))
[1] 64102
> sum(!duplicated(keys.EnsDb112))
[1] 64102
> 
> ## do the same for tximeta object
> input.genes <- mcols(gse.lengthscaled)$gene_id
> length(!duplicated(input.genes))
[1] 63086
> sum(!duplicated(input.genes))
[1] 63086
> head(input.genes)
[1] "ENSG00000000003.16" "ENSG00000000005.6"  "ENSG00000000419.14"
[4] "ENSG00000000457.14" "ENSG00000000460.17" "ENSG00000000938.13"
> 
> ## remove version ids; nothing strange happened
> input.genes <- sub("\\..*", "", input.genes)
> length(!duplicated(input.genes))
[1] 63086
> sum(!duplicated(input.genes))
[1] 63086
> head(input.genes)
[1] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457"
[5] "ENSG00000000460" "ENSG00000000938"
> 
> ## how many input genes are in EnsDb112?
> ## only 52083? That is strange!
> present.in.db.112 <- input.genes %in% keys.EnsDb112 
> length(present.in.db.112)
[1] 63086
> sum(present.in.db.112)
[1] 52083
> 
> ## number of not annotated genes (11003!):
> length(present.in.db.112) - sum(present.in.db.112)
[1] 11003
> 
> 
> 
> ## list 12 genes that could not be annotated:
> NOT.present.in.db.112 <- input.genes %!in% keys.EnsDb112 
> length(NOT.present.in.db.112)
[1] 63086
> sum(NOT.present.in.db.112)
[1] 11003
> 
> head( input.genes[NOT.present.in.db.112] )
[1] "ENSG00000273497" "ENSG00000273499" "ENSG00000273500" "ENSG00000273507"
[5] "ENSG00000273509" "ENSG00000273512"
> tail( input.genes[NOT.present.in.db.112] )
[1] "ENSG00000293594" "ENSG00000293595" "ENSG00000293596" "ENSG00000293597"
[5] "ENSG00000293599" "ENSG00000293600"
> 
>

Yet, these not-annotated genes are listed on the ensembl website...!? (despite likely being pseudo-genes etc).

ENSG00000273497 link ENSG00000273512 link

ENSG00000293594 link ENSG00000293600 link

> ## repeat, but with previous EnsDb: version 111
> library(EnsDb.Hsapiens.v111)
> EnsDb.Hsapiens.v111
EnsDb for Ensembl:
|Backend: SQLite
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.3.10
|Creation time: Tue Jan 16 10:37:47 2024
|ensembl_version: 111
|ensembl_host: localhost
|Organism: Homo sapiens
|taxonomy_id: 9606
|genome_build: GRCh38
|DBSCHEMAVERSION: 2.2
|common_name: human
|species: homo_sapiens
| No. of genes: 72035.
| No. of transcripts: 278721.
|Protein data available.
> 
> 
> keys.EnsDb111 <- keys(EnsDb.Hsapiens.v111)
> length(!duplicated(keys.EnsDb111))
[1] 72035
> sum(!duplicated(keys.EnsDb111))
[1] 72035
> 
> ## how many input genes are in EnsDb111?
> ## Almost all!
> present.in.db.111 <- input.genes %in% keys.EnsDb111 
> length(present.in.db.111)
[1] 63086
> sum(present.in.db.111)
[1] 63060
> 
> ## number of not annotated genes (only 26!)
> length(present.in.db.111) - sum(present.in.db.111)
[1] 26
> 
>
zyh4482 commented 2 weeks ago

I found a terrorible mistake... genome_build: GRCh37 for v112 and genome_build: GRCh38 for v111.

I understand that liftover can take care of it. But I still prefer to use GRCh38 version...

Do you have any idea how to change the annotation database to GRCh38 version?

Update: Sorry for the redundant comment. I think I found how to make it. AnnotationHub::query(ah, pattern = c("EnsDb","Homo sapiens", "GRCh38"))

> AnnotationHub::query(ah, pattern = c("EnsDb","Homo sapiens","GRCh38"))
AnnotationHub with 26 records
# snapshotDate(): 2024-04-30
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass,
#   tags, rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH53211"]]' 

             title                             
  AH53211  | Ensembl 87 EnsDb for Homo Sapiens 
  AH53715  | Ensembl 88 EnsDb for Homo Sapiens 
  AH56681  | Ensembl 89 EnsDb for Homo Sapiens 
  AH57757  | Ensembl 90 EnsDb for Homo Sapiens 
  AH60773  | Ensembl 91 EnsDb for Homo Sapiens 
  ...        ...                               
  AH104864 | Ensembl 107 EnsDb for Homo sapiens
  AH109336 | Ensembl 108 EnsDb for Homo sapiens
  AH109606 | Ensembl 109 EnsDb for Homo sapiens
  AH113665 | Ensembl 110 EnsDb for Homo sapiens
  AH116291 | Ensembl 111 EnsDb for Homo sapiens
> AnnotationHub::query(ah, pattern = c("EnsDb","Homo sapiens"))
AnnotationHub with 27 records
# snapshotDate(): 2024-04-30
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass,
#   tags, rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH53211"]]' 

             title                             
  AH53211  | Ensembl 87 EnsDb for Homo Sapiens 
  AH53715  | Ensembl 88 EnsDb for Homo Sapiens 
  AH56681  | Ensembl 89 EnsDb for Homo Sapiens 
  AH57757  | Ensembl 90 EnsDb for Homo Sapiens 
  AH60773  | Ensembl 91 EnsDb for Homo Sapiens 
  ...        ...                               
  AH109336 | Ensembl 108 EnsDb for Homo sapiens
  AH109606 | Ensembl 109 EnsDb for Homo sapiens
  AH113665 | Ensembl 110 EnsDb for Homo sapiens
  AH116291 | Ensembl 111 EnsDb for Homo sapiens
  AH116860 | Ensembl 112 EnsDb for Homo sapiens

Currently, EnsemblDB v112 is not available for GRCh38.

Maybe this will affect the output?

zyh4482 commented 2 weeks ago

As a result, using the latest EnsemblDB v112, I could map fewer genes.

# 66214 found
# 65692/77379 OK

But using EnsemblDB v111, I could get most of my genes.

77373 found
77306/77379 OK

Is there anyway to fix it?

jorainer commented 2 weeks ago

This is a bit surprising. I will dig a bit into this. I do not specifically set any Genome build version (AFAIK) but simply querying data from Ensembl using the perl API and putting that into SQLite databases (that are shared through AnnotationHub).

jorainer commented 2 weeks ago

Uh, yes, Ensembl release 112 has two core databases, one for GRCH37 and one for GRCH38. My automatic script picked up the first, which happened to be the GRCH37 :( . I'll create the correct version and fix/update the file in AnnotationHub.

guidohooiveld commented 2 weeks ago

Thanks for having a look at this discrepancy, and finding and fixing the source of it so quickly. :)

Since processing of files on the AnnotationHub apparently may take some time; would you mind making the fixed database temporarily available through e.g. Dropbox as well? I/we can then fire-up our pipelines right away. :)

jorainer commented 2 weeks ago

Sure - I'm just at a conference at present and bandwidth is limited ... so all takes a bit longer. but I'll post a link here once I'm done (seems only the human annotations were affected, so I'll replace that)

zyh4482 commented 2 weeks ago

Thank you so much!

jorainer commented 2 weeks ago

thanks for spotting this issue! I was not aware (and was not expecting) that Ensembl creates now two flavors of annotations, one for GRCH37 and one for GRCH38!

jorainer commented 2 weeks ago

so, the EnsDb for GrCh38 is available here. I'm also updating/replacing AnnotationHub, but we have to figure out how to best do that.

guidohooiveld commented 2 weeks ago

Thanks for the updated file. I have downloaded it, and based on my tests it looks fine now. That is, all my input genes (63086) are being annotated now (in the previous, GRCH37-based version 11003 genes were missing). I will do some additional tests (with transcripts) and report back if needed.

Updated database:

> EnsDb.Hsapiens.v112
EnsDb for Ensembl:
|Backend: SQLite
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.3.10
|Creation time: Mon Jun 17 16:14:46 2024
|ensembl_version: 112
|ensembl_host: localhost
|Organism: Homo sapiens
|taxonomy_id: 9606
|genome_build: GRCh38
|DBSCHEMAVERSION: 2.2
|common_name: human
|species: homo_sapiens
| No. of genes: 71935.
| No. of transcripts: 279860.
|Protein data available.
>
guidohooiveld commented 2 weeks ago

Also obtained expected results on the level of transcripts, so as far as I am concerned the updated database is correct.

@zyh4482 : what about the outcomes of your analyses?

zyh4482 commented 2 weeks ago

For my dataset, I got:

Fetching CDS for 77379 proteins ... 77379 found
Checking CDS and protein sequence lengths ... 77312/77379 OK

At this stage, everything looks perfect.

Previously, I used v111 for my downstream analysis. I found some discrepancies regarding to unmapped genomic coordinates. I'll update my results later.

zyh4482 commented 2 weeks ago

I finished my analysis. This database is perfect to use. Thanks again! @jorainer