Open guidohooiveld opened 3 weeks ago
I found a terrorible mistake... genome_build: GRCh37
for v112 and genome_build: GRCh38
for v111.
I understand that liftover can take care of it. But I still prefer to use GRCh38 version...
Do you have any idea how to change the annotation database to GRCh38 version?
Update:
Sorry for the redundant comment. I think I found how to make it.
AnnotationHub::query(ah, pattern = c("EnsDb","Homo sapiens", "GRCh38"))
> AnnotationHub::query(ah, pattern = c("EnsDb","Homo sapiens","GRCh38"))
AnnotationHub with 26 records
# snapshotDate(): 2024-04-30
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
# coordinate_1_based, maintainer, rdatadateadded, preparerclass,
# tags, rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH53211"]]'
title
AH53211 | Ensembl 87 EnsDb for Homo Sapiens
AH53715 | Ensembl 88 EnsDb for Homo Sapiens
AH56681 | Ensembl 89 EnsDb for Homo Sapiens
AH57757 | Ensembl 90 EnsDb for Homo Sapiens
AH60773 | Ensembl 91 EnsDb for Homo Sapiens
... ...
AH104864 | Ensembl 107 EnsDb for Homo sapiens
AH109336 | Ensembl 108 EnsDb for Homo sapiens
AH109606 | Ensembl 109 EnsDb for Homo sapiens
AH113665 | Ensembl 110 EnsDb for Homo sapiens
AH116291 | Ensembl 111 EnsDb for Homo sapiens
> AnnotationHub::query(ah, pattern = c("EnsDb","Homo sapiens"))
AnnotationHub with 27 records
# snapshotDate(): 2024-04-30
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
# coordinate_1_based, maintainer, rdatadateadded, preparerclass,
# tags, rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH53211"]]'
title
AH53211 | Ensembl 87 EnsDb for Homo Sapiens
AH53715 | Ensembl 88 EnsDb for Homo Sapiens
AH56681 | Ensembl 89 EnsDb for Homo Sapiens
AH57757 | Ensembl 90 EnsDb for Homo Sapiens
AH60773 | Ensembl 91 EnsDb for Homo Sapiens
... ...
AH109336 | Ensembl 108 EnsDb for Homo sapiens
AH109606 | Ensembl 109 EnsDb for Homo sapiens
AH113665 | Ensembl 110 EnsDb for Homo sapiens
AH116291 | Ensembl 111 EnsDb for Homo sapiens
AH116860 | Ensembl 112 EnsDb for Homo sapiens
Currently, EnsemblDB v112 is not available for GRCh38.
Maybe this will affect the output?
As a result, using the latest EnsemblDB v112, I could map fewer genes.
# 66214 found
# 65692/77379 OK
But using EnsemblDB v111, I could get most of my genes.
77373 found
77306/77379 OK
Is there anyway to fix it?
This is a bit surprising. I will dig a bit into this. I do not specifically set any Genome build version (AFAIK) but simply querying data from Ensembl using the perl API and putting that into SQLite databases (that are shared through AnnotationHub
).
Uh, yes, Ensembl release 112 has two core databases, one for GRCH37 and one for GRCH38. My automatic script picked up the first, which happened to be the GRCH37 :( . I'll create the correct version and fix/update the file in AnnotationHub.
Thanks for having a look at this discrepancy, and finding and fixing the source of it so quickly. :)
Since processing of files on the AnnotationHub
apparently may take some time; would you mind making the fixed database temporarily available through e.g. Dropbox as well? I/we can then fire-up our pipelines right away. :)
Sure - I'm just at a conference at present and bandwidth is limited ... so all takes a bit longer. but I'll post a link here once I'm done (seems only the human annotations were affected, so I'll replace that)
Thank you so much!
thanks for spotting this issue! I was not aware (and was not expecting) that Ensembl creates now two flavors of annotations, one for GRCH37 and one for GRCH38!
so, the EnsDb
for GrCh38 is available here. I'm also updating/replacing AnnotationHub, but we have to figure out how to best do that.
Thanks for the updated file. I have downloaded it, and based on my tests it looks fine now. That is, all my input genes (63086) are being annotated now (in the previous, GRCH37-based version 11003 genes were missing). I will do some additional tests (with transcripts) and report back if needed.
Updated database:
> EnsDb.Hsapiens.v112
EnsDb for Ensembl:
|Backend: SQLite
|Db type: EnsDb
|Type of Gene ID: Ensembl Gene ID
|Supporting package: ensembldb
|Db created by: ensembldb package from Bioconductor
|script_version: 0.3.10
|Creation time: Mon Jun 17 16:14:46 2024
|ensembl_version: 112
|ensembl_host: localhost
|Organism: Homo sapiens
|taxonomy_id: 9606
|genome_build: GRCh38
|DBSCHEMAVERSION: 2.2
|common_name: human
|species: homo_sapiens
| No. of genes: 71935.
| No. of transcripts: 279860.
|Protein data available.
>
Also obtained expected results on the level of transcripts, so as far as I am concerned the updated database is correct.
@zyh4482 : what about the outcomes of your analyses?
For my dataset, I got:
Fetching CDS for 77379 proteins ... 77379 found
Checking CDS and protein sequence lengths ... 77312/77379 OK
At this stage, everything looks perfect.
Previously, I used v111 for my downstream analysis. I found some discrepancies regarding to unmapped genomic coordinates. I'll update my results later.
I finished my analysis. This database is perfect to use. Thanks again! @jorainer
Hi Johannes, When working with the human
EnsDb
version 112 I obtained an unexpected result. Specifically, it seems a substantial number of genes (and transcripts) are not included in the currentEnsDb
. This seems not to be the case withEnsDb
version 111 (the previous one). Moreover, when manually checking some of the 'missing' genes I found that they are still listed/included on the v112 Ensembl website. See below for details.Question: did I oversee something and is this expected behavior, or is indeed something wrong?
Also note:
| No. of genes
differs a lot betweenEnsDb
v111 and v112 (72035 and 64102). Idem for| No. of transcripts
(278721 and 215647). Also note 2:EnsDb
v112 has tag|genome_build: GRCh37
, but is should beGRCh38
(like v111), right?Thanks for your feedback! G
Background: For a recent sequencing experiment (human samples) I mapped the reads to the latest GENCODE release (= release 46) using the tool
salmon
. Next I imported all counts in R/Bioconductor usingtximeta
. So far, so good!To add annotations I use the
EnsDb
. GENCODE release 46 corresponds to Ensembl-release 112. I thus downloaded humanEnsDb
v112 from theAnnotationHub
. See: https://github.com/jorainer/ensembldb/issues/154#issuecomment-2165908661Yet, after annotating all genes I noticed I 'lost' quite a lot of genes. This is unexpected, because this did not happen before. Moreover, this is not the case when using
EnsDb
v111 for annotating.Code to illustrate this:
Yet, these not-annotated genes are listed on the ensembl website...!? (despite likely being pseudo-genes etc).
ENSG00000273497 link ENSG00000273512 link
ENSG00000293594 link ENSG00000293600 link