grimbough / biomaRt

R package providing query functionality to BioMart instances like Ensembl
https://bioconductor.org/packages/biomaRt/

Ensembl 88 #107

Closed AR-Shicheng closed 2 weeks ago

AR-Shicheng commented 1 month ago

Hi Mike,

I am wondering why Ensembl 88 in particular is missing from the biomaRt package?

Thanks,

Shicheng

AR-Shicheng commented 1 month ago

Interestingly, the release that matters most to us, Ensembl 88, is the one that is missing.

> library(biomaRt)
Warning: program compiled against libxml 210 using older 209
>
> listEnsembl()

        biomart                version
1         genes      Ensembl Genes 112
2 mouse_strains      Mouse strains 112
3          snps  Ensembl Variation 112
4    regulation Ensembl Regulation 112
>
> listEnsemblArchives()
             name     date                                 url version
1  Ensembl GRCh37 Feb 2014          https://grch37.ensembl.org  GRCh37
2     Ensembl 112 May 2024 https://may2024.archive.ensembl.org     112
3     Ensembl 111 Jan 2024 https://jan2024.archive.ensembl.org     111
4     Ensembl 110 Jul 2023 https://jul2023.archive.ensembl.org     110
5     Ensembl 109 Feb 2023 https://feb2023.archive.ensembl.org     109
6     Ensembl 108 Oct 2022 https://oct2022.archive.ensembl.org     108
7     Ensembl 107 Jul 2022 https://jul2022.archive.ensembl.org     107
8     Ensembl 106 Apr 2022 https://apr2022.archive.ensembl.org     106
9     Ensembl 105 Dec 2021 https://dec2021.archive.ensembl.org     105
10    Ensembl 104 May 2021 https://may2021.archive.ensembl.org     104
11    Ensembl 103 Feb 2021 https://feb2021.archive.ensembl.org     103
12    Ensembl 102 Nov 2020 https://nov2020.archive.ensembl.org     102
13    Ensembl 101 Aug 2020 https://aug2020.archive.ensembl.org     101
14    Ensembl 100 Apr 2020 https://apr2020.archive.ensembl.org     100
15     Ensembl 99 Jan 2020 https://jan2020.archive.ensembl.org      99
16     Ensembl 98 Sep 2019 https://sep2019.archive.ensembl.org      98
17     Ensembl 97 Jul 2019 https://jul2019.archive.ensembl.org      97
18     Ensembl 80 May 2015 https://may2015.archive.ensembl.org      80
19     Ensembl 77 Oct 2014 https://oct2014.archive.ensembl.org      77
20     Ensembl 75 Feb 2014 https://feb2014.archive.ensembl.org      75
21     Ensembl 54 May 2009 https://may2009.archive.ensembl.org      54
   current_release
AR-Shicheng commented 1 month ago

Is there a way for us to build Ensembl 88 ourselves? If so, how can we do it?

grimbough commented 1 month ago

Ensembl keeps each release available for 5 years. A few selected releases are retained for longer, but in most cases once 5 years have passed a release is deemed out of date and removed. Ensembl 88 is from May 2017 and was removed about 2 years ago. There are more details on the archive policies at https://www.ensembl.org/info/website/archives/index.html

biomaRt is only an interface for querying the databases Ensembl makes available, so you can't use it to access release 88.
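
One way to confirm this from R is to look at the archive table biomaRt reports. This is only a minimal sketch: it assumes a working network connection and that the `version` column holds release numbers as text, as in the listing above. Release 98 is used purely as an example of a version that is still archived.

```r
library(biomaRt)

## List the archives Ensembl currently hosts and test for a given release
archives <- listEnsemblArchives()
"88" %in% archives$version   # FALSE when the release has been retired

## A release that is still archived can be selected by number
## (98 is just an illustrative choice from the table)
ensembl_98 <- useEnsembl(biomart = "genes",
                         dataset = "hsapiens_gene_ensembl",
                         version = 98)
```

If the release is absent from `archives`, no `version` or `host` argument will reach it, because the server behind it no longer exists.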

In theory you could build your own version from the original source data, available from https://ftp.ensembl.org/pub/release-88/. However, I don't think Ensembl provides any instructions on how to do this, and it would be a very difficult task.

I would ask why using such an old version is important. If there's a really good reason, maybe you can get the information you need from those files on the FTP site, rather than using BioMart. If not, then perhaps using a more recent version of the annotation data would be fine.
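
As a rough sketch of the "use the FTP files directly" route, the release 88 gene annotation could be read from its GTF file. The exact file path below is an assumption based on Ensembl's usual FTP layout and should be checked against the directory listing; rtracklayer is also an assumption here and is just one of several packages that can read GTF files.

```r
## Assumed location of the human GTF under the release-88 FTP directory;
## browse https://ftp.ensembl.org/pub/release-88/ if the download fails
gtf_url <- paste0("https://ftp.ensembl.org/pub/release-88/gtf/homo_sapiens/",
                  "Homo_sapiens.GRCh38.88.gtf.gz")
gtf_file <- file.path(tempdir(), basename(gtf_url))
download.file(gtf_url, destfile = gtf_file, mode = "wb")

## rtracklayer reads the GTF into a GRanges object, one range per feature
gtf <- rtracklayer::import(gtf_file)

## Keep only transcript-level records and view a few annotation columns
tx <- gtf[gtf$type == "transcript"]
head(as.data.frame(tx)[, c("gene_id", "gene_name",
                           "transcript_id", "transcript_biotype")])
```

This gives gene and transcript identifiers for the release without needing a BioMart server, although it does not cover everything a BioMart query can return.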

AR-Shicheng commented 1 month ago

Dear Mike,

If you could prepare an Ensembl 88 dataset, it would be a tremendous help to the community. As you may know, the GTEx data is crucial for our research, and its results, particularly at the transcript level, are based on Ensembl 88 and have not been updated to later Ensembl versions.

GTEx Portal Information

Without Ensembl 88, many analyses could run into significant issues, including conflicting or misleading results, which could lead to serious reproducibility concerns.

Best regards,

Shicheng


grimbough commented 1 month ago

I'm not going to create my own instance of BioMart. I don't work for Ensembl, nor do I have the time or resources to maintain my own BioMart server.

However, you could use another source of annotation in Bioconductor. The ensembldb package (https://bioconductor.org/packages/release/bioc/html/ensembldb.html), together with AnnotationHub, lets you download snapshots of each Ensembl release to work with locally.

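## ensembldb is needed to work with the EnsDb object retrieved below
BiocManager::install(c('AnnotationHub', 'ensembldb'))
## attach AnnotationHub so that query() is available
library(AnnotationHub)
ah <- AnnotationHub()

## search for the Human Ensembl 88 database
query(ah, pattern = c("Ensembl 88", "Sapiens"))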

AnnotationHub with 1 record
# snapshotDate(): 2024-04-30
# names(): AH53715
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# $rdatadateadded: 2017-04-05
# $title: Ensembl 88 EnsDb for Homo Sapiens
# $description: Gene and protein annotations for Homo Sapiens based on Ensembl version 88.
# $taxonomyid: 9606
# $genome: GRCh38
# $sourcetype: ensembl
# $sourceurl: http://www.ensembl.org
# $sourcesize: NA
# $tags: c("EnsDb", "Ensembl", "Gene", "Transcript", "Protein", "Annotation", "88", "AHEnsDbs") 
# retrieve record with 'object[["AH53715"]]' 

## This finds only one record, and gives instructions on how to retrieve it
## Downloading might take quite a while
ens_88 <- ah[["AH53715"]]
ens_88
# EnsDb for Ensembl:
# |Backend: SQLite
# |Db type: EnsDb
# |Type of Gene ID: Ensembl Gene ID
# |Supporting package: ensembldb
# |Db created by: ensembldb package from Bioconductor
# |script_version: 0.3.1
# |Creation time: Thu Jun 15 08:50:24 2017
# |ensembl_version: 88
# |ensembl_host: localhost
# |Organism: homo_sapiens
# |taxonomy_id: 9606
# |genome_build: GRCh38
# |DBSCHEMAVERSION: 2.1
# | No. of genes: 64592.
# | No. of transcripts: 219063.
# |Protein data available.

You'll need to look at the manual for ensembldb to figure out how to work with that object and extract the data you want, but it should match the Ensembl release you want to work with.
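
As a starting point, here is a minimal sketch of pulling transcript annotation out of the `ens_88` object retrieved above. The gene name "BRCA1" and the selected columns are only illustrative choices, not anything specific to this thread.

```r
library(ensembldb)

## All transcripts for a single gene; "BRCA1" is just an example
tx <- transcripts(ens_88, filter = GeneNameFilter("BRCA1"))
tx

## A gene/transcript table, similar in shape to a getBM() result
tx_table <- transcripts(ens_88,
                        columns = c("gene_id", "gene_name",
                                    "tx_id", "tx_biotype"),
                        return.type = "data.frame")
head(tx_table)
```

Calling `listColumns(ens_88)` shows which other columns can be requested, and `genes()` and `exons()` work in the same way for the other feature levels.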