Open julie-sullivan opened 4 years ago
Here is a data file:
+++++++++++++++++++++++++
Stable ID to uniprot Ac
+++++++++++++++++++++++++
Provides mappings from Gene, Transcript and Translation stable identifiers to
uniprot accessions with reports as to the % identity of the hit where
applicable. Dumps contain all Ensembl external database names which started
with UniProt so duplication of hits is possible.
ftp://ftp.ensembl.org/pub/grch37/release-98/tsv/homo_sapiens/
gene_stable_id transcript_stable_id protein_stable_id xref db_name info_type source_identity xref_identity linkage_type
ENSG00000186092 ENST00000335137 ENSP00000334393 Q8NH21 Uniprot/SWISSPROT DIRECT - - -
ENSG00000237683 ENST00000423372 ENSP00000473460 B7Z7W4 Uniprot/SPTREMBL SEQUENCE_MATCH 56 100 -
ENSG00000237683 ENST00000423372 ENSP00000473460 R4GN28 Uniprot/SPTREMBL SEQUENCE_MATCH 100 100 -
species taxid gene_stable_id transcript_stable_id protein_stable_id primary_accession secondary_accession
Homo_sapiens 9606 ENSG00000000003 ENST00000373020 ENSP00000362111 CM000685 AAC69710
Homo_sapiens 9606 ENSG00000000003 ENST00000496771 CM000685
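A minimal sketch of how the UniProt mapping rows shown above could be read, in case we go with the flat files. It assumes the real file is tab-separated with the header row exactly as listed; the function name `uniprot_by_protein` is made up for illustration. The sample rows are the three copied from above, and the set-based grouping is there because the README warns that duplicate hits are possible.

```python
import csv
import io
from collections import defaultdict

# Sample rows copied from the listing above; the real file is ASSUMED to be tab-separated.
SAMPLE = (
    "gene_stable_id\ttranscript_stable_id\tprotein_stable_id\txref\tdb_name\t"
    "info_type\tsource_identity\txref_identity\tlinkage_type\n"
    "ENSG00000186092\tENST00000335137\tENSP00000334393\tQ8NH21\tUniprot/SWISSPROT\tDIRECT\t-\t-\t-\n"
    "ENSG00000237683\tENST00000423372\tENSP00000473460\tB7Z7W4\tUniprot/SPTREMBL\tSEQUENCE_MATCH\t56\t100\t-\n"
    "ENSG00000237683\tENST00000423372\tENSP00000473460\tR4GN28\tUniprot/SPTREMBL\tSEQUENCE_MATCH\t100\t100\t-\n"
)


def uniprot_by_protein(handle):
    """Map each protein stable ID to a sorted list of distinct UniProt accessions."""
    mapping = defaultdict(set)
    for row in csv.DictReader(handle, delimiter="\t"):
        # A set collapses the duplicate hits the README mentions.
        mapping[row["protein_stable_id"]].add(row["xref"])
    return {protein: sorted(accessions) for protein, accessions in mapping.items()}


result = uniprot_by_protein(io.StringIO(SAMPLE))
print(result)
```

For the downloaded file, the same function should work on a handle opened with `gzip.open(path, "rt")` if the dump is gzipped.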
From reading the Ensembl docs, I think the Perl API is the best Ensembl have to offer, given the data we need.
I don't see a Docker image, only a VM: https://www.ensembl.org/info/data/virtual_machine.html
Here are the mysql files:
http://ftp.ensemblorg.ebi.ac.uk/pub/release-89/mysql/homo_sapiens_core_89_38/
Which tables do we need?
FYI, the protein_function_predictions table takes about an hour to download:
http://ftp.ensemblorg.ebi.ac.uk/pub/current_mysql/homo_sapiens_variation_98_38/protein_function_predictions.txt.gz
OR we could use the REST API:
https://rest.ensembl.org/
GET xrefs/symbol/:species/:symbol
- xrefs
But I don't see a way to get SIFT/PolyPhen scores or cytoband info.
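For reference, a rough sketch of calling the xrefs/symbol endpoint mentioned above, using only the Python standard library. The helper names are made up, and the response shape (a list of objects with `id` and `type` keys) is my understanding of the endpoint rather than something confirmed in this thread.

```python
import json
import urllib.parse
import urllib.request

REST_SERVER = "https://rest.ensembl.org"


def xrefs_symbol_url(species, symbol):
    """Build the URL for GET xrefs/symbol/:species/:symbol."""
    path = "/xrefs/symbol/{}/{}".format(
        urllib.parse.quote(species, safe=""),
        urllib.parse.quote(symbol, safe=""),
    )
    return REST_SERVER + path + "?content-type=application/json"


def fetch_json(url):
    """Perform the GET request and decode the JSON body (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def gene_ids(xrefs):
    """Keep the stable IDs of entries whose type is 'gene' (ASSUMED response shape)."""
    return [entry["id"] for entry in xrefs if entry.get("type") == "gene"]
```

Usage would be something like `gene_ids(fetch_json(xrefs_symbol_url("homo_sapiens", "BRCA2")))`, where BRCA2 is just an example symbol.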
BioMart has the data too. Would it be easier to use than the Ensembl Perl API?
# An example script demonstrating the use of BioMart API.
# This perl API representation is only available for configuration versions >= 0.5
use strict;
use BioMart::Initializer;
use BioMart::Query;
use BioMart::QueryRunner;
my $confFile = "PATH TO YOUR REGISTRY FILE UNDER biomart-perl/conf/. For Biomart Central Registry navigate to
http://www.biomart.org/biomart/martservice?type=registry";
#
# NB: change action to 'clean' if you wish to start a fresh configuration
# and to 'cached' if you want to skip configuration step on subsequent runs from the same registry
#
my $action='cached';
my $initializer = BioMart::Initializer->new('registryFile'=>$confFile, 'action'=>$action);
my $registry = $initializer->getRegistry;
my $query = BioMart::Query->new('registry'=>$registry,'virtualSchemaName'=>'default');
$query->setDataset("hsapiens_gene_ensembl");
$query->addAttribute("ensembl_gene_id");
$query->addAttribute("ensembl_gene_id_version");
$query->addAttribute("ensembl_transcript_id");
$query->addAttribute("ensembl_transcript_id_version");
$query->addAttribute("variation_name");
$query->addAttribute("polyphen_score_2076");
$query->addAttribute("polyphen_prediction_2076");
$query->addAttribute("sift_prediction_2076");
$query->addAttribute("sift_score_2076");
$query->addAttribute("peptide_location");
$query->formatter("TSV");
my $query_runner = BioMart::QueryRunner->new();
############################## GET COUNT ############################
# $query->count(1);
# $query_runner->execute($query);
# print $query_runner->getCount();
#####################################################################
############################## GET RESULTS ##########################
# to obtain unique rows only
# $query_runner->uniqueRowsOnly(1);
$query_runner->execute($query);
$query_runner->printHeader();
$query_runner->printResults();
$query_runner->printFooter();
#####################################################################
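If installing biomart-perl turns out to be a hassle, the same query can probably be sent to a BioMart martservice endpoint as an XML document. Below is a sketch that only builds that XML from the attribute list in the Perl script above. The `header`, `uniqueRows` and `datasetConfigVersion` values are assumptions on my part, and I have not checked that all of these attributes can be combined in one query.

```python
import xml.etree.ElementTree as ET

# Attribute names copied from the Perl script above.
ATTRIBUTES = [
    "ensembl_gene_id",
    "ensembl_gene_id_version",
    "ensembl_transcript_id",
    "ensembl_transcript_id_version",
    "variation_name",
    "polyphen_score_2076",
    "polyphen_prediction_2076",
    "sift_prediction_2076",
    "sift_score_2076",
    "peptide_location",
]


def build_query(dataset, attributes):
    """Return a BioMart XML query string for one dataset, TSV output."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",                  # ASSUMPTION: include a header row
        "uniqueRows": "0",              # ASSUMPTION: mirrors the commented-out uniqueRowsOnly
        "datasetConfigVersion": "0.6",  # ASSUMPTION: not taken from this thread
    })
    dataset_node = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name in attributes:
        ET.SubElement(dataset_node, "Attribute", {"name": name})
    prolog = '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>'
    return prolog + ET.tostring(query, encoding="unicode")


xml_query = build_query("hsapiens_gene_ensembl", ATTRIBUTES)
print(xml_query)
```

The resulting string would be passed as the `query` parameter to the martservice URL of whichever BioMart server we pick.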
According to DEPRECATED.md, fetch_all_by_external_name() was removed in Ensembl v92:
Bio::EnsEMBL::Funcgen::DBSQL::ProbeSetAdaptor::fetch_all_by_external_name()
This is the error I get:
root@f8f8a1fd18a4:/opt# ./gene_extra_info.pl
Directory '/tmp/Homo sapiens' does not exist, creating directory...
In vertebrates section
Human selected, assembly GRCh37 selected, connecting to port 3337
Can't locate object method "fetch_all_by_external_name" via package
"Bio::EnsEMBL::Funcgen::DBSQL::ProbeSetAdaptor" at ./gene_extra_info.pl line 220.
According to the API, we should use:
fetch_all_by_transcript_stable_id()
my $probe_set_list = $probe_set_adaptor->fetch_all_by_transcript_stable_id('ENST00000489935');
Description: Fetches all probe_sets that have been mapped to this transcript
by the probe2transcript step in the probemapping pipeline.
There is also fetch_all_by_name('ProbeSet1'), which sounds close but takes a probe set name instead of a transcript.
Ensembl do not plan to create a docker image for the Perl API:
Hi Julie
Many thanks for your message.
I am afraid we do not plan to create a docker image.
If you let me know what you are trying to do and at what scale, I should be
able to advise on the best approach.
I hope this helps, but please do not hesitate to contact us again.
Best wishes
Astrid
Ensembl helpdesk
SIFT/PolyPhen scores look okay:
root@277df85f6c27:/opt# ./protein_function_prediction_matrices.pl
Connecting...
Human selected, assembly GRCh37 selected, connecting to port 3337
Retrieving Chromosomes: 22,
4459 transcripts fetched!
The cytobands script also ran quickly, and the JSON looks correct.
The data is available via REST.
The /info/assembly endpoints can provide cytoband information, for example http://rest.ensembl.org/info/assembly/homo_sapiens/X?content-type=application/json;bands=1
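A small sketch of turning that /info/assembly response into a flat band table. It assumes the JSON carries a `karyotype_band` list whose entries have `id`, `start`, `end`, `stain` and `seq_region_name` keys; that is my reading of the endpoint, so worth checking against a real response. The function names are made up.

```python
import json
import urllib.request

REST_SERVER = "https://rest.ensembl.org"


def assembly_url(species, region):
    """Build the /info/assembly URL with cytoband output switched on (bands=1)."""
    return "{}/info/assembly/{}/{}?content-type=application/json;bands=1".format(
        REST_SERVER, species, region
    )


def fetch_json(url):
    """Perform the GET request and decode the JSON body (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def band_table(assembly_info):
    """Return (chromosome, band, start, end, stain) tuples sorted by start position."""
    # ASSUMPTION: bands are listed under the "karyotype_band" key.
    bands = assembly_info.get("karyotype_band", [])
    rows = [
        (band["seq_region_name"], band["id"], band["start"], band["end"], band.get("stain"))
        for band in bands
    ]
    return sorted(rows, key=lambda row: row[2])
```

Calling `band_table(fetch_json(assembly_url("homo_sapiens", "X")))` once per chromosome should give the full cytoband set.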
SIFT and PolyPhen scores are provided by default on all the VEP endpoints, for example http://rest.ensembl.org/vep/human/id/rs56116432?content-type=application/json
as well as on the /overlap/translation endpoint when variants are requested, for example http://rest.ensembl.org/overlap/translation/ENSP00000288602?type=missense_variant;content-type=application/json;feature=transcript_variation
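To judge whether this is complete enough, here is a sketch of pulling the scores out of a VEP response. It assumes the response is a list of results, each with a `transcript_consequences` list whose entries may carry `sift_score`, `sift_prediction`, `polyphen_score` and `polyphen_prediction`; those field names are from memory and should be verified against a live call. The function names are made up.

```python
import json
import urllib.request

REST_SERVER = "https://rest.ensembl.org"


def vep_id_url(species, variant_id):
    """Build the URL for the VEP-by-variant-ID endpoint."""
    return "{}/vep/{}/id/{}?content-type=application/json".format(
        REST_SERVER, species, variant_id
    )


def fetch_json(url):
    """Perform the GET request and decode the JSON body (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def prediction_rows(vep_response):
    """Flatten a VEP response into one dict per transcript that has a score."""
    rows = []
    for result in vep_response:
        # ASSUMPTION: per-transcript data sits under "transcript_consequences".
        for consequence in result.get("transcript_consequences", []):
            if "sift_score" not in consequence and "polyphen_score" not in consequence:
                continue  # e.g. non-coding consequences carry no prediction
            rows.append({
                "variant": result.get("id"),
                "transcript": consequence.get("transcript_id"),
                "sift_score": consequence.get("sift_score"),
                "sift_prediction": consequence.get("sift_prediction"),
                "polyphen_score": consequence.get("polyphen_score"),
                "polyphen_prediction": consequence.get("polyphen_prediction"),
            })
    return rows
```

For the example above this would be `prediction_rows(fetch_json(vep_id_url("human", "rs56116432")))`. Note this goes variant by variant, so it may not scale the way the per-transcript prediction matrices do.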
Is it complete? Good enough?
With regard to cytoband retrieval using the Ensembl REST API, I previously wrote a short R script for that:
https://github.com/ramiromagno/gwasrapidd/blob/master/data-raw/cytogenetic_bands.R
This is the tidy dataset generated: https://github.com/ramiromagno/gwasrapidd/blob/master/data-raw/cytogenetic_bands.csv
Script status