opencb / cellbase

High-performance NoSQL database and RESTful web services to access the most relevant biological data
Apache License 2.0

Can we replace Ensembl Perl API with something else? #437

Open julie-sullivan opened 4 years ago

julie-sullivan commented 4 years ago
| Solution | Pros | Cons |
| --- | --- | --- |
| Perl API | Scripts already written; can use Docker instead of installing Perl modules | Have to maintain scripts |
| MySQL | Don't have to rely on the API | Database is too big; have to maintain SQL queries |
| MySQL tables | Don't have to load the entire database | Files are huge; schema might change |
| BioMart | Has all the data | Unreliable |
| REST API | No libraries needed; can write a nice Java client | Not all data is present |
| Data file | Easiest | The data we need isn't present in the data files |

Script status

julie-sullivan commented 4 years ago

Here is a data file:

+++++++++++++++++++++++++
Stable ID to uniprot Ac
+++++++++++++++++++++++++

Provides mappings from Gene, Transcript and Translation stable identifiers to 
UniProt accessions, with reports of the % identity of the hit where 
applicable. Dumps contain all Ensembl external database names which start
with UniProt, so duplication of hits is possible.

ftp://ftp.ensembl.org/pub/grch37/release-98/tsv/homo_sapiens/

gene_stable_id  transcript_stable_id    protein_stable_id   xref    db_name info_type   source_identity xref_identity   linkage_type
ENSG00000186092 ENST00000335137 ENSP00000334393 Q8NH21  Uniprot/SWISSPROT   DIRECT  -   -   -
ENSG00000237683 ENST00000423372 ENSP00000473460 B7Z7W4  Uniprot/SPTREMBL    SEQUENCE_MATCH  56  100 -
ENSG00000237683 ENST00000423372 ENSP00000473460 R4GN28  Uniprot/SPTREMBL    SEQUENCE_MATCH  100 100 -
species taxid   gene_stable_id  transcript_stable_id    protein_stable_id   primary_accession   secondary_accession
Homo_sapiens    9606    ENSG00000000003 ENST00000373020 ENSP00000362111 CM000685    AAC69710
Homo_sapiens    9606    ENSG00000000003 ENST00000496771     CM000685    
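
If the data files did turn out to be enough, reading them would be simple. Here is a minimal sketch that groups UniProt accessions by transcript, using the three rows from the first sample above. It assumes the real dump is tab-separated with the header line shown; the function name `uniprot_by_transcript` is made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Header and rows copied from the sample above.
# Assumption: the real file separates fields with tabs.
HEADER = ["gene_stable_id", "transcript_stable_id", "protein_stable_id", "xref",
          "db_name", "info_type", "source_identity", "xref_identity", "linkage_type"]
ROWS = [
    ["ENSG00000186092", "ENST00000335137", "ENSP00000334393", "Q8NH21",
     "Uniprot/SWISSPROT", "DIRECT", "-", "-", "-"],
    ["ENSG00000237683", "ENST00000423372", "ENSP00000473460", "B7Z7W4",
     "Uniprot/SPTREMBL", "SEQUENCE_MATCH", "56", "100", "-"],
    ["ENSG00000237683", "ENST00000423372", "ENSP00000473460", "R4GN28",
     "Uniprot/SPTREMBL", "SEQUENCE_MATCH", "100", "100", "-"],
]
SAMPLE = "\n".join("\t".join(fields) for fields in [HEADER] + ROWS) + "\n"


def uniprot_by_transcript(handle):
    """Map each transcript stable ID to a list of (accession, db_name) pairs."""
    mapping = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        mapping[row["transcript_stable_id"]].append((row["xref"], row["db_name"]))
    return dict(mapping)


mapping = uniprot_by_transcript(io.StringIO(SAMPLE))
for transcript_id, xrefs in mapping.items():
    print(transcript_id, xrefs)
```

For the real file, the `io.StringIO(SAMPLE)` handle would be replaced by an open file handle. Because the dump can contain duplicate hits, a transcript may map to several accessions.
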
julie-sullivan commented 4 years ago

From reading the Ensembl docs, I think the Perl API is the best Ensembl have to offer, given the data we need.

julie-sullivan commented 4 years ago

I don't see a Docker image, only a VM: https://www.ensembl.org/info/data/virtual_machine.html

julie-sullivan commented 4 years ago

Here are the mysql files:

http://ftp.ensemblorg.ebi.ac.uk/pub/release-89/mysql/homo_sapiens_core_89_38/

Which tables do we need?


FYI the protein_function_predictions table takes about an hour to download:

http://ftp.ensemblorg.ebi.ac.uk/pub/current_mysql/homo_sapiens_variation_98_38/protein_function_predictions.txt.gz
julie-sullivan commented 4 years ago

OR we could use the REST API:

https://rest.ensembl.org/ GET xrefs/symbol/:species/:symbol - xrefs

But I don't see a way to get sift/polyphen scores or cytoband info.
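
For reference, calling that endpoint needs nothing beyond an HTTP client. Here is a minimal sketch using only the standard library. The helper names `xrefs_symbol_url` and `fetch_json` are made up for illustration, the `BRCA2` symbol is just an example, and the `content-type` query parameter follows the style of the Ensembl REST examples.

```python
import json
import urllib.parse
import urllib.request

REST_SERVER = "https://rest.ensembl.org"


def xrefs_symbol_url(species, symbol):
    """Build the URL for GET xrefs/symbol/:species/:symbol."""
    path = "/xrefs/symbol/{}/{}".format(
        urllib.parse.quote(species, safe=""),
        urllib.parse.quote(symbol, safe=""),
    )
    return REST_SERVER + path + "?content-type=application/json"


def fetch_json(url):
    """Fetch a URL and decode the JSON body (needs network access)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


url = xrefs_symbol_url("homo_sapiens", "BRCA2")
print(url)
# To actually query the server: xrefs = fetch_json(url)
```
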

julie-sullivan commented 4 years ago

BioMart has the data too. Would it be easier to use than the Ensembl Perl API?


# An example script demonstrating the use of BioMart API.
# This perl API representation is only available for configuration versions >=  0.5 
use strict;
use BioMart::Initializer;
use BioMart::Query;
use BioMart::QueryRunner;

my $confFile = "PATH TO YOUR REGISTRY FILE UNDER biomart-perl/conf/. For Biomart Central Registry navigate to
                        http://www.biomart.org/biomart/martservice?type=registry";
#
# NB: change action to 'clean' if you wish to start a fresh configuration  
# and to 'cached' if you want to skip configuration step on subsequent runs from the same registry
#

my $action='cached';
my $initializer = BioMart::Initializer->new('registryFile'=>$confFile, 'action'=>$action);
my $registry = $initializer->getRegistry;

my $query = BioMart::Query->new('registry'=>$registry,'virtualSchemaName'=>'default');

    $query->setDataset("hsapiens_gene_ensembl");
    $query->addAttribute("ensembl_gene_id");
    $query->addAttribute("ensembl_gene_id_version");
    $query->addAttribute("ensembl_transcript_id");
    $query->addAttribute("ensembl_transcript_id_version");
    $query->addAttribute("variation_name");
    $query->addAttribute("polyphen_score_2076");
    $query->addAttribute("polyphen_prediction_2076");
    $query->addAttribute("sift_prediction_2076");
    $query->addAttribute("sift_score_2076");
    $query->addAttribute("peptide_location");

$query->formatter("TSV");

my $query_runner = BioMart::QueryRunner->new();
############################## GET COUNT ############################
# $query->count(1);
# $query_runner->execute($query);
# print $query_runner->getCount();
#####################################################################

############################## GET RESULTS ##########################
# to obtain unique rows only
# $query_runner->uniqueRowsOnly(1);

$query_runner->execute($query);
$query_runner->printHeader();
$query_runner->printResults();
$query_runner->printFooter();
#####################################################################
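
The same query can probably be sent without the BioMart Perl modules, because BioMart also accepts an XML query over HTTP. Here is a rough sketch that builds the XML for the attributes used in the script above. The `martservice` URL, the `datasetConfigVersion` value and the helper names are assumptions and have not been checked against the live service.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Assumption: the Ensembl BioMart service is reachable at this URL.
MART_SERVICE = "http://www.ensembl.org/biomart/martservice"

# Attribute names copied from the Perl script above.
ATTRIBUTES = [
    "ensembl_gene_id", "ensembl_gene_id_version",
    "ensembl_transcript_id", "ensembl_transcript_id_version",
    "variation_name",
    "polyphen_score_2076", "polyphen_prediction_2076",
    "sift_prediction_2076", "sift_score_2076",
    "peptide_location",
]


def build_query_xml(dataset, attributes):
    """Return a BioMart XML query asking for TSV output."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "0",
        "datasetConfigVersion": "0.6",  # assumption
    })
    dataset_el = ET.SubElement(query, "Dataset",
                               {"name": dataset, "interface": "default"})
    for name in attributes:
        ET.SubElement(dataset_el, "Attribute", {"name": name})
    body = ET.tostring(query, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>' + body


def build_query_url(dataset, attributes):
    """Encode the XML query as the `query` parameter of the service URL."""
    params = urllib.parse.urlencode({"query": build_query_xml(dataset, attributes)})
    return MART_SERVICE + "?" + params


query_xml = build_query_xml("hsapiens_gene_ensembl", ATTRIBUTES)
print(query_xml)
```

This only removes the Perl dependency. It does not address the reliability concern about BioMart itself.
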
julie-sullivan commented 4 years ago

According to DEPRECATED.md, fetch_all_by_external_name() was removed in ensembl v92

Bio::EnsEMBL::Funcgen::DBSQL::ProbeSetAdaptor::fetch_all_by_external_name()

This is the error I get:

root@f8f8a1fd18a4:/opt# ./gene_extra_info.pl
Directory '/tmp/Homo sapiens' does not exist, creating directory...
In vertebrates section
Human selected, assembly GRCh37 selected, connecting to port 3337
Can't locate object method "fetch_all_by_external_name" via package 
"Bio::EnsEMBL::Funcgen::DBSQL::ProbeSetAdaptor" at ./gene_extra_info.pl line 220.

According to the API, we should use:

fetch_all_by_transcript_stable_id()

 my $probe_set_list = $probe_set_adaptor->fetch_all_by_transcript_stable_id('ENST00000489935');

  Description: Fetches all probe_sets that have been mapped to this transcript 
  by the probe2transcript step in the probemapping pipeline.

There is also fetch_all_by_name('ProbeSet1'), which sounds close but takes a probe set name instead of a transcript ID.

julie-sullivan commented 4 years ago

Ensembl do not plan to create a docker image for the Perl API:

Hi Julie

Many thanks for your message.

I am afraid we do not plan to create a docker image.

If you let me know what you are trying to do and at what scale, I should be
able to advise on the best approach.

I hope this helps, but please do not hesitate to contact us again.

Best wishes
Astrid
Ensembl helpdesk
julie-sullivan commented 4 years ago

sift/polyphen scores look okay:

root@277df85f6c27:/opt# ./protein_function_prediction_matrices.pl 
Connecting...
Human selected, assembly GRCh37 selected, connecting to port 3337
Retrieving Chromosomes: 22, 
4459 transcripts fetched!
julie-sullivan commented 4 years ago

and the cytobands script ran quickly, and the JSON looks correct.

julie-sullivan commented 4 years ago

The data is available via REST

The /info/assembly endpoints can provide cytoband information, for example http://rest.ensembl.org/info/assembly/homo_sapiens/X?content-type=application/json;bands=1
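
Here is a minimal sketch of turning that response into cytoband rows. It assumes the bands come back under a `karyotype_band` key, with `id`, `seq_region_name`, `start`, `end` and `stain` fields. The sample payload uses placeholder coordinates, not real ones, and the helper names are made up.

```python
REST_SERVER = "http://rest.ensembl.org"


def assembly_url(species, region):
    """Build the /info/assembly URL with band information switched on."""
    return "{}/info/assembly/{}/{}?content-type=application/json;bands=1".format(
        REST_SERVER, species, region)


# Placeholder payload in the assumed response shape; coordinates are invented.
SAMPLE_RESPONSE = {
    "karyotype_band": [
        {"id": "p11.1", "seq_region_name": "X", "start": 201, "end": 300, "stain": "acen"},
        {"id": "p22.33", "seq_region_name": "X", "start": 1, "end": 100, "stain": "gneg"},
    ]
}


def cytobands(assembly_info):
    """Return (chromosome, start, end, band, stain) tuples sorted by start."""
    bands = assembly_info.get("karyotype_band", [])
    rows = [
        (b["seq_region_name"], b["start"], b["end"], b["id"], b.get("stain"))
        for b in bands
    ]
    return sorted(rows, key=lambda row: row[1])


print(assembly_url("homo_sapiens", "X"))
for row in cytobands(SAMPLE_RESPONSE):
    print(row)
```
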

Sift and Polyphen scores are provided by default on all the VEP endpoints, for example http://rest.ensembl.org/vep/human/id/rs56116432?content-type=application/json
as well as on the /overlap/translation endpoint when variants are requested, for example http://rest.ensembl.org/overlap/translation/ENSP00000288602?type=missense_variant;content-type=application/json;feature=transcript_variation

Is it complete? Good enough?
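
One way to check completeness would be to pull the scores out of the VEP responses and compare them with the Perl script output. Here is a minimal sketch of the extraction step. It assumes each VEP record has a `transcript_consequences` list whose entries may carry `sift_score`, `sift_prediction`, `polyphen_score` and `polyphen_prediction`. The sample record uses placeholder IDs and values, not real results.

```python
# Placeholder record in the assumed VEP response shape; values are invented.
SAMPLE_VEP = [
    {
        "id": "rs_example",
        "transcript_consequences": [
            {
                "transcript_id": "ENST_example_1",
                "sift_score": 0.01,
                "sift_prediction": "deleterious",
                "polyphen_score": 0.95,
                "polyphen_prediction": "probably_damaging",
            },
            # Consequences outside coding sequence carry no scores.
            {"transcript_id": "ENST_example_2"},
        ],
    }
]

SCORE_KEYS = ("sift_score", "sift_prediction", "polyphen_score", "polyphen_prediction")


def protein_predictions(vep_records):
    """Collect SIFT/PolyPhen values per variant and transcript, skipping
    transcript consequences that carry none of the score fields."""
    rows = []
    for record in vep_records:
        for consequence in record.get("transcript_consequences", []):
            if not any(key in consequence for key in SCORE_KEYS):
                continue
            row = {"variant": record.get("id"),
                   "transcript_id": consequence.get("transcript_id")}
            for key in SCORE_KEYS:
                row[key] = consequence.get(key)
            rows.append(row)
    return rows


predictions = protein_predictions(SAMPLE_VEP)
for row in predictions:
    print(row)
```
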

julie-sullivan commented 4 years ago
With regard to cytoband retrieval using the Ensembl REST API, I have in the past written a short R script for that:

https://github.com/ramiromagno/gwasrapidd/blob/master/data-raw/cytogenetic_bands.R

This is the tidy dataset generated: https://github.com/ramiromagno/gwasrapidd/blob/master/data-raw/cytogenetic_bands.csv