phenopolis / phenopolis_genomics_browser

Python API and React frontend for the Phenopolis Genomics Browser
https://dev-live.phenopolis.org
MIT License

Refactoring API for using DB new schema #371

Closed alanwilter closed 2 years ago

alanwilter commented 3 years ago

Files that need work:

alanwilter commented 3 years ago

@pontikos I'm dealing with gene. In the old schema we have xstart and xstop. They are not in the new schema but, just out of curiosity, what are they? Their values are usually different from start and stop.

alanwilter commented 3 years ago

@pontikos My current query for genes in the new schema is this:

-- gene new schema
-- full_gene_name -> description
-- gene_id -> ensembl_gene_id
-- xstart, xstop: GONE
-- chrom -> chromosome
-- stop -> end
-- gene_name_upper -> hgnc_symbol.UPPER
-- gene_name -> hgnc_symbol
-- canonical_transcript (e.g. ENST00000536175): to be RESOLVED
select
    --gs.gene,
    array_agg(distinct gs.external_synonym order by gs.external_synonym) as other_names,
    --g.identifier,
    g.ensembl_gene_id, g."version", g.description, g.chromosome, g."start", g."end",
    g.strand, g.band, g.biotype, g.hgnc_id, g.hgnc_symbol,
    g.percentage_gene_gc_content, g.assembly
from ensembl.gene g
join ensembl.gene_synonym gs on gs.gene = g.identifier
where g.chromosome ~ '^X|^Y|^[0-9]{1,2}' and g.assembly = 'GRCh37'
    and g.ensembl_gene_id = 'ENSG00000114999'
    --and g.hgnc_symbol = 'TTLL5'
group by
    g.ensembl_gene_id, g."version", g.description, g.chromosome, g."start", g."end",
    g.strand, g.band, g.biotype, g.hgnc_id, g.hgnc_symbol,
    g.percentage_gene_gc_content, g.assembly
;

We're going to have new attributes, as seen above, but there is one missing (possibly more): canonical_transcript.
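The renames listed in the query comments above could be captured as a small lookup while the API is being refactored. This is only a minimal sketch, assuming the API should keep serving the old field names for now; `OLD_TO_NEW`, `DROPPED`, `UNRESOLVED` and `to_old_style` are hypothetical names, not part of the Phenopolis codebase, and the example row uses made-up placeholder values.

```python
# Hypothetical helper, not taken from the Phenopolis codebase.
# Old-schema field -> new-schema column, per the comments in the query above.
OLD_TO_NEW = {
    "full_gene_name": "description",
    "gene_id": "ensembl_gene_id",
    "chrom": "chromosome",
    "start": "start",
    "stop": "end",
    "gene_name": "hgnc_symbol",
}
# Old fields that have no counterpart in the new schema.
DROPPED = ("xstart", "xstop")
# Old fields whose new source is still to be resolved.
UNRESOLVED = ("canonical_transcript",)


def to_old_style(row: dict) -> dict:
    """Build an old-style gene record from a new-schema row keyed by column name."""
    out = {old: row.get(new) for old, new in OLD_TO_NEW.items()}
    # gene_name_upper is derived from hgnc_symbol rather than stored.
    symbol = row.get("hgnc_symbol")
    out["gene_name_upper"] = symbol.upper() if symbol else None
    return out


# Placeholder row: the values are invented for illustration only.
example = to_old_style(
    {"hgnc_symbol": "abc1", "chromosome": "1", "start": 100, "end": 200}
)
print(example["gene_name_upper"], example["chrom"], example["stop"])
```

Keeping the dropped and unresolved fields in explicit tuples makes it easy to spot them when reviewing each endpoint.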

I have several tables in the ensembl schema related to transcripts, but I'm wondering which attributes you'd like to have, @pontikos.

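| gene | transcript |
| --- | --- |
| 1 | 1 |

| identifier | ensembl_gene_id | ensembl_transcript_id | version | ensembl_peptide_id | peptide_version | chromosome | start | end | transcription_start_site | strand | transcript_length | cds_length | biotype | uniparc | assembly | canonical |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | ENSG00000261657 | ENST00000566782 | 1 | ENSP00000456546 | 1 | HG991_PATCH | 66119285 | 66456619 | 66119285 | 1 | 2673 | 825 | protein_coding | UPI000003615A | GRCh37 | true |

| transcript | exon |
| --- | --- |
| 1 | 1 |

| transcript | uniprotswissprot |
| --- | --- |
| 2 | P35070 |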

I would take at least ENSP, ENST and UniProtID. But look at the tables, especially ensembl.transcript, and tell me if you'd like anything else.
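One possible way to resolve canonical_transcript is to join from the gene through a link table to the transcript and filter on the canonical flag. The sketch below uses an in-memory SQLite database as a stand-in for PostgreSQL so it can run anywhere. The table names `gene_transcript` and `transcript_uniprot` are guesses based on the column headers shown above, the gene row is inferred from the ensembl_gene_id in the sample transcript row, and `canonical_transcript` is a hypothetical helper.

```python
import sqlite3

# In-memory SQLite stands in for the PostgreSQL ensembl schema (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table gene (identifier integer primary key, ensembl_gene_id text);
    create table transcript (
        identifier integer primary key,
        ensembl_transcript_id text,
        ensembl_peptide_id text,
        canonical integer
    );
    -- Link table names below are guesses, not confirmed by the thread.
    create table gene_transcript (gene integer, transcript integer);
    create table transcript_uniprot (transcript integer, uniprotswissprot text);

    insert into gene values (1, 'ENSG00000261657');
    insert into transcript values (1, 'ENST00000566782', 'ENSP00000456546', 1);
    insert into gene_transcript values (1, 1);
    insert into transcript_uniprot values (2, 'P35070');
    """
)

CANONICAL_SQL = """
select t.ensembl_transcript_id, t.ensembl_peptide_id, tu.uniprotswissprot
from gene g
join gene_transcript gt on gt.gene = g.identifier
join transcript t on t.identifier = gt.transcript
left join transcript_uniprot tu on tu.transcript = t.identifier
where g.ensembl_gene_id = ? and t.canonical = 1
"""


def canonical_transcript(conn, ensembl_gene_id):
    """Return (ENST, ENSP, UniProt ID) for the canonical transcript, or None."""
    return conn.execute(CANONICAL_SQL, (ensembl_gene_id,)).fetchone()


result = canonical_transcript(conn, "ENSG00000261657")
print(result)  # ('ENST00000566782', 'ENSP00000456546', None)
```

The left join keeps transcripts that have no UniProt entry, which is why the sample transcript comes back with `None` in the last position. If a transcript maps to several UniProt IDs, this query returns one row per ID and `fetchone` picks only the first, so an aggregate might be preferable in the real query.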