Refactoring API for using DB new schema

alanwilter commented 3 years ago

Files that need work:

[x] views/gene.py
[x] views/autocomplete.py
[x] views/statistics.py

alanwilter commented 3 years ago

@pontikos I'm working on gene. The old schema has xstart and xstop, which are not in the new schema. Just out of curiosity, what are they? Their values are usually different from start and stop.

alanwilter commented 3 years ago

@pontikos My current query for genes in the new schema is this:

-- gene new schema
-- full_gene_name -> description
-- gene_id -> ensembl_gene_id
-- xstart, xstop: GONE
-- chrom -> chromosome
-- stop -> end
-- gene_name_upper -> hgnc_symbol.UPPER
-- gene_name -> hgnc_symbol
-- canonical_transcript (e.g. ENST00000536175): to be RESOLVED
select 
--gs.gene, 
array_agg(distinct gs.external_synonym order by gs.external_synonym) as other_names,
--g.identifier, 
g.ensembl_gene_id, g."version", g.description, g.chromosome, g."start", g."end", g.strand, g.band, g.biotype, g.hgnc_id, g.hgnc_symbol, g.percentage_gene_gc_content, g.assembly
from ensembl.gene g
join ensembl.gene_synonym gs on gs.gene = g.identifier 
where g.chromosome ~ '^X|^Y|^[0-9]{1,2}' and g.assembly = 'GRCh37'
and g.ensembl_gene_id = 'ENSG00000114999'
--and g.hgnc_symbol = 'TTLL5'
group by
g.ensembl_gene_id, g."version", g.description, g.chromosome, g."start", g."end", g.strand, g.band, g.biotype, g.hgnc_id, g.hgnc_symbol, g.percentage_gene_gc_content, g.assembly
;
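If the API has to keep returning the old field names, the renames listed in the query comments could be applied in the view with a small mapping. This is only a sketch: `OLD_TO_NEW` and `to_old_keys` are hypothetical names, the row is assumed to arrive as a dict keyed by the new column names, and whether the API should keep the old names at all is still an open choice. xstart, xstop and canonical_transcript are deliberately left out because they are gone or unresolved.

```python
# Hypothetical helper: rename new-schema gene columns back to the old API keys.
# The pairs below are copied from the comments at the top of the query.
OLD_TO_NEW = {
    "full_gene_name": "description",
    "gene_id": "ensembl_gene_id",
    "chrom": "chromosome",
    "stop": "end",
    "gene_name": "hgnc_symbol",
}


def to_old_keys(row):
    """Return a dict with the old key names, filled from a new-schema row."""
    out = {old: row.get(new) for old, new in OLD_TO_NEW.items()}
    # gene_name_upper has no column of its own any more: derive it from hgnc_symbol.
    symbol = row.get("hgnc_symbol")
    out["gene_name_upper"] = symbol.upper() if symbol else None
    return out


if __name__ == "__main__":
    # Made-up row, for illustration only.
    demo_row = {
        "description": "demo gene",
        "ensembl_gene_id": "ENSG_DEMO",
        "chromosome": "2",
        "end": 200,
        "hgnc_symbol": "Demo1",
    }
    print(to_old_keys(demo_row))
```

Columns that keep their name (start, strand, and so on) are not in the mapping and would need to be copied across as they are.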

We're going to have new attributes, as seen above, but at least one is missing (possibly more): canonical_transcript.

I have several tables in the ensembl schema related to transcripts, but I'm wondering which attributes you'd like to have, @pontikos.

ensembl.gene_transcript

gene	transcript
1	1

ensembl.transcript

identifier	ensembl_gene_id	ensembl_transcript_id	version	ensembl_peptide_id	peptide_version	chromosome	start	end	transcription_start_site	strand	transcript_length	cds_length	biotype	uniparc	assembly	canonical
1	ENSG00000261657	ENST00000566782	1	ENSP00000456546	1	HG991_PATCH	66119285	66456619	66119285	1	2673	825	protein_coding	UPI000003615A	GRCh37	true

ensembl.transcript_exon

transcript	exon
1	1

ensembl.transcript_uniprot

transcript	uniprotswissprot
2	P35070

I would take at least ENSP, ENST and UniProtID. But look at the tables, especially ensembl.transcript, and tell me if you'd like anything else.
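One possible way to resolve canonical_transcript from the tables above is sketched below. It assumes that gene_transcript.gene and gene_transcript.transcript point at gene.identifier and transcript.identifier (by analogy with gene_synonym in the gene query), and that transcript_uniprot.transcript points at transcript.identifier. To keep the example self-contained it runs against an in-memory SQLite database with made-up placeholder rows; `canonical_transcripts` and `demo_connection` are hypothetical names, and the real tables live in PostgreSQL.

```python
import sqlite3


def canonical_transcripts(conn, ensembl_gene_id):
    """Return (ENST, ENSP, UniProt) for the canonical transcript(s) of a gene."""
    sql = """
        select t.ensembl_transcript_id, t.ensembl_peptide_id, tu.uniprotswissprot
        from ensembl.gene g
        join ensembl.gene_transcript gt on gt.gene = g.identifier
        join ensembl.transcript t on t.identifier = gt.transcript
        -- left join: keep transcripts that have no UniProt entry
        left join ensembl.transcript_uniprot tu on tu.transcript = t.identifier
        where g.ensembl_gene_id = ? and t.canonical
        order by t.ensembl_transcript_id
    """
    return conn.execute(sql, (ensembl_gene_id,)).fetchall()


def demo_connection():
    """Build a toy copy of the four tables; every value is a placeholder."""
    conn = sqlite3.connect(":memory:")
    # Attach a second in-memory database so the "ensembl." prefix resolves.
    conn.execute("attach database ':memory:' as ensembl")
    conn.executescript("""
        create table ensembl.gene (identifier integer primary key, ensembl_gene_id text);
        create table ensembl.transcript (
            identifier integer primary key,
            ensembl_transcript_id text,
            ensembl_peptide_id text,
            canonical integer  -- boolean in PostgreSQL, 0/1 here
        );
        create table ensembl.gene_transcript (gene integer, transcript integer);
        create table ensembl.transcript_uniprot (transcript integer, uniprotswissprot text);

        insert into ensembl.gene values (1, 'ENSG_DEMO_A'), (2, 'ENSG_DEMO_B');
        insert into ensembl.transcript values
            (1, 'ENST_DEMO_1', 'ENSP_DEMO_1', 1),
            (2, 'ENST_DEMO_2', 'ENSP_DEMO_2', 0),
            (3, 'ENST_DEMO_3', 'ENSP_DEMO_3', 1);
        insert into ensembl.gene_transcript values (1, 1), (1, 2), (2, 3);
        insert into ensembl.transcript_uniprot values (1, 'UNIPROT_DEMO');
    """)
    return conn


if __name__ == "__main__":
    conn = demo_connection()
    print(canonical_transcripts(conn, "ENSG_DEMO_A"))
    print(canonical_transcripts(conn, "ENSG_DEMO_B"))
```

Against PostgreSQL the `?` placeholder would become whatever the driver expects (for example `%s` with psycopg2). A transcript linked to several UniProt entries would come back as several rows, so the view might want to aggregate them the same way the gene query aggregates synonyms.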

phenopolis / phenopolis_genomics_browser

Refactoring API for using DB new schema #371