alanwilter closed 2 years ago
@pontikos I'm dealing with `gene`. In the old schema we have `xstart` and `xstop`. They are not in the new schema but, just out of curiosity, what are they, since their values are usually different from `start` and `stop`?
@pontikos My current query for genes in the new schema is this:
-- gene new schema
-- full_gene_name -> description
-- gene_id -> ensembl_gene_id
-- xstart, xstop: GONE
-- chrom -> chromosome
-- stop -> end
-- gene_name_upper -> hgnc_symbol.UPPER
-- gene_name -> hgnc_symbol
-- canonical_transcript (e.g. ENST00000536175): to be RESOLVED
select
--gs.gene,
array_agg(distinct gs.external_synonym order by gs.external_synonym) as other_names,
--g.identifier,
g.ensembl_gene_id, g."version", g.description, g.chromosome, g."start", g."end", g.strand, g.band, g.biotype, g.hgnc_id, g.hgnc_symbol, g.percentage_gene_gc_content, g.assembly
from ensembl.gene g
join ensembl.gene_synonym gs on gs.gene = g.identifier
where g.chromosome ~ '^X|^Y|^[0-9]{1,2}' and g.assembly = 'GRCh37'
and g.ensembl_gene_id = 'ENSG00000114999'
--and g.hgnc_symbol = 'TTLL5'
group by
g.ensembl_gene_id, g."version", g.description, g.chromosome, g."start", g."end", g.strand, g.band, g.biotype, g.hgnc_id, g.hgnc_symbol, g.percentage_gene_gc_content, g.assembly
;
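The renames listed in the comments at the top of the query could be kept in one place on the Python side, so the views can keep using the old field names while the query returns the new ones. This is only a sketch: `OLD_TO_NEW` and `to_old_gene` are hypothetical names, the `start -> start` entry is an assumption (only `stop -> end` is listed above), and `canonical_transcript` is left out because it is still to be resolved.

```python
# Old-schema gene field -> new-schema column, following the comments in the query.
# None marks fields that no longer exist in the new schema.
OLD_TO_NEW = {
    "full_gene_name": "description",
    "gene_id": "ensembl_gene_id",
    "chrom": "chromosome",
    "start": "start",  # assumption: unchanged, since only stop -> end is listed
    "stop": "end",
    "gene_name": "hgnc_symbol",
    "gene_name_upper": "hgnc_symbol",  # upper-cased in to_old_gene
    "xstart": None,  # gone
    "xstop": None,  # gone
    # canonical_transcript: to be resolved, deliberately not mapped here
}


def to_old_gene(row):
    """Return a dict with old-schema field names built from a new-schema row."""
    old = {}
    for old_name, new_name in OLD_TO_NEW.items():
        if new_name is None:
            continue  # xstart / xstop have no counterpart
        value = row.get(new_name)
        if old_name == "gene_name_upper" and value is not None:
            value = value.upper()
        old[old_name] = value
    return old
```

Fields that are gone are simply absent from the result, so any view that still reads `xstart` or `xstop` would fail loudly instead of silently getting a wrong value.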
As seen above, we're going to have new attributes, but at least one is missing (possibly more): `canonical_transcript`.
I have several tables in the `ensembl` schema related to `transcript`, but I'm wondering which attributes you'd like to have, @pontikos.
ensembl.gene_transcript
gene | transcript |
---|---|
1 | 1 |
ensembl.transcript
identifier | ensembl_gene_id | ensembl_transcript_id | version | ensembl_peptide_id | peptide_version | chromosome | start | end | transcription_start_site | strand | transcript_length | cds_length | biotype | uniparc | assembly | canonical |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | ENSG00000261657 | ENST00000566782 | 1 | ENSP00000456546 | 1 | HG991_PATCH | 66119285 | 66456619 | 66119285 | 1 | 2673 | 825 | protein_coding | UPI000003615A | GRCh37 | true |
ensembl.transcript_exon
transcript | exon |
---|---|
1 | 1 |
ensembl.transcript_uniprot
transcript | uniprotswissprot |
---|---|
2 | P35070 |
I would take at least `ENSP`, `ENST` and `UniProtID`. But look at the tables, especially `ensembl.transcript`, and tell me if you'd like anything else.
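To make the proposal concrete, here is a rough sketch of how the canonical transcript with its `ENST`, `ENSP` and UniProt ID might be fetched for a gene. It runs against an in-memory SQLite database so it can be tried without the real one. The assumptions: the tables are unqualified (no `ensembl.` schema), only the columns needed here are created, the join keys follow the same pattern as the `gene_synonym` join above, and gene `1` is taken to be `ENSG00000261657` based on the sample rows. `canonical_transcripts` and `demo_connection` are hypothetical names.

```python
import sqlite3


def canonical_transcripts(conn, ensembl_gene_id):
    """Return (ENST, ENSP, UniProt) tuples for the canonical transcript(s) of a gene."""
    sql = """
        select t.ensembl_transcript_id, t.ensembl_peptide_id, tu.uniprotswissprot
        from gene g
        join gene_transcript gt on gt.gene = g.identifier
        join transcript t on t.identifier = gt.transcript
        -- left join: keep transcripts that have no UniProt entry
        left join transcript_uniprot tu on tu.transcript = t.identifier
        where g.ensembl_gene_id = ? and t.canonical = 1
    """
    return conn.execute(sql, (ensembl_gene_id,)).fetchall()


def demo_connection():
    """Build a reduced copy of the tables, filled with the sample rows shown above."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        create table gene (identifier integer primary key, ensembl_gene_id text);
        create table transcript (
            identifier integer primary key,
            ensembl_transcript_id text,
            ensembl_peptide_id text,
            canonical integer
        );
        create table gene_transcript (gene integer, transcript integer);
        create table transcript_uniprot (transcript integer, uniprotswissprot text);
        -- assumption: gene 1 is the gene of transcript 1
        insert into gene values (1, 'ENSG00000261657');
        insert into transcript values (1, 'ENST00000566782', 'ENSP00000456546', 1);
        insert into gene_transcript values (1, 1);
        insert into transcript_uniprot values (2, 'P35070');
    """)
    return conn


if __name__ == "__main__":
    print(canonical_transcripts(demo_connection(), "ENSG00000261657"))
```

With the sample rows, transcript `1` has no entry in `transcript_uniprot`, so the UniProt column comes back empty. That is why the sketch uses a left join there; an inner join would drop the transcript altogether.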
Files that need work:
views/gene.py
views/autocomplete.py
views/statistics.py