phenopolis / phenopolis_genomics_browser

Python API and React frontend for the Phenopolis Genomics Browser
https://dev-live.phenopolis.org
MIT License
31 stars 2 forks source link

Question about Variant in the schema and phenopolis.variant_gene #339

Closed alanwilter closed 3 years ago

alanwilter commented 3 years ago

@pontikos Working on a new query for variant using the new schema and I come across this situation, so why do we have that?:

select concat(v.chrom,'-',v.pos,'-',v."ref",'-',v.alt), vg.* 
from phenopolis.variant_gene vg
join phenopolis.variant v on v.id = vg.variant_id 
and v.chrom = '12' and v.pos = 7241974 and v."ref" = 'C' and v.alt = 'T';
Variant gene_id variant_id transcript_id strand exon most_severe_consequence impact hgvs_c hgvs_p canonical
12-7241974-C-T ENSG00000139178 3,115,048 ENST00000545280 -1 [NULL] missense_variant modifier [NULL] [NULL] [NULL]
12-7241974-C-T ENSG00000159403 3,115,048 ENST00000542285 -1 5/11 missense_variant moderate ENST00000542285.1:c.680G>A ENSP00000438615.1:p.Arg227Gln true

Should we use just the canonical one?

alanwilter commented 3 years ago

And there's also the other side of this coin. Taking another example, variant 7-2303057-G-A:

CHROM POS ID REF ALT AF AC AN HET_COUNT HOM_COUNT DP FS MLEAC MLEAF MQ FILTER HET HOM most_severe_consequence af_kaviar af_gnomad_genomes af_jirdc af_tommo af_krgdb af_converge af_hgvd gene_symbol hgvsc hgvsp dann cadd_phred gene_id variant_id
7 2303057 . G A 0.00013 1 9148 1 0 98640 NA NA NA NA PASS ["PH00008256"] [] intron_variant 0.0006824 6.45328e-05 0 0 0 0 0 SNX8 ENST00000222990.3:c.783-60C>T 0.54 0.121 ENSG00000106266 3788690
select v.*  from phenopolis.variant v 
where v.chrom = '7' and v.pos = 2303057 and v."ref" = 'G' and v.alt = 'A';
chrom pos ref alt id
7 2303057 G A 1435584

if I join to phenopolis.variant_gene, it will return nothing, unless I do a left outer join:

select concat(v.chrom,'-',v.pos,'-',v."ref",'-',v.alt), vg.* 
from phenopolis.variant v 
left outer 
join phenopolis.variant_gene vg on v.id = vg.variant_id 
where v.chrom = '7' and v.pos = 2303057 and v."ref" = 'G' and v.alt = 'A';

yet, full of NULL cells:

concat gene_id variant_id transcript_id strand exon most_severe_consequence impact hgvs_c hgvs_p canonical
7-2303057-G-A

So, while phenopolis.variant has 4,784,386 rows, phenopolis.variant_gene has only 968,599 rows. This essentially means that, e.g., https://dev-live.phenopolis.org/individual/PH00008256, one may see 1350 rows in RARE VARIANT tab, using the old schema, in the new schema, because of the limitations in phenopolis.variant_gene one may only see 4 rows*:

zygosity genes CHROM POS REF ALT cadd_phred dann fathmm_score revel most_severe_consequence hgvsc hgvsp DP FS MQ FILTER HOM HET af_kaviar af_gnomad_genomes
HET {PTCHD2@ENSG00000204624} 1 11586591 C T intron_variant ENST00000294484.6:c.2614-117C>T {PH00002451,PH00006967,PH00008256} 0.0006824
HET {PPARGC1A@ENSG00000109819} 4 23830299 G A intron_variant ENST00000264867.2:c.553-72C>T {PH00000502} {PH00000168,PH00002486,PH00002868,PH00004809,PH00008256} 0.0013648
HET {C1R@ENSG00000159403,C1RL@ENSG00000139178} 12 7241974 C T missense_variant ENST00000542285.1:c.680G>A ENSP00000438615.1:p.Arg227Gln {PH00008256,PH00008642} 0.0000129
HET {PRPF31@ENSG00000105618} 19 54625343 G A intron_variant ENST00000321030.4:c.322+21G>A {PH00002579,PH00003057} {PH00001950,PH00002142,PH00002451,PH00002743,PH00008256} 0.0013826
Tentative query for variants in the new schema: ```sql select iv.zygosity, array_agg(distinct concat(g.hgnc_symbol,'@',g.ensembl_gene_id)) as genes, -- gene_symbol, gene_id v.chrom as "CHROM", v.pos as "POS", v."ref" as "REF", v.alt as "ALT", v.cadd_phred, v.dann, v.fathmm_score, v.revel, -- new added -- removed: v.id vg.most_severe_consequence, string_agg(distinct vg.hgvs_c,',' order by vg.hgvs_c) as hgvsc, string_agg(distinct vg.hgvs_p,',' order by vg.hgvs_p) as hgvsp, -- via variant_gene iv.dp as "DP", iv."fs" as "FS", iv.mq as "MQ", iv."filter" as "FILTER", -- via individual_variant ( select array_agg(i.phenopolis_id order by i.id) from phenopolis.individual i join phenopolis.individual_variant iv2 on iv2.individual_id = i.id and iv2.zygosity = 'HOM' where v.id = iv2.variant_id ) as "HOM", ( select array_agg(i.phenopolis_id order by i.id) from phenopolis.individual i join phenopolis.individual_variant iv2 on iv2.individual_id = i.id and iv2.zygosity = 'HET' where v.id = iv2.variant_id ) as "HET", ( select distinct on (ah.chrom,ah.pos,ah."ref",ah.alt) ah.af from kaviar.annotation_hg19 ah where ah.chrom = v.chrom and ah.pos = v.pos and ah."ref" = v."ref" and ah.alt = v.alt order by ah.chrom,ah.pos,ah."ref",ah.alt,ah.ac desc ) as af_kaviar, av.af as af_gnomad_genomes -- gnomad -- deprecated: MLEAF, MLEAC -- need to be added (by Daniele): af_converge, af_hgvd, af_jirdc, af_krgdb, af_tommo, from phenopolis.variant v join phenopolis.individual_variant iv on iv.variant_id = v.id join phenopolis.individual i2 on i2.id = iv.individual_id --left outer join phenopolis.variant_gene vg on vg.variant_id = v.id --and vg.canonical-- variant_gene not complete? --left outer join ensembl.gene g on vg.gene_id = g.ensembl_gene_id and g.assembly = 'GRCh37' and g.chromosome ~ '^X|^Y|^[0-9]{1,2}' left outer join gnomad.annotation_v3 av on av.chrom = v.chrom and av.pos = v.pos and av."ref" = v."ref" and av.alt = v.alt --where v.chrom = '12' and v.pos = 7241974 and v."ref" = 'C' and v.alt = 'T' -- 2 rows --where v.chrom = '7' and v.pos = 2303057 and v."ref" = 'G' and v.alt = 'A' -- 0 row where i2.phenopolis_id = 'PH00008256' --where vg.gene_id = 'ENSG00000144285' -- test with gene_id group by zygosity,"CHROM","POS","REF","ALT",cadd_phred,dann,fathmm_score,revel,most_severe_consequence,"DP","FS","MQ","FILTER","HOM","HET",af_kaviar,af_gnomad_genomes order by iv.zygosity desc, substring(v.chrom FROM '([0-9]+)')::int, v.pos, v."ref", v.alt ; ```

Hence, the 2nd question: Should we show only what we currently have or should we try to replicate as much as possible the old schema (then I'd add left outer to the query)?

alanwilter commented 3 years ago

Closing it, there's not really an issue here.