phenopolis / phenopolis_genomics_browser

Python API and React frontend for the Phenopolis Genomics Browser
https://dev-live.phenopolis.org
MIT License
31 stars 2 forks source link

some patients ids are in gene tables but do not exist #329

Open pontikos opened 3 years ago

pontikos commented 3 years ago

example if you go to gene

ENSG00000140505

then patient

PH0000043392

is listed in the variants but does not exist in patients table.

I'm guessing new schema will fix this as the gene API call will cross reference with the patients tables via the individual_variants table.

alanwilter commented 3 years ago

gene.py still uses the old schema and though we have NewGene db.model, the refactoring of this module will require several other linked tables.

And indeed, the data is wrong in the old schema. New schema should solve this.

alanwilter commented 3 years ago

@pontikos there's something I need to understand: for a given gene, get the variants associated BUT, e.g.: https://dev-live.phenopolis.org/gene/ENSG00000140505

BTW, I found redundant for the backend api to return the col genes for the variants objs.

pontikos commented 3 years ago

I think that's right the demo account should only be able to see 2-3 genes and only the see the variants ids not the individual ids.

alanwilter commented 3 years ago

For demo I can understand but and for other users? Using your example:

--- for a gene, get the variants 
select distinct ui."user", v.id from phenopolis.variant v 
join phenopolis.variant_gene vg on vg.variant_id = v.id
join phenopolis.individual_variant iv on iv.variant_id = vg.variant_id 
join phenopolis.individual i on i.id = iv.individual_id
join public.users_individuals ui on ui.internal_id = i.phenopolis_id and ui."user" <> 'Admin'
where vg.gene_id = 'ENSG00000140505'
;

Admin can see 28 variants, but other users would see just a few:

user        id
AliceDavidson   3518137
AliceDavidson   3518144
AliceDavidson   3518185
AliceDavidson   3518187
AliceDavidson   3518202
AliceDavidson   3518322
Tian        3518202
Tian        10981748
TonySegal   3518176
TS      3518176
UKIRDC      3518269

I'm still wondering if any other user, except for demo, should be able to see all variants. I would just be careful to show in HOM and HET cols only the patients the user is allowed to see.

pontikos commented 3 years ago

I see your point. On the gene page, users other than demo should be able to see all the variants (even those of patients they do not own). However they cannot go to the patient page of a patient they do not own.

alanwilter commented 3 years ago

Just to clarify, user "John_doe" would see all variants but won't see any of those patients it doesn't own, so not even a chip link will be shown.

pontikos commented 3 years ago

The way it works at the moment is that the patient chip is shown but the preivew you get says "you are not allowed to view this patient". And you cannot click through to the patient page.

However in the gene table all variants from all patients are shown, it's just that you can't information about the patients that you do not own, if that makes sense?

alanwilter commented 3 years ago

I tested that, the trouble is if you click on one of the patients you can't access, it takes you back to dashboard and you loose what you're doing. I thinks it's better omit the patients you can't access and yet see all the variants.

I'm collecting all the elements (columns), for the variants, that you have in the old schema and try to rebuild with new schema. So far I couldn't find, in the new schema: AF, AC, AN, HET, HOM, MLEAC, MLEAF, af_jirdc, af_tommo, af_krgdb, af_converge, af_hgvd, old ID On the other hand, in the new phenopolis.variant I have extra: new ID, dbsnp, variant_class, revel, fathom_score and in phenopolis.individual_variant, I have: zygosity, status, clinvar_id, pubmed_id, qd, comment I'm wondering which elements do you want to see the gene page to compose the columns for the variants associated.

I'm fetching data from kaviar and nomad tables and yet I have cadd table, old schema had no reference to cadd. Should we add cadd? Just af_cadd for example?

alanwilter commented 3 years ago

I see I can use phenopolis.individual_variant.zigosity to build HET and HOM.

alanwilter commented 3 years ago

@pontikos this is the query the working the best so far using the new schema:

select distinct 
v.*, iv.dp, iv."fs", iv.mq, iv.qd, iv."filter", --, iv.zygosity -- add individual_variant
vg.hgvs_c, vg.hgvs_p, vg.most_severe_consequence, vg.impact, vg.canonical, -- add variant_gene
(
    select array_agg(i.phenopolis_id)
    from phenopolis.individual i
    join public.users_individuals ui on ui.internal_id = i.phenopolis_id
    join phenopolis.individual_variant iv2 on iv2.individual_id = i.id and iv2.zygosity = 'HOM'
    where ui."user" = 'demo'
    and vg.variant_id = iv2.variant_id 
) as HOM,
(
    select array_agg(i.phenopolis_id)
    from phenopolis.individual i
    join public.users_individuals ui on ui.internal_id = i.phenopolis_id
    join phenopolis.individual_variant iv2 on iv2.individual_id = i.id and iv2.zygosity = 'HET'
    where ui."user" = 'demo'
    and vg.variant_id = iv2.variant_id 
) as HET
from phenopolis.variant v 
join phenopolis.variant_gene vg on vg.variant_id = v.id
join phenopolis.individual_variant iv on iv.variant_id = vg.variant_id 
join ensembl.gene g on g.ensembl_gene_id = vg.gene_id and g.assembly = 'GRCh37'
where vg.gene_id = 'ENSG00000144285'
order by v.chrom, v.pos
;

It gives the same results if user is Admin, AliceDavidson or demo with one critical difference: HOM and HET will show only the patients the users are allowed. And its Execution Time is ok (155.292 ms).

Now what I need from you is to decide which values (cols) do you want to send to the frontend. The table below shows what I got so far (Common), what is new and you may want to add (New schema) and what I didn't get so far (Old schema).

I do need help with the values from Old schema. For some of them I can build/derive it but, for most of them, I don't find where to get them.

I looked at e.g. kaviar table but I don't know how to join it, would it be with chrom+pos+ref+alt combination?

Old schema New schema Common
AF dbsnp chrom
AC variant_class pos
AN revel ref
HET_COUNT (can be calculated) fathmm_score alt
HOM_COUNT (can be calculated) qd dann
af_kaviar impact cadd_phred
af_gnomad_genomes canonical dp
af_jirdc fs
af_tommo mq
af_krgdb filter
af_converge hgvs_c
af_hgvd hgvs_p
gene_symbol (redundant) most_severe_consequence
MLEAC hom
MLEAF het
variant_id (can be composed)
id (different meaning) e.g. rs41279194 id (different meaning)*

* id | bigint | not null | nextval('phenopolis.variant_id_seq'::regclass)

alanwilter commented 3 years ago

Trying to get an idea about kaviar, cadd and gnomad, I did this:

select distinct
a.phred as phred_cadd, a.raw_score as raw_score_cadd, -- cadd : phred_cadd <> v.cadd_phred
av.ac as ac_gnomad, av.af  as af_gnomad, -- gnomad
ah.ac as ac_kaviar, ah.af as af_kaviar, ah.an, -- kaviar
v.*, iv.dp, iv."fs", iv.mq, iv.qd, iv."filter", --, iv.zygosity -- add individual_variant
vg.hgvs_c, vg.hgvs_p, vg.most_severe_consequence, vg.impact, vg.canonical, -- add variant_gene
(
    select array_agg(i.phenopolis_id)
    from phenopolis.individual i
    join public.users_individuals ui on ui.internal_id = i.phenopolis_id
    join phenopolis.individual_variant iv2 on iv2.individual_id = i.id and iv2.zygosity = 'HOM'
    where ui."user" = 'Admin'
    and vg.variant_id = iv2.variant_id 
) as HOM,
(
    select array_agg(i.phenopolis_id)
    from phenopolis.individual i
    join public.users_individuals ui on ui.internal_id = i.phenopolis_id
    join phenopolis.individual_variant iv2 on iv2.individual_id = i.id and iv2.zygosity = 'HET'
    where ui."user" = 'Admin'
    and vg.variant_id = iv2.variant_id 
) as HET
from phenopolis.variant v 
join phenopolis.variant_gene vg on vg.variant_id = v.id
join phenopolis.individual_variant iv on iv.variant_id = vg.variant_id 
join ensembl.gene g on g.ensembl_gene_id = vg.gene_id and g.assembly = 'GRCh37'
left outer join cadd.annotation a on a.chrom = v.chrom and a.pos = v.pos and a."ref" = v."ref" and a.alt = v.alt
left outer join gnomad.annotation_v3 av on av.chrom = v.chrom and av.pos = v.pos and av."ref" = v."ref" and av.alt = v.alt 
left outer join kaviar.annotation_hg19 ah on ah.chrom = v.chrom and ah.pos = v.pos and ah."ref" = v."ref" and ah.alt = v.alt
where vg.gene_id = 'ENSG00000144285'
order by v.chrom, v.pos

however, I can't explain why I have multiple row for kaviar e.g.:

phenopolis_dev_db=> select * from kaviar.annotation_hg19 where chrom='2' and pos=166854780 and "ref" = 'T' and alt = 'TA';
    id     | chrom |    pos    | ref | alt | ac  |    af     |  an   |
                         ds
-----------+-------+-----------+-----+-----+-----+-----------+-------+--------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
 244761705 | 2     | 166854780 | T   | TA  |  35 | 0.0013269 | 26378 | ISB_founders-Nge3
 244761706 | 2     | 166854780 | T   | TA  | 400 | 0.0151642 | 26378 | GS000011735|GS000011816|GS000015156|GS000015893|GS000016450|Inova_CGI_founders-Nge3|Inova_Illumina_founders-N
ge3|SSIP|UK10K|gonl|phase3-ASW|phase3-BEB|phase3-CEU|phase3-CLM|phase3-FIN|phase3-GBR|phase3-GIH|phase3-IBS|phase3-MXL|phase3-PJL|phase3-PUR|phase3-STU|phase3-TSI
(2 rows)

Because of potential multiple kaviar rows I got more than 452 rows expected.

pontikos commented 3 years ago

Thanks for all these investigations @alanwilter !

Regarding joining on other tables yes: chrom, pos, ref and alt is the composite foreign key to use.

Regarding multiple rows for kaviar: I wasn't aware of this issue. I think workaround is to only keep one row per variant. Either by selecting the larger of the two AC values or only keeping the one from the most sources (ds column).

alanwilter commented 3 years ago

I forgot to call your attention to phred_cadd, present in phenopolis.variant.phred_cadd and different from cadd.annotation.phred, which value is correct?

For kaviar, in my example above, ac correlates with ds, so it would be simpler to use ac, but what is it and what it stands for? I'm going to investigate a bit further on kaviar and possibly test cadd and gnomad for duplications.

But, still, these are all collateral issues, my main point is to define which elements we're going to expose to the frontend, see table above.

pontikos commented 3 years ago

If cadd.annotation.phred exists for a variant then use that. That table contains all the precomputed case scores for known variants. If a variant is known I think we can stick to using that table (I think you need to make sure we are using the correct build which is 37 in our case). Sometimes cadd score may be missing from that table (for indels) in which case we need to compute it. In that case it will probably end up in the variants table. However to be consistent I think we should maybe insert it as an extra row in the case.annotarion table ?

About your more general question of what fields are displayed, it sounds like we need to agree on what the returned json looks like?

On Sat, 10 Apr 2021, 07:20 Alan Silva, @.***> wrote:

I forgot to call your attention to phred_cadd, present in phenopolis.variant.phred_cadd and different from cadd.annotation.phred, which value is correct?

For kaviar, in my example above, ac correlates with ds, so it would be simpler to use ac, but what is it and what it stands for? I'm going to investigate a bit further on kaviar and possibly test cadd and gnomad for duplications.

But, still, these are all collateral issues, my main point is to define which elements we're going to expose to the frontend, see table above.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/phenopolis/phenopolis_browser/issues/329#issuecomment-817086924, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA5MN5HXRY675LEJGVPHLM3TH7UZ5ANCNFSM42RU5L5Q .

alanwilter commented 3 years ago

I've updated table Old x New schema for GENE web page. This is the piece of code in gene.py that uses the new schema and I'm trying to compare with old schema, using gene TTLL5:

def _get_variants_by_gene(gene_id):
    sqlq = sql.SQL(
        """
        select distinct v.chrom as "CHROM", v.pos as "POS", v."ref" as "REF", v.alt as "ALT", v.cadd_phred, v.dann,
        v.fathmm_score, v.revel, -- new added
        -- removed: v.id
        vg.most_severe_consequence, vg.hgvs_c as hgvsc, vg.hgvs_p as hgvsp, -- via variant_gene
        iv.dp as "DP", iv."fs" as "FS", iv.mq as "MQ", iv."filter" as "FILTER", -- via individual_variant
        (
            select array_agg(i.phenopolis_id)
            from phenopolis.individual i
            join phenopolis.individual_variant iv2 on iv2.individual_id = i.id and iv2.zygosity = 'HOM'
            where vg.variant_id = iv2.variant_id
        ) as "HOM",
        (
            select array_agg(i.phenopolis_id)
            from phenopolis.individual i
            join phenopolis.individual_variant iv2 on iv2.individual_id = i.id and iv2.zygosity = 'HET'
            where vg.variant_id = iv2.variant_id
        ) as "HET",
        (
            select distinct on (ah.chrom,ah.pos,ah."ref",ah.alt) ah.af from kaviar.annotation_hg19 ah
            where ah.chrom = v.chrom and ah.pos = v.pos and ah."ref" = v."ref" and ah.alt = v.alt
            order by ah.chrom,ah.pos,ah."ref",ah.alt,ah.ac desc
        ) as af_kaviar,
        av.af as af_gnomad_genomes -- gnomad
        -- deprecated: MLEAF, MLEAC
        -- not used: gene_symbol, gene_id
        -- need to be added (by Daniele): af_converge, af_hgvd, af_jirdc, af_krgdb, af_tommo,
        from phenopolis.variant v
        join phenopolis.variant_gene vg on vg.variant_id = v.id
        join phenopolis.individual_variant iv on iv.variant_id = vg.variant_id
        left outer join gnomad.annotation_v3 av
            on av.chrom = v.chrom and av.pos = v.pos and av."ref" = v."ref" and av.alt = v.alt
        where vg.gene_id = %s
        order by v.chrom, v.pos, v."ref", v.alt
    """
    )
    with get_db() as conn:
        with conn.cursor() as cur:
            cur.execute(sqlq, [gene_id])
            variants = cursor2dict(cur)
    for v in variants:
        v["variant_id"] = [{"display": f'{v["CHROM"]}-{v["POS"]}-{v["REF"]}-{v["ALT"]}'}]
        het = v["HET"] or []
        hom = v["HOM"] or []
        v["HET_COUNT"] = len(het)
        v["HOM_COUNT"] = len(hom)
        v["AC"] = v["HET_COUNT"] + 2 * v["HOM_COUNT"]
        v["AN"] = (v["HET_COUNT"] + v["HOM_COUNT"]) * 2
        v["AF"] = v["AC"] / v["AN"]

    return variants

First thing @pontikos is if you can tell me if the formulae are correct:

        v["AC"] = v["HET_COUNT"] + 2 * v["HOM_COUNT"]
        v["AN"] = (v["HET_COUNT"] + v["HOM_COUNT"]) * 2
        v["AF"] = v["AC"] / v["AN"]

Another thing is why phenopolis.individual_variant sometimes offer more than one row for a given variant (14-76098265-T-C which shows in new query but not in old schema), e.g.:

select distinct iv.variant_id, iv.chrom, iv.pos, iv."ref", iv.alt, iv.zygosity, iv.dp, iv.mq, iv.qd
from phenopolis.individual_variant iv 
where iv.chrom = '14' and iv.pos = 76098265 and iv."ref" = 'T' and iv.alt = 'C';

 variant_id | chrom |   pos    | ref | alt | zygosity |  dp   |  mq   |  qd
------------+-------+----------+-----+-----+----------+-------+-------+-------
    3959887 | 14    | 76098265 | T   | C   | HET      | 33676 | 56.54 | 13.76
    3959887 | 14    | 76098265 | T   | C   | HET      | 35864 | 55.11 | 13.83

while this is not the case in the old schema (all rows are unique):

select * from public.variants v where 
v.gene_id = 'ENSG00000119685' -- 564 variants
order by "CHROM", "POS", "REF", "ALT" ; 

In the end, I have only 95 (or 94, variant_id 10846523 -- 14-76114226-A-ATG -- is not in phenopolis.individual_variant, but it's not in public.het_variants or public.hom_variants so it may explain why) unique variants in the new schema (if I remove the multi-rows with different DPs for a given variant). And only few are common to old x new schema.

I added another sheet (Compare52) to table Old x New schema for GENE web page where I try to analyse the 52 variants I found in common (but note new schema may have multiple rows for a given variant). Looking there and only af_kaviar seems to be stable (but not a perfect match).

I understand that phenopolis.variant_gene may not be complete and the scripts in our repository don't show we did a "migration" on these data. The only mention I see about populating phenopolis.variant_gene is scripts/import_individual_variants.py which does "Import an individual's VAR.tsv file".

As for phenopolis.individual_variant it seems more reliable since we have scripts/import_individual_variants.py, which did "_Migrate variants data from public.hom_variants and public.het_variants to phenopolis.individual_variant_"

Bottom line: as I don't know much about phenopolis.variant_gene I can only hope my code, barring some blatant mistakes, is doing what it's supposed to do yet I can't compare with previous schema.

pontikos commented 3 years ago

This is correct:

        v["AC"] = v["HET_COUNT"] + 2 * v["HOM_COUNT"]
        v["AN"] = (v["HET_COUNT"] + v["HOM_COUNT"]) * 2
        v["AF"] = v["AC"] / v["AN"]

Re phenopolis.individual_variant: if variant is seen in two people then the the MQ, DP etc might have been different between the two people, but anyway we do not display this information to users