phenopolis / phenopolis_genomics_browser

Python API and React frontend for the Phenopolis Genomics Browser
https://dev-live.phenopolis.org
MIT License
31 stars 2 forks source link

Issues with phenopolis.individual_variant table #342

Closed alanwilter closed 3 years ago

alanwilter commented 3 years ago

Take https://dev-live.phenopolis.org/individual/PH00000166, which has 1,565 rare variants.

The query in the old schema (Q1):

select count(pv.*)
from public.het_variants hv 
join public.variants pv
    on (hv."CHROM", hv."POS", hv."REF", hv."ALT") = (pv."CHROM", pv."POS", pv."REF", pv."ALT")
join public.individuals ind on ind.internal_id = hv.individual
where hv.individual = 'PH00000166'
--and (pv.hgvsc, pv.hgvsp) != ('', '') -- filter
;

would give the number above and, if applying the commented filter above, as seen in scripts/migrate_variants.sh, this would give 1,252 rows.

Now using the new schema, the equivalent query (Q2):

select iv.zygosity,count(*) from phenopolis.variant v 
join phenopolis.individual_variant iv on iv.variant_id = v.id 
where iv.individual_id = 166
group by iv.zygosity 

would give precisely the same 1,252 for HET.

However, when trying with PH00000168 something is very off. Using Q1 we got 6,432 (7,026 without filter) but when using Q2, it gives 68,770 !!!

Analysing script scripts/migrate_individual_variants.sh, the one that populates phenopolis.individual_variant:

select zig, type, count(*) from (

select ind.internal_id, ind.external_id, pv.variant_id, pv."CHROM", pv."POS", pv."REF", pv."ALT", 'HOM' as zig, 'internal_id' as "type"
from public.hom_variants hv 
join public.variants pv
    on (hv."CHROM", hv."POS", hv."REF", hv."ALT") = (pv."CHROM", pv."POS", pv."REF", pv."ALT")
join public.individuals ind on ind.internal_id = hv.individual
where (pv.hgvsc, pv.hgvsp) != ('', '')

union

select ind.internal_id, ind.external_id, pv.variant_id, pv."CHROM", pv."POS", pv."REF", pv."ALT", 'HOM' as zig, 'external_id' as "type"
from public.hom_variants hv 
join public.variants pv
    on (hv."CHROM", hv."POS", hv."REF", hv."ALT") = (pv."CHROM", pv."POS", pv."REF", pv."ALT")
join public.individuals ind on ind.external_id = hv.individual
where (pv.hgvsc, pv.hgvsp) != ('', '')

union

select ind.internal_id, ind.external_id, pv.variant_id, pv."CHROM", pv."POS", pv."REF", pv."ALT", 'HET' as zig, 'internal_id' as "type"
from public.het_variants hv 
join public.variants pv
    on (hv."CHROM", hv."POS", hv."REF", hv."ALT") = (pv."CHROM", pv."POS", pv."REF", pv."ALT")
join public.individuals ind on ind.internal_id = hv.individual
where (pv.hgvsc, pv.hgvsp) != ('', '')

union

select ind.internal_id, ind.external_id, pv.variant_id, pv."CHROM", pv."POS", pv."REF", pv."ALT", 'HET' as zig, 'external_id' as "type"
from public.het_variants hv 
join public.variants pv
    on (hv."CHROM", hv."POS", hv."REF", hv."ALT") = (pv."CHROM", pv."POS", pv."REF", pv."ALT")
join public.individuals ind on ind.external_id = hv.individual
where (pv.hgvsc, pv.hgvsp) != ('', '')

) as foo
where internal_id = 'PH00000168'
group by zig, type
;
zig type count
HET internal_id 6,432
HOM internal_id 330

I see it should have generated the correct rows for PH00000168 in phenopolis.individual_variant but it clearly not the case.

Now I'm trying to find out what went wrong. Audit tables (how?), logs (where?).

My understanding is that scripts/migrate_individual_variants.sh should have created: zigosity count
HET 5,151,780
HOM 302,287

and not

select iv.zygosity, count(*) from phenopolis.individual_variant iv
group by iv.zygosity;
zygosity count
HET 7,372,330
HOM 1,559,814

BTW, contrary to the note seen in scripts/migrate_individual_variants.sh:

# Note: Some dbs have external_id in the variant's individual colummn, some # have the internal db, so import both the "styles".

I haven't seen any case for public.individuals ind on ind.external_id = hv.individual.

alanwilter commented 3 years ago

I'm closing it. It seems to be fine, we just need to make sure the frontend start to make use of attributes limit and offset and when sending a api request to limit to, say, 1000 rows.