tskit-dev / tsinfer

Infer a tree sequence from genetic variation data.
GNU General Public License v3.0
54 stars 13 forks source link

Check all alleles at a site are unique before inference #927

Open hyanwong opened 1 month ago

hyanwong commented 1 month ago

I'm finding that tsinfer isn't correctly inferring the genotypes for Sgkit files in the trivial instance below. This seems pretty worrying. What do I have wrong?

import sgkit as sg
import tsinfer

ds = sg.simulate_genotype_call_dataset(n_variant=10, n_sample=4, missing_pct=0, phased=True, seed=1)
ds['variant_allele'] = ds['variant_allele'].astype(str) # hack around https://github.com/tskit-dev/tsinfer/issues/810
ds.update({'variant_ancestral_allele': ds['variant_allele'][:,0]})
ds.to_zarr('/tmp/ds.zarr', mode='w')
sd = tsinfer.SgkitSampleData('/tmp/ds.zarr')
ts = tsinfer.infer(sd)
print("Inferred_genos    Zarr_genotypes    SampleData_genos  Ts alleles")
for v, sgv, sdv in zip(ts.variants(), ds.call_genotype, sd.sites_genotypes):
    bad = "<<<< BAD!!!" if not (v.genotypes == sdv).all() else ""
    print(v.genotypes, sgv.values.flatten(), sdv, v.alleles, bad)

Shows that the 7th and 8th site are incorrect:

Inferred_genos    Zarr_genotypes    SampleData_genos  Ts alleles
[1 0 1 0 1 0 1 1] [1 0 1 0 1 0 1 1] [1 0 1 0 1 0 1 1] ('G', 'C') 
[0 1 1 0 0 0 0 0] [0 1 1 0 0 0 0 0] [0 1 1 0 0 0 0 0] ('G', 'A') 
[1 0 1 0 1 1 0 0] [1 0 1 0 1 1 0 0] [1 0 1 0 1 1 0 0] ('T', 'A') 
[1 0 1 1 1 0 1 1] [1 0 1 1 1 0 1 1] [1 0 1 1 1 0 1 1] ('G', 'A') 
[1 1 1 1 0 1 0 0] [1 1 1 1 0 1 0 0] [1 1 1 1 0 1 0 0] ('C', 'G') 
[0 0 1 1 1 0 0 1] [0 0 1 1 1 0 0 1] [0 0 1 1 1 0 0 1] ('G', 'A') 
[0 0 0 0 0 0 0 0] [0 1 0 1 1 0 0 1] [0 1 0 1 1 0 0 1] ('T',) <<<< BAD!!!
[0 0 0 0 0 0 0 0] [1 0 0 0 0 1 0 1] [1 0 0 0 0 1 0 1] ('C',) <<<< BAD!!!
[0 1 0 1 1 1 1 1] [0 1 0 1 1 1 1 1] [0 1 0 1 1 1 1 1] ('T', 'G') 
[0 0 1 1 0 1 1 0] [0 0 1 1 0 1 1 0] [0 0 1 1 0 1 1 0] ('A', 'G') 
jeromekelleher commented 1 month ago

Can you show the zarr alleles fields as well please (and ancestestral state?)

hyanwong commented 1 month ago

Aha! The ds['variant_allele'] array created by sg.simulate_genotype_call_dataset is wrong: it has duplicate allele states (e.g. two Ts or two Cs)

ds = sg.simulate_genotype_call_dataset(n_variant=10, n_sample=4, missing_pct=0, phased=True, seed=1)
print(ds['variant_allele'])
ds['variant_allele'] = ds['variant_allele'].astype(str) # hack around https://github.com/tskit-dev/tsinfer/issues/810
print(ds['variant_allele'])
ds.update({'variant_ancestral_allele': ds['variant_allele'][:,0]})
ds.to_zarr('/tmp/ds.zarr', mode='w')
sd = tsinfer.SgkitSampleData('/tmp/ds.zarr')
ts = tsinfer.infer(sd)
print("Inferred_genos    Zarr_genotypes    SampleData_genos  SD alleles")
for v, sgv, sdv, sda, sdaa in zip(
    ts.variants(), ds.call_genotype, sd.sites_genotypes, sd.sites_alleles, sd.sites_ancestral_allele, 
):
    bad = "<<<< BAD!!!" if not (v.genotypes == sdv).all() else ""
    print(v.genotypes, sgv.values.flatten(), sdv, sda, sdaa, bad)
<xarray.DataArray 'variant_allele' (variants: 10, alleles: 2)> Size: 20B
array([[b'G', b'C'],
       [b'G', b'A'],
       [b'T', b'A'],
       [b'G', b'A'],
       [b'C', b'G'],
       [b'G', b'A'],
       [b'T', b'T'],
       [b'C', b'C'],
       [b'T', b'G'],
       [b'A', b'G']], dtype='|S1')
Dimensions without coordinates: variants, alleles
Attributes:
    comment:  The possible alleles for the variant.
<xarray.DataArray 'variant_allele' (variants: 10, alleles: 2)> Size: 80B
array([['G', 'C'],
       ['G', 'A'],
       ['T', 'A'],
       ['G', 'A'],
       ['C', 'G'],
       ['G', 'A'],
       ['T', 'T'],
       ['C', 'C'],
       ['T', 'G'],
       ['A', 'G']], dtype='<U1')
Dimensions without coordinates: variants, alleles
Attributes:
    comment:  The possible alleles for the variant.
Inferred_genos    Zarr_genotypes    SampleData_genos  SD alleles
[1 0 1 0 1 0 1 1] [1 0 1 0 1 0 1 1] [1 0 1 0 1 0 1 1] ['G' 'C'] 0 
[0 1 1 0 0 0 0 0] [0 1 1 0 0 0 0 0] [0 1 1 0 0 0 0 0] ['G' 'A'] 0 
[1 0 1 0 1 1 0 0] [1 0 1 0 1 1 0 0] [1 0 1 0 1 1 0 0] ['T' 'A'] 0 
[1 0 1 1 1 0 1 1] [1 0 1 1 1 0 1 1] [1 0 1 1 1 0 1 1] ['G' 'A'] 0 
[1 1 1 1 0 1 0 0] [1 1 1 1 0 1 0 0] [1 1 1 1 0 1 0 0] ['C' 'G'] 0 
[0 0 1 1 1 0 0 1] [0 0 1 1 1 0 0 1] [0 0 1 1 1 0 0 1] ['G' 'A'] 0 
[0 0 0 0 0 0 0 0] [0 1 0 1 1 0 0 1] [0 1 0 1 1 0 0 1] ['T' 'T'] 0 <<<< BAD!!!
[0 0 0 0 0 0 0 0] [1 0 0 0 0 1 0 1] [1 0 0 0 0 1 0 1] ['C' 'C'] 0 <<<< BAD!!!
[0 1 0 1 1 1 1 1] [0 1 0 1 1 1 1 1] [0 1 0 1 1 1 1 1] ['T' 'G'] 0 
[0 0 1 1 0 1 1 0] [0 0 1 1 0 1 1 0] [0 0 1 1 0 1 1 0] ['A' 'G'] 0 

Edit - reported as a bug in https://github.com/sgkit-dev/sgkit/issues/1221

hyanwong commented 1 month ago

I didn't realise this could happen. Perhaps it's worth bugging out if there are duplicated allelic states for a site, as this is pretty confusing.

jeromekelleher commented 1 month ago

Nothing actually checks this I guess, so worth detecting and bugging out here.

Clearly a bug in the sgkit simulator, though.

benjeffery commented 1 month ago

Phew, had me worried there for a moment! As Jerome says we should error out on this condition.

hyanwong commented 1 month ago

I changed the name of the issue to reflect what needs doing.