Open hyanwong opened 1 month ago
Can you show the zarr alleles fields as well please (and ancestestral state?)
Aha! The ds['variant_allele']
array created by sg.simulate_genotype_call_dataset
is wrong: it has duplicate allele states (e.g. two Ts or two Cs)
ds = sg.simulate_genotype_call_dataset(n_variant=10, n_sample=4, missing_pct=0, phased=True, seed=1)
print(ds['variant_allele'])
ds['variant_allele'] = ds['variant_allele'].astype(str) # hack around https://github.com/tskit-dev/tsinfer/issues/810
print(ds['variant_allele'])
ds.update({'variant_ancestral_allele': ds['variant_allele'][:,0]})
ds.to_zarr('/tmp/ds.zarr', mode='w')
sd = tsinfer.SgkitSampleData('/tmp/ds.zarr')
ts = tsinfer.infer(sd)
print("Inferred_genos Zarr_genotypes SampleData_genos SD alleles")
for v, sgv, sdv, sda, sdaa in zip(
ts.variants(), ds.call_genotype, sd.sites_genotypes, sd.sites_alleles, sd.sites_ancestral_allele,
):
bad = "<<<< BAD!!!" if not (v.genotypes == sdv).all() else ""
print(v.genotypes, sgv.values.flatten(), sdv, sda, sdaa, bad)
<xarray.DataArray 'variant_allele' (variants: 10, alleles: 2)> Size: 20B
array([[b'G', b'C'],
[b'G', b'A'],
[b'T', b'A'],
[b'G', b'A'],
[b'C', b'G'],
[b'G', b'A'],
[b'T', b'T'],
[b'C', b'C'],
[b'T', b'G'],
[b'A', b'G']], dtype='|S1')
Dimensions without coordinates: variants, alleles
Attributes:
comment: The possible alleles for the variant.
<xarray.DataArray 'variant_allele' (variants: 10, alleles: 2)> Size: 80B
array([['G', 'C'],
['G', 'A'],
['T', 'A'],
['G', 'A'],
['C', 'G'],
['G', 'A'],
['T', 'T'],
['C', 'C'],
['T', 'G'],
['A', 'G']], dtype='<U1')
Dimensions without coordinates: variants, alleles
Attributes:
comment: The possible alleles for the variant.
Inferred_genos Zarr_genotypes SampleData_genos SD alleles
[1 0 1 0 1 0 1 1] [1 0 1 0 1 0 1 1] [1 0 1 0 1 0 1 1] ['G' 'C'] 0
[0 1 1 0 0 0 0 0] [0 1 1 0 0 0 0 0] [0 1 1 0 0 0 0 0] ['G' 'A'] 0
[1 0 1 0 1 1 0 0] [1 0 1 0 1 1 0 0] [1 0 1 0 1 1 0 0] ['T' 'A'] 0
[1 0 1 1 1 0 1 1] [1 0 1 1 1 0 1 1] [1 0 1 1 1 0 1 1] ['G' 'A'] 0
[1 1 1 1 0 1 0 0] [1 1 1 1 0 1 0 0] [1 1 1 1 0 1 0 0] ['C' 'G'] 0
[0 0 1 1 1 0 0 1] [0 0 1 1 1 0 0 1] [0 0 1 1 1 0 0 1] ['G' 'A'] 0
[0 0 0 0 0 0 0 0] [0 1 0 1 1 0 0 1] [0 1 0 1 1 0 0 1] ['T' 'T'] 0 <<<< BAD!!!
[0 0 0 0 0 0 0 0] [1 0 0 0 0 1 0 1] [1 0 0 0 0 1 0 1] ['C' 'C'] 0 <<<< BAD!!!
[0 1 0 1 1 1 1 1] [0 1 0 1 1 1 1 1] [0 1 0 1 1 1 1 1] ['T' 'G'] 0
[0 0 1 1 0 1 1 0] [0 0 1 1 0 1 1 0] [0 0 1 1 0 1 1 0] ['A' 'G'] 0
Edit - reported as a bug in https://github.com/sgkit-dev/sgkit/issues/1221
I didn't realise this could happen. Perhaps it's worth bugging out if there are duplicated allelic states for a site, as this is pretty confusing.
Nothing actually checks this I guess, so worth detecting and bugging out here.
Clearly a bug in the sgkit simulator, though.
Phew, had me worried there for a moment! As Jerome says we should error out on this condition.
I changed the name of the issue to reflect what needs doing.
I'm finding that tsinfer isn't correctly inferring the genotypes for Sgkit files in the trivial instance below. This seems pretty worrying. What do I have wrong?
Shows that the 7th and 8th site are incorrect: