sgkit-dev / sgkit

simulate_genotype_call_dataset creates duplicate alleles #1221

Open hyanwong opened 5 months ago

hyanwong commented 5 months ago

For example, the same allele can appear twice at a single site in ds['variant_allele'] (two "T" values at site 6 in the run below):

import sgkit as sg
import numpy as np

ds = sg.simulate_genotype_call_dataset(n_variant=10, n_sample=4, missing_pct=0, phased=True, seed=1)
for i, alleles in enumerate(ds['variant_allele'].values):
    print(f"Site {i}: {alleles}")
    # Each site should list every allele at most once
    assert len(np.unique(alleles)) == len(alleles)

Fails on site 6:

Site 6: [b'T' b'T']
---------------------------------------------------------------------------
AssertionError

This can cause much confusion in downstream analysis. See https://github.com/tskit-dev/tsinfer/issues/927
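
Until this is fixed, a check along the following lines might help flag affected sites, and the second function sketches one possible way to draw distinct alleles per site. This is only a rough NumPy sketch: `sites_with_duplicate_alleles` and `draw_distinct_alleles` are hypothetical helper names, not part of sgkit, and it does not reflect how `simulate_genotype_call_dataset` actually generates alleles.

```python
import numpy as np


def sites_with_duplicate_alleles(variant_allele):
    """Return the indices of sites whose allele list repeats a value.

    `variant_allele` is assumed to be a 2D array-like of shape
    (n_variant, n_allele), e.g. ds['variant_allele'].values.
    """
    dup_sites = []
    for i, alleles in enumerate(variant_allele):
        alleles = list(alleles)
        # A set drops repeats, so a shorter set means a duplicate exists
        if len(set(alleles)) != len(alleles):
            dup_sites.append(i)
    return dup_sites


def draw_distinct_alleles(n_variant, n_allele=2, seed=None):
    """Hypothetical alternative: sample alleles without replacement per site."""
    rng = np.random.default_rng(seed)
    bases = np.array([b"A", b"C", b"G", b"T"])
    # replace=False guarantees the alleles drawn for one site are all different
    rows = [rng.choice(bases, size=n_allele, replace=False) for _ in range(n_variant)]
    return np.stack(rows)
```

For example, `sites_with_duplicate_alleles(ds['variant_allele'].values)` could be run on the dataset from the snippet above to list every affected site instead of stopping at the first failure.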