It's pretty helpful for teaching purposes to be able to demonstrate small tsinfer input files without needing a VCF. I wonder if we could define a class method like this:
class SgkitSampleData:
...
def create_from_arrays(
cls,
path,
variant_positions,
variant_matrix_phased,
alleles,
ancestral_allele,
sample_id=None,
):
"""
Create an SgkitSampleData instance directly from array data. Only
useful for small datasets. Larger datasets should use e.g. bio2zarr
to create a zarr datastore containing the required data and call
SgkitSampleData(path_to_zarr)
:param path str: The path used to store the data
:param variant_positions array: a 1D array of variant positions
:param variant_matrix_phased array: a 3D array of variants X samples x ploidy,
giving an index into the allele array for each corresponding variant. Values must
be coercable into 8-bit (np.int8) integers. Data for all samples is assumed to be phased
:param alleles array: a 2D string array of variants x max_num_alleles at a site.
Each allele list for a variant must be the same length, equal to the
maximum value in the `variant_matrix_phased` array. Each allele list can be
padded with `""` to ensure the correct length
:param ancestral_allele array: a 1D string array specifying the ancestral allele
for each variant. For unknown ancestral alleles, any character which is not
in the allele list can be used.
:param sample_id: a 1D string array of sample names. If None, samples
(each corresponding to `ploidy` variant values) will be allocated the IDs
"0", "1", "2", .. etc.
"""
call_genotype=np.array(variant_matrix_phased, dtype=np.int8)
if sample_id is None:
sample_id = np.arange(call_genotype.shape[1]).astype(str)
# assume all phased
call_genotype_phased = np.ones(call_genotype.shape[:2], dtype=bool)
zarr.save_group(
path,
variant_position=np.array(pos),
call_genotype=call_genotype,
call_genotype_phased=call_genotype_phased,
variant_allele=np.array(alleles),
variant_ancestral_allele=np.array(ancestral_allele),
sample_id=sample_id,
)
return cls(path)
It's pretty helpful for teaching purposes to be able to demonstrate small tsinfer input files without needing a VCF. I wonder if we could define a class method like this:
Then we could use it like this: