tskit-dev / tsinfer

Infer a tree sequence from genetic variation data.
GNU General Public License v3.0
54 stars 13 forks source link

Class method to create simple SgkitSampleData files for demos #924

Open hyanwong opened 2 months ago

hyanwong commented 2 months ago

It's pretty helpful for teaching purposes to be able to demonstrate small tsinfer input files without needing a VCF. I wonder if we could define a class method like this:

class SgkitSampleData:
    ...
    def create_from_arrays(
        cls,
        path,
        variant_positions,
        variant_matrix_phased,
        alleles,
        ancestral_allele,
        sample_id=None,
    ):
        """
        Create an SgkitSampleData instance directly from array data. Only
        useful for small datasets. Larger datasets should use e.g. bio2zarr
        to create a zarr datastore containing the required data and call
        SgkitSampleData(path_to_zarr)

        :param path str: The path used to store the data
        :param variant_positions array: a 1D array of variant positions
        :param variant_matrix_phased array: a 3D array of variants X samples x ploidy,
            giving an index into the allele array for each corresponding variant. Values must
            be coercable into 8-bit (np.int8) integers. Data for all samples is assumed to be phased
        :param alleles array: a 2D string array of variants x max_num_alleles at a site.
            Each allele list for a variant must be the same length, equal to the
            maximum value in the `variant_matrix_phased` array. Each allele list can be
            padded with `""` to ensure the correct length
        :param ancestral_allele array: a 1D string array specifying the ancestral allele
            for each variant. For unknown ancestral alleles, any character which is not
            in the allele list can be used.
        :param sample_id: a 1D string array of sample names. If None, samples
            (each corresponding to `ploidy` variant values) will be allocated the IDs
            "0", "1", "2", .. etc.
        """
        call_genotype=np.array(variant_matrix_phased, dtype=np.int8)
        if sample_id is None:
            sample_id = np.arange(call_genotype.shape[1]).astype(str)
        # assume all phased
        call_genotype_phased = np.ones(call_genotype.shape[:2], dtype=bool)
        zarr.save_group(
            path,
            variant_position=np.array(pos),
            call_genotype=call_genotype,
            call_genotype_phased=call_genotype_phased,
            variant_allele=np.array(alleles),
            variant_ancestral_allele=np.array(ancestral_allele),
            sample_id=sample_id,
        )
        return cls(path)

Then we could use it like this:

# For demo only: larger examples could pass `data` as a large numpy array directly
pos, data, alleles, ancestral = list(zip(*[
    (3,  [[0, 1], [0, 0], [0, 0]], ["A", "T", ""], "A"),
    (10, [[0, 1], [1, 1], [0, 0]], ["C", "A", ""], "C"),
    (13, [[0, 1], [1, 0], [0, 0]], ["G", "C", ""], "-"),
    (19, [[0, 0], [0, 1], [1, 0]], ["A", "C", ""], "A"),
    (20, [[0, 1], [2, 0], [0, 0]], ["T", "G", "C"], "T"),
]))
sd = tsinfer.SgkitSampleData.create_from_arrays("test.vcz", pos, data, alleles, ancestral)
ts = tsinfer.infer(sd)
jeromekelleher commented 2 months ago

We're going to need something like this for testing anyway - even better if it was just an in-memory store.