tskit-dev / tsinfer

Infer a tree sequence from genetic variation data.
GNU General Public License v3.0
56 stars 13 forks source link

How to save an SgkitSampleData instance, e.g. for running the CLI #925

Open hyanwong opened 5 months ago

hyanwong commented 5 months ago

It appears as if it's not possible to save an SgkitSampleData instance to a path. I'm not sure, therefore, how I might run the CLI on an zarr file, if I've specified bespoke masks, ancestral alleles, or wherever via numpy arrays (see #923)

I guess the easy way around this is to have a function that saves all the information such as bespoke masks / ancestral alleles into the zarr file (or make a copy of it if the original zarr is read-only?), and allow the CLI to run directly on that modified zarr:

tsinfer infer demo.vcz -O demo.trees

A more complex possibility for the user is to have the CLI accept the same parameters as tsinfer.SgkitSampleData(...), but then we might want to allow either numpy files or names in the .vcz file, which seems a bit icky, e.g.

tsinfer infer demo.vcz --variant_mask my_vmask_file.npy --ancestral_allele variant_AA -O demo.trees
jeromekelleher commented 5 months ago

I guess another possibility would be to provide an input CSV file (or Zarr) which is formatted with position, ancestral_allele, variant_age etc, which then populates the arrays appropriately.

hyanwong commented 5 months ago

Good point. This could even be a .npz file with the appropriately named variables. I guess that's basically the same as another zarr file. Probably best if @benjeffery weighs in with what he thinks would work best.

hyanwong commented 5 months ago

Here's a snippet from the docs I am trying to write

vcf2zarr explode demo.vcf.gz /tmp/demo.exploded
vcf2zarr encode /tmp/demo.exploded demo.vcz
# Here: how can I specify the ancestral state via a simple CLI command???
tsinfer generate-ancestors demo.vcz  # saves to demo.ancestors
tsinfer match-ancestors demo.vcz  # saves to demo.ancestors.trees
tsinfer match-samples demo.vcz  # saves to demo.trees

To parallelise the vcf2zarr steps you may wish to explore the --worker-processes option, or even split over partitions. To parallelise the tsinfer steps you may wish to investigate the --progress and --num_threads options.

Thoughts about how to specify on the command-line to use variant_AA as ancestral state, given a demo VCF with ancestral alleles embedded in the AA field, would be most welcome.

jeromekelleher commented 5 months ago

A JSON or yaml config file specifying inference parameters?