Open hyanwong opened 5 months ago
I guess another possibility would be to provide an input CSV file (or Zarr) which is formatted with position, ancestral_allele, variant_age
etc, which then populates the arrays appropriately.
Good point. This could even be a .npz file with the appropriately named variables. I guess that's basically the same as another zarr file. Probably best if @benjeffery weighs in with what he thinks would work best.
Here's a snippet from the docs I am trying to write
vcf2zarr explode demo.vcf.gz /tmp/demo.exploded vcf2zarr encode /tmp/demo.exploded demo.vcz # Here: how can I specify the ancestral state via a simple CLI command??? tsinfer generate-ancestors demo.vcz # saves to demo.ancestors tsinfer match-ancestors demo.vcz # saves to demo.ancestors.trees tsinfer match-samples demo.vcz # saves to demo.trees
To parallelise the vcf2zarr steps you may wish to explore the
--worker-processes
option, or even split over partitions. To parallelise the tsinfer steps you may wish to investigate the--progress
and--num_threads
options.
Thoughts about how to specify on the command-line to use variant_AA
as ancestral state, given a demo VCF with ancestral alleles embedded in the AA
field, would be most welcome.
A JSON or yaml config file specifying inference parameters?
It appears as if it's not possible to save an SgkitSampleData instance to a path. I'm not sure, therefore, how I might run the CLI on an zarr file, if I've specified bespoke masks, ancestral alleles, or wherever via numpy arrays (see #923)
I guess the easy way around this is to have a function that saves all the information such as bespoke masks / ancestral alleles into the zarr file (or make a copy of it if the original zarr is read-only?), and allow the CLI to run directly on that modified zarr:
A more complex possibility for the user is to have the CLI accept the same parameters as
tsinfer.SgkitSampleData(...)
, but then we might want to allow either numpy files or names in the .vcz file, which seems a bit icky, e.g.