Open hyanwong opened 2 months ago
Time units should probably be a parameter somewhere, and not something we expect to be in the Zarr dataset?
It's unlikely that sites_time
or individuals_time
is ever actually going be something derived from a VCF, so I think we can just see them as numpy arrays that need to be specified to the functions where they are used.
Yes, I agree. I think that means we need to add the individuals_time
parameter to the VariantData
constructor, in the same way as for sites_time. I assume that we add time_units
at the same time.
Currently we take
individuals_time
from the zarr store, if we want to be able to use historical samples. Do we want to be able to specify that on VariantData creation as well (likesites_time
), or are we treating this as an immutable property of the zarr dataset?The argument for this is that there's no standard way to provide the date of historical samples in a read-only dataset. This is sometimes in one of the sample metadata fields in the VCF, but we would probably want to allow the user to define the name of the field in their case. To match the VCF-Zarr terminology, we probably want to call this parameter
sample_time
(saving it internally in an "individuals_time" array).Related to this, how do we specify the
time_units
in tree sequence produced by the new inference process? I assume this too could be specified on creation of the VariantData object. I have changed my mind recently on what thetime_units
should refer to: thesites_time
is only really needed to provide a relative order, so I think even if asites_time
is given, we don't really care about matching it to any time units. It's only if historical samples (viasample_time
) are provided that the time units become important.