tskit-dev / tsinfer

Infer a tree sequence from genetic variation data.
GNU General Public License v3.0
56 stars 13 forks source link

VariantData: do we want to be able to specify individuals_time on construction? #968

Open hyanwong opened 2 months ago

hyanwong commented 2 months ago

Currently we take individuals_time from the zarr store, if we want to be able to use historical samples. Do we want to be able to specify that on VariantData creation as well (like sites_time), or are we treating this as an immutable property of the zarr dataset?

The argument for this is that there's no standard way to provide the date of historical samples in a read-only dataset. This is sometimes in one of the sample metadata fields in the VCF, but we would probably want to allow the user to define the name of the field in their case. To match the VCF-Zarr terminology, we probably want to call this parameter sample_time (saving it internally in an "individuals_time" array).

Related to this, how do we specify the time_units in tree sequence produced by the new inference process? I assume this too could be specified on creation of the VariantData object. I have changed my mind recently on what the time_units should refer to: the sites_time is only really needed to provide a relative order, so I think even if a sites_time is given, we don't really care about matching it to any time units. It's only if historical samples (via sample_time) are provided that the time units become important.

jeromekelleher commented 2 months ago

Time units should probably be a parameter somewhere, and not something we expect to be in the Zarr dataset?

It's unlikely that sites_time or individuals_time is ever actually going be something derived from a VCF, so I think we can just see them as numpy arrays that need to be specified to the functions where they are used.

hyanwong commented 2 months ago

Yes, I agree. I think that means we need to add the individuals_time parameter to the VariantData constructor, in the same way as for sites_time. I assume that we add time_units at the same time.