Closed: hyanwong closed this issue 1 month ago
It wouldn't be quite that simple, though, because of singletons etc. How would you get the array of site times derived from the tree sequence to have the same length as the original Zarr? Easy enough to add utility functions to do so, I guess.
I'm assuming (for didactic/teaching purposes) that we are not masking out any sites, and that singletons are phased.
I presume that pipelines with real data would, indeed, need to have some sort of wrapper functions. A VariantData method that padded out sites with a "missing data" value would indeed be convenient. E.g.
site_times_from_ts = tsdate.util.sites_time_from_ts(dated_ts)
all_site_times = data.fill_variant_mask(site_times_from_ts, fill_value=np.nan) # new function
new_data = VariantData("myfile.vcz", variant_time=all_site_times)
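For concreteness, here is a minimal sketch of how the proposed fill_variant_mask helper could behave. Note that fill_variant_mask is a hypothetical function (suggested above, not existing tsinfer/tsdate API), and the convention that True in the mask means "site excluded from inference" is an assumption:

```python
import numpy as np

def fill_variant_mask(values, mask, fill_value=np.nan):
    # Hypothetical helper: expand `values` (one entry per unmasked site)
    # into an array of length len(mask), putting `fill_value` at masked
    # sites. Assumes mask is True where a site was excluded.
    out = np.full(mask.shape, fill_value, dtype=float)
    out[~mask] = values
    return out

# Toy example: 5 sites in the Zarr, sites 1 and 3 masked out
mask = np.array([False, True, False, True, False])
times = np.array([0.1, 0.2, 0.3])  # one time per retained site
print(fill_variant_mask(times, mask))  # [0.1 nan 0.2 nan 0.3]
```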
I think it also needs the positions, though, or else it can't merge. Something like
site_times = data.pad_variants_array(dated_ts.sites_position.astype(int), site_times_from_ts, fill=np.nan)
I was thinking that the data object should know which sites were masked out. But I totally agree that it's much less error-prone if we use the positions. So, picking up on your idea, I think the neatest thing would be to have an alternative tsdate function that also returns the positions, e.g. "mut_node_time_from_ts". Then we could simply do:
pos, times = tsdate.util.mut_node_time_from_ts(dated_ts)
data.pad_variants_array(pos, times, fill=np.nan) # maybe try the (int) conversion for positions within the pad_ method?
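To pin down the semantics being proposed, here is a hedged sketch of what a position-based pad_variants_array could do. The function name, argument order, and the explicit all_positions argument (standing in for the site positions the data object would read from the .vcz file) are all assumptions for illustration:

```python
import numpy as np

def pad_variants_array(all_positions, positions, values, fill=np.nan):
    # Sketch of the proposed pad_variants_array: place each value at the
    # index of its site position within the full array of Zarr positions,
    # filling all other sites with `fill`. In the real method,
    # `all_positions` would come from the VariantData object itself.
    out = np.full(len(all_positions), fill, dtype=float)
    idx = np.searchsorted(all_positions, positions)
    # Sanity check: every provided position must exist in the full list
    assert np.array_equal(all_positions[idx], positions)
    out[idx] = values
    return out

all_pos = np.array([10, 25, 40, 55, 70])  # positions stored in the Zarr
print(pad_variants_array(all_pos, np.array([25, 55]), np.array([1.5, 2.5])))
# [nan 1.5 nan 2.5 nan]
```

Matching on positions rather than on a stored mask means the call fails loudly (via the sanity check) if the tree sequence and the Zarr disagree, which is the error-resistance benefit mentioned above.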
I think we want the time of the node below the mutation (since we are really trying to use the best estimate of the time for a node, rather than the time of the mutation above that node), hence the suggested method name. It's worth keeping this as a method of tsdate, as we ideally want to use the unconstrained times in the tsdate-encoded metadata, rather than the times from ts.nodes_time.
Another thing: we relatively often want to provide a mask that is calculated from something in the Zarr file, e.g. "mask if variant_quality < 20". Should we recommend that this be done in sgkit, or (as below) by creating a numpy array in a preprocessing step, or by allowing the mask parameter to be a lambda function?
mask = zarr.load("demo.vcz")["variant_quality"] < 20
sd = SampleData("demo.vcz", variant_mask=mask)
# or allow a function that takes the zarr object as the only param
# probably not worth the complexity, unless it is greatly more efficient
sd = SampleData("demo.vcz", variant_mask=lambda z: z.variant_quality < 20)
I don't think we want to recommend doing any real compute in the constructor - in practice these masks will probably involve QC on call-level fields that will need to be done in advance and saved somewhere.
So, in practice, I think allowing the mask to be a function would just lead to complexity and confusion.
@jeromekelleher said on slack
This would make it easy to pass in new ages for re-inference without messing with the data file on disk:
And as Jerome says, it adds flexibility in the case that you don't have write access to the original Zarr.