Nice summary, thanks @eric-czech. I'd be instinctively inclined to prefer your "ex-situ" pattern above, but I'm not an experienced enough xarray user to have a real opinion.
Something to consider: what if we have a common keyword arg to functions, `update=False`, which updates the input dataset with the result variables if `True`? It'll still return the new variables as before in this case; it's just whether the input is updated or not.
> what if we have a common keyword arg to functions `update=False`
I like that. A way forward there that makes sense to me is, at the very end of each function, to return the new variables (`update=False`) OR merge them with the provided dataset and return that result (`update=True`).

I would mildly recommend we try to get ahead of the fact that the merging can get complicated and consider something like a `merge` argument instead, where it has one of 3 values: true, false, or kwargs for `xr.merge`.

**EDIT**: Actually, the more I think about using that, the more I think it makes sense to either set `update=False` or have the function take full control over the merging.
I'm with you until (3) here @eric-czech:
> At the very end of each function, return the new variables or merge them in based on the flag
I would say "At the very end of each function merge the new variables in based on the flag and return them". There's no reason not to return the variables in the `update=True` case, right? Or do you mean that we return the input, updated dataset in this case? The semantics are a bit simpler if we always return the same thing, aren't they?
I agree `update=False` should be the default. I'd find it surprising if an API modifies its arguments by default (although I hate to invoke the principle of least surprise!). Some (otherwise very sensible) folks get quite religious about functional patterns, so we don't want to turn them off the API on a matter of principle.
> At the very end of each function merge the new variables in based on the flag and return them
Yea sorry, that's what I meant. Either you get the original dataset with the new variables added to it or you get a dataset just containing the new variables.
> I agree `update=False` should be the default. I'd find it surprising if an API modifies its arguments by default
Me too. A pipeline would then look like `ds.pipe(fn1, update=True).pipe(fn2, update=True)`. IIRC @tomwhite suggested trying to support that better twice now, so it would be good to hear his thoughts on it. The best case I can think of for using pipelines in a GWAS workflow would be at the beginning, when you attach all the summary stats, so I haven't felt very strongly in favor of them over `xr.merge([ds, fn1(ds), fn2(ds)])`.
> I agree `update=False` should be the default. I'd find it surprising if an API modifies its arguments by default
I don't think this is right. The merged dataset is a new object, so the original dataset is unchanged. (They share backing data arrays, of course, but these should be viewed as immutable.)
This is why I think there is a strong argument for making `update=True` the default, since for new users it's easy to explain that everything is accumulated in the newly returned dataset. This is the model that I have in my mind, possibly influenced by Scanpy, which uses AnnData containers. (And for rare operations that produce lots of new variables, the user can set `update=False`.)
Perhaps it should be called `merge` instead of `update`, to avoid the implication that the existing dataset is being mutated? And that would also echo xarray terminology.
> I don't think this is right. The merged dataset is a new object, so the original dataset is unchanged
Ah yea, agreed on that as long as we don't do `ds[some_var] = ...` in the functions. I meant it more like it's surprising to me that it seems like the API is making modifications to the original dataset. +1 to `merge` instead of `update`, even if just for that reason.
Sounds good @tomwhite and @eric-czech - +1 to `merge` too.
So, just so I'm totally clear here, the proposed pattern is

```python
ds2 = sgkit.some_function(ds1, merge=True)
ds3 = sgkit.some_function(ds1, merge=False)
```
We have `(ds1 is not ds2) and (ds3 is not ds1)`.
- `ds2` is a shallow copy of `ds1`, with the extra variables computed by `some_function` merged in.
- `ds3` is a new dataset consisting of just the variables computed by `some_function`.

Do I have this right? If so, SGTM, and yes, `merge=True` is the right default.
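A minimal sketch of those semantics, assuming a made-up `some_function` that computes a single new variable (illustrative only, not sgkit code):

```python
import numpy as np
import xarray as xr

def some_function(ds: xr.Dataset, merge: bool = True) -> xr.Dataset:
    """Illustrative only: compute one new variable from ``ds``."""
    new_vars = xr.Dataset({"variant_mean": ds["call_genotype"].mean(dim="samples")})
    # merge=True: return a *new* Dataset containing the input variables plus the
    # results (ds itself is never mutated); merge=False: return only the new variables.
    return ds.merge(new_vars) if merge else new_vars

ds1 = xr.Dataset(
    {"call_genotype": (("variants", "samples"), np.zeros((5, 2), dtype="int8"))}
)
ds2 = some_function(ds1, merge=True)   # ds1's variables plus "variant_mean"
ds3 = some_function(ds1, merge=False)  # just "variant_mean"
assert (ds1 is not ds2) and (ds3 is not ds1)
assert "variant_mean" not in ds1  # the input dataset is unchanged
```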
> Do I have this right?
Exactly.
> for new users it's easy to explain that everything is accumulated in the newly returned dataset. This is the model that I have in my mind, possibly influenced by Scanpy
Hail does it too, and I agree it's one less thing for new users to learn. In the interest of moving this forward, it sounds like we're all in agreement on https://github.com/pystatgen/sgkit/issues/103#issuecomment-672091886 (the numbered steps a few posts above), with the only difference being we'll call the boolean flag `merge`.
What is being described here is very similar to the `inplace` argument in most pandas methods, e.g. `DataFrame.drop`:

> `inplace : bool, default False`
> If False, return a copy. Otherwise, do operation inplace and return None.
I think it's fair to assume that most users will have at least passing familiarity with pandas and hence replicating that behavior may be the most intuitive.
Pandas defaults to creating a new dataframe with the changes applied to that copy (not the input arg). The equivalent for an xarray object would be:
> The merged dataset is a new object, so the original dataset is unchanged. (They share backing data arrays, of course, but these should be viewed as immutable.)
Edit: Though I'm not exactly certain how pandas deals with shallow vs. deep copies when creating the new DataFrame.
Just to clarify, in @jeromekelleher's example the default behavior `ds2 = sgkit.some_function(ds1, merge=True)` would be analogous to the default behavior in pandas of `df2 = df1.some_method(inplace=False)`, which is probably quite intuitive for Pandas users.
But `ds3 = sgkit.some_function(ds1, merge=False)` is obviously completely different to `df1.some_method(inplace=True)`, so I'm not suggesting using the same keyword.
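For reference, a small illustration of the pandas behaviour being compared against; the parallels to the proposed `merge` flag are noted in the comments and are an interpretation of the discussion above, not an sgkit API:

```python
import pandas as pd

df1 = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

# pandas default (inplace=False): the input is untouched and a modified copy is
# returned -- analogous to merge=True returning a new, merged Dataset.
df2 = df1.drop(columns=["y"])
assert "y" in df1.columns and "y" not in df2.columns

# inplace=True mutates df1 and returns None -- there is no equivalent in the
# proposal, since merge=False still returns a new Dataset (just the new
# variables) and never mutates its input.
ret = df1.drop(columns=["y"], inplace=True)
assert ret is None and "y" not in df1.columns
```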
Thanks @timothymillar, this is very helpful input and great to know.
> it sounds like we're all in agreement on https://github.com/pystatgen/sgkit/issues/103#issuecomment-672091886 (the numbered steps a few posts above) with the only difference being we'll call the boolean flag `merge`.
+1
One other thing I'd like to bring up here is that ideas were floated in the past to add arguments to control the names of the variables that get created. Given that we want to standardize on a global naming convention and that some functions will produce a large number of variables, I would be mildly opposed to that.
One practical situation I can think of that could require customizing variable names is when trying to prefix or suffix them as part of some workflow that colocates the same variables from different sources. An example would be if I wanted a dataset containing both 1000 Genomes and UKB variables aligned on some coordinates or index. I think the ability to "align" datasets like that is actually a really killer feature for sgkit and that it's also something that's supported well over multiple Datasets by `xr.align` or `xr.sel` with indexes (so customizing names isn't really necessary).
I can almost imagine some scenarios where something like a `tf.name_scope` would be helpful for controlling variable suffixes/prefixes, but it also seems fairly easy to work around with `fn(ds[...].rename_vars(...), merge=False).rename_vars(...)`.
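As a rough sketch of that kind of workaround (all dataset and variable names below are made up), prefixing variables from two sources and aligning them on a shared index might look like:

```python
import numpy as np
import xarray as xr

# Two hypothetical sources exposing the same variable name, indexed by variant id.
kg = xr.Dataset(
    {"variant_af": ("variants", np.array([0.10, 0.20, 0.30]))},
    coords={"variants": ["rs1", "rs2", "rs3"]},
)
ukb = xr.Dataset(
    {"variant_af": ("variants", np.array([0.15, 0.25]))},
    coords={"variants": ["rs2", "rs3"]},
)

# Rename outside of the sgkit functions rather than customizing names inside them,
# then align both datasets on the shared "variants" index and combine.
kg = kg.rename_vars({"variant_af": "kg_variant_af"})
ukb = ukb.rename_vars({"variant_af": "ukb_variant_af"})
kg_aligned, ukb_aligned = xr.align(kg, ukb, join="inner")
combined = xr.merge([kg_aligned, ukb_aligned])  # kg_variant_af + ukb_variant_af for rs2, rs3
```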
Anyways, I think it would be good to know where we all stand on that before making changes.
> One other thing I'd like to bring up here is that ideas were floated in the past to add arguments to control the names of the variables that get created.
I don't really have an opinion. Let's build some stuff and think about making rules later, would be my take.
Related to this, what happens if a value in the input dataset is re-calculated? Will this be common? For example, there are multiple methods for calculating expected heterozygosity; will every method produce a unique variable name, or will they all result in the same variable name to ease downstream use?
If a user re-calculates `'heterozygosity_expected'` (with `merge=True`) on a dataset that already contains a value of that name, then they should probably see a warning, something like:

```
MergeWarning: The following values in the input dataset will be replaced in the output: 'heterozygosity_expected'
```
The warnings can then be promoted to errors in 'production' pipelines to ensure that there is no variable clobbering.
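For example, assuming sgkit defined a `MergeWarning` class (the name and the helper below are hypothetical), promoting it to an error would just be standard `warnings` filtering:

```python
import warnings

class MergeWarning(UserWarning):
    """Hypothetical warning emitted when merge=True would overwrite existing variables."""

def warn_on_clobber(existing_vars, new_vars):
    # Inside a function, warn if any to-be-merged variable already exists.
    clobbered = sorted(set(existing_vars) & set(new_vars))
    if clobbered:
        warnings.warn(
            f"The following values in the input dataset will be replaced in the output: {clobbered}",
            MergeWarning,
        )

# In a 'production' pipeline, promote the warning to an error:
warnings.simplefilter("error", MergeWarning)
try:
    warn_on_clobber({"heterozygosity_expected"}, {"heterozygosity_expected"})
except MergeWarning as e:
    print("refusing to clobber:", e)
```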
> One other thing I'd like to bring up here is that ideas were floated in the past to add arguments to control the names of the variables that get created.
A simple-minded approach to this would be to include the arguments `map_inputs` and/or `map_outputs`, which each take a dict mapping the 'default' variable names to custom names.
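A rough sketch of how such arguments could be applied via renaming (the `map_inputs`/`map_outputs` semantics and the helper name are assumptions, not an agreed API):

```python
from typing import Callable, Dict, Optional
import xarray as xr

def with_name_maps(
    fn: Callable[[xr.Dataset], xr.Dataset],
    ds: xr.Dataset,
    map_inputs: Optional[Dict[str, str]] = None,   # {default name: custom name}
    map_outputs: Optional[Dict[str, str]] = None,  # {default name: custom name}
) -> xr.Dataset:
    """Run ``fn`` translating custom variable names to the defaults on the way in
    and back to custom names on the way out."""
    if map_inputs:
        ds = ds.rename({custom: default for default, custom in map_inputs.items()})
    out = fn(ds)
    if map_outputs:
        out = out.rename(map_outputs)
    return out
```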
Today I've been trying to create a PR for #43 and I'm in a little bit of a pickle. We currently have only a handful of computation functions, but already a bunch of inconsistencies:

- some functions return a `Dataset` (`gwas_linear_regression`) whilst others return a `DataArray` (`count_alleles`)
- some functions allow specifying variable names (`hardy_weinberg_test`), others don't allow that at all (`count_alleles`), and others require some of them to be provided (`gwas_linear_regression`)

I would argue that a consistent API is something we should strive for (and we currently have only ~4 functions). I can see that there is already an argument made for the 1st issue here - and I fully support always returning `Dataset`. Regarding the 2nd issue: do we always list all variables used (i/o maps), do we list some and when, etc.? I would argue we should try to find a way to pass that information as part of the dataset (in attributes/schema or wrapper)? Alternatively we could also allow overriding all variable names for all functions. What do you think?
> Alternatively we could also allow overriding all variable names for all functions. What do you think?
TBH, what I'd personally like most as a user is to be able to override the input variable names each function reads, with output variables keeping their default names (renaming those afterwards is up to me). This would mean I can use whatever naming conventions I'd like for any variables in any context, and it would be ok if some of the input/output variables are conditional.
This contradicts the idea of a global namespace for variables (like I mentioned in https://github.com/pystatgen/sgkit/issues/87#issuecomment-672231814) but it would be most flexible. I see it as the polar opposite to any kind of "schemification" that would enforce the global names.
> Regarding the 2nd issue: do we always list all variables used (i/o maps), do we list some and when, etc.?
I think if we go the schemification route, it may make sense to do away with any control the user has over variable names? The two extremes I see are `fn(ds: TypedDataset) -> SomeOtherTypedDataset` and `fn(ds: Dataset, input_vars: Dict) -> Dataset  # where output is mutually exclusive w/ input`.
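Sketching those two extremes as signatures only (every class, alias, and parameter name here is illustrative, not a proposal):

```python
from typing import Dict, NewType
import xarray as xr

# Extreme 1: fully "schemified" -- variable names are fixed by (hypothetical)
# typed dataset aliases, and the user has no control over them.
CallGenotypeDataset = NewType("CallGenotypeDataset", xr.Dataset)
AlleleCountDataset = NewType("AlleleCountDataset", xr.Dataset)

def count_alleles_typed(ds: CallGenotypeDataset) -> AlleleCountDataset:
    ...

# Extreme 2: no global names -- the caller maps the variable names a function
# should read, and the output contains only the variables the function created.
def count_alleles_mapped(ds: xr.Dataset, input_vars: Dict[str, str]) -> xr.Dataset:
    ...
```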
I completely agree we should find something more consistent!
Thanks for raising the inconsistencies @ravwojdyla. I think where it is required to specify variable names, that should be clear from required parameters in the function (e.g. `covariates` for `gwas_linear_regression()`), and if they are optional then the Python parameter would be optional (e.g. `genotype_counts` for `hardy_weinberg_test()`). So that seems OK to me at the moment.
As a new user I would expect to be able to run a pipeline with all the defaults and have it just work. So that implies there is some kind of global namespace.
I like Eric's idea of being able to override all input variables, but no output variables (the user is responsible for that).
And +1 to making `count_alleles` return a `Dataset`.
Can this be closed with the merge of #217?
Not every function will fit the `fn(ds: Dataset, ...) -> Dataset` signature, but for the large majority that do, we have so far been adopting a convention that only returns newly created variables. Another option would be to always try to write those variables into the input dataset. For lack of a better phrase, I'll call the former "ex situ" updates and the latter "in situ" updates. Here are some pros/cons of each:

**Ex situ updates**

- merging results back in is left to the caller: `ds.merge(fn(ds))` is clunky if you wanted the variables merged in the first place
- results from several functions can be combined explicitly, e.g. `ds = xr.merge([ fn1(ds), fn2(ds) ])`

**In situ updates**

- chaining operations works naturally, e.g. `ds.pipe(count_alleles).pipe(allele_frequency)`
My more opinionated view of this is that the ex situ updates are best in larger, more sophisticated workflows. I say that because I think those workflows will be very much dominated by Xarray code with calls to individual sgkit functions interspersed within it, so the need for chaining sgkit operations is not very high and the need to transform/interrogate results before merging them in is higher.
It would be great to find some way to get all of those advantages above though. Perhaps there is a way to do something like this more elegantly:
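(Purely as a hypothetical illustration of the kind of helper that might tie the two styles together -- the decorator below and its behaviour are assumptions, not anything that has been agreed:)

```python
import functools
import xarray as xr

def dataset_function(fn):
    """Hypothetical decorator: the wrapped function computes and returns only the
    new variables, and the caller chooses whether they get merged with the input."""
    @functools.wraps(fn)
    def wrapper(ds: xr.Dataset, *args, merge: bool = True, **kwargs) -> xr.Dataset:
        new_vars = fn(ds, *args, **kwargs)
        # Chaining still works (each call returns a Dataset the next call can use),
        # while the "ex situ" behaviour remains available via merge=False.
        return ds.merge(new_vars) if merge else new_vars
    return wrapper
```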
A potential issue with this is that functions may rely on variables from other functions (see https://github.com/pystatgen/sgkit/pull/102#issue-465519140) and add them when necessary, which would mean the set of output variables isn't always the same. I don't think this is actually an issue with Dask as long as the definition of the variables is fast. It would be for most functions and Dask will remove any redundancies, so presumably we wouldn't have to necessarily return or add them (i.e. this affects both the in situ and ex situ strategies). Using numpy variables definitely wouldn't work that way but I'm not sure that use case is as practical outside of testing anyhow.
Some of @ravwojdyla's work to "schemify" function results would make it easy to know what variables a function produces (https://github.com/pystatgen/sgkit/issues/43#issuecomment-669478348). There could be some overlap here with how that works out too.