Nice summary, thanks @eric-czech. I'd be instinctively inclined to prefer your "ex-situ" pattern above, but I'm not an experienced enough xarray user to have a real opinion.
Something to consider: what if we have a common keyword arg to functions, `update=False`, which updates the input dataset with the result variables if `True`? It'll still return the new variables as before in this case; it's just whether the input is updated or not.
> what if we have a common keyword arg to functions `update=False`
I like that. A way forward there that makes sense to me is, at the very end of each function, to return the new variables (`update=False`) OR merge them with the provided dataset and return that result (`update=True`).

I would mildly recommend we try to get ahead of the fact that the merging can get complicated and consider something like a `merge` argument instead, where it has one of 3 values: true, false, or kwargs for `xr.merge`.

**EDIT**: Actually, the more I think about using that, the more I think it makes sense to either set `update=False` or have the function take full control over the merging.
I'm with you until (3) here @eric-czech:
> At the very end of each function, return the new variables or merge them in based on the flag
I would say "At the very end of each function merge the new variables in based on the flag and return them". There's no reason not to return the variables in the `update=True` case, right? Or do you mean that we return the input, updated dataset in this case? The semantics are a bit simpler if we always return the same thing, aren't they?
I agree `update=False` should be the default. I'd find it surprising if an API modifies its arguments by default (although I hate to invoke the principle of least surprise!). Some (otherwise very sensible) folks get quite religious about functional patterns, so we don't want to turn them off the API on a matter of principle.
> At the very end of each function merge the new variables in based on the flag and return them
Yea sorry, that's what I meant. Either you get the original dataset with the new variables added to it or you get a dataset just containing the new variables.
> I agree `update=False` should be the default. I'd find it surprising if an API modifies its arguments by default
Me too. A pipeline would then look like `ds.pipe(fn1, update=True).pipe(fn2, update=True)`. IIRC @tomwhite suggested trying to support that better twice now, so it would be good to hear his thoughts on it. The best case I can think of for using pipelines in a GWAS workflow would be at the beginning, when you attach all the summary stats, so I haven't felt very strongly in favor of them over `xr.merge([ds, fn1(ds), fn2(ds)])`.
> I agree `update=False` should be the default. I'd find it surprising if an API modifies its arguments by default
I don't think this is right. The merged dataset is a new object, so the original dataset is unchanged. (They share backing data arrays, of course, but these should be viewed as immutable.)
This is why I think there is a strong argument for making `update=True` the default, since for new users it's easy to explain that everything is accumulated in the newly returned dataset. This is the model that I have in my mind, possibly influenced by Scanpy, which uses AnnData containers. (And for rare operations that produce lots of new variables, the user can set `update=False`.)
Perhaps it should be called `merge` instead of `update`, to avoid the implication that the existing dataset is being mutated? And that would also echo xarray terminology.
> I don't think this is right. The merged dataset is a new object, so the original dataset is unchanged
Ah yea, agreed on that as long as we don't do `ds[some_var] = ...` in the functions. I meant it more like it's surprising to me that it seems like the API is making modifications to the original dataset. +1 to `merge` instead of `update`, even if just for that reason.
Sounds good @tomwhite and @eric-czech - +1 to `merge` too.
So, just so I'm totally clear here, the proposed pattern is

```python
ds2 = sgkit.some_function(ds1, merge=True)
ds3 = sgkit.some_function(ds1, merge=False)
```
We have `(ds1 is not ds2) and (ds3 is not ds1)`.
- `ds2` is a shallow copy of `ds1`, with the extra variables computed by `some_function` merged in.
- `ds3` is a new dataset consisting of just the variables computed by `some_function`.

Do I have this right? If so, SGTM, and yes, `merge=True` is the right default.
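A minimal sketch of those semantics, assuming a made-up `some_function` that computes a single new variable (illustrative only, not sgkit code):

```python
import numpy as np
import xarray as xr

def some_function(ds: xr.Dataset, merge: bool = True) -> xr.Dataset:
    """Illustrative only: compute one new variable from ``ds``."""
    new_vars = xr.Dataset({"variant_mean": ds["call_genotype"].mean(dim="samples")})
    # merge=True: return a *new* Dataset containing the input variables plus the
    # results (ds itself is never mutated); merge=False: return only the new variables.
    return ds.merge(new_vars) if merge else new_vars

ds1 = xr.Dataset(
    {"call_genotype": (("variants", "samples"), np.zeros((5, 2), dtype="int8"))}
)
ds2 = some_function(ds1, merge=True)   # ds1's variables plus "variant_mean"
ds3 = some_function(ds1, merge=False)  # just "variant_mean"
assert (ds1 is not ds2) and (ds3 is not ds1)
assert "variant_mean" not in ds1  # the input dataset is unchanged
```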
> Do I have this right?
Exactly.
> for new users it's easy to explain that everything is accumulated in the newly returned dataset. This is the model that I have in my mind, possibly influenced by Scanpy
Hail does it too, and I agree it's one less thing for new users to learn. In the interest of moving this forward, it sounds like we're all in agreement on https://github.com/pystatgen/sgkit/issues/103#issuecomment-672091886 (the numbered steps a few posts above), with the only difference being we'll call the boolean flag `merge`.
What is being described here is very similar to the `inplace` argument in most pandas methods, e.g. `DataFrame.drop`:

> `inplace : bool, default False`
> If False, return a copy. Otherwise, do operation inplace and return None.
I think it's fair to assume that most users will have at least passing familiarity with pandas and hence replicating that behavior may be the most intuitive.
Pandas defaults to creating a new dataframe with the changes applied to that copy (not the input arg). The equivalent for an xarray object would be:
> The merged dataset is a new object, so the original dataset is unchanged. (They share backing data arrays, of course, but these should be viewed as immutable.)
Edit: Though I'm not exactly certain how pandas deals with shallow vs. deep copies when creating the new DataFrame.
Just to clarify, in @jeromekelleher's example the default behavior `ds2 = sgkit.some_function(ds1, merge=True)` would be analogous to the default behavior in pandas of `df2 = df1.some_method(inplace=False)`, which is probably quite intuitive for Pandas users.
But `ds3 = sgkit.some_function(ds1, merge=False)` is obviously completely different to `df1.some_method(inplace=True)`, so I'm not suggesting using the same keyword.
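For reference, a small illustration of the pandas behaviour being compared against; the parallels to the proposed `merge` flag are noted in the comments and are an interpretation of the discussion above, not an sgkit API:

```python
import pandas as pd

df1 = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

# pandas default (inplace=False): the input is untouched and a modified copy is
# returned -- analogous to merge=True returning a new, merged Dataset.
df2 = df1.drop(columns=["y"])
assert "y" in df1.columns and "y" not in df2.columns

# inplace=True mutates df1 and returns None -- there is no equivalent in the
# proposal, since merge=False still returns a new Dataset (just the new
# variables) and never mutates its input.
ret = df1.drop(columns=["y"], inplace=True)
assert ret is None and "y" not in df1.columns
```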
Thanks @timothymillar, this is very helpful input and great to know.
> it sounds like we're all in agreement on https://github.com/pystatgen/sgkit/issues/103#issuecomment-672091886 (the numbered steps a few posts above) with the only difference being we'll call the boolean flag `merge`.
+1
One other thing I'd like to bring up here is that ideas were floated in the past to add arguments to control the names of the variables that get created. Given that we want to standardize on a global naming convention and that some functions will produce a large number of variables, I would be mildly opposed to that.
One practical situation I can think of that could require customizing variable names is when trying to prefix or suffix them as part of some workflow that colocates the same variables from different sources. An example would be if I wanted a dataset containing both 1000 Genomes and UKB variables aligned on some coordinates or index. I think the ability to "align" datasets like that is actually a really killer feature for sgkit and that it's also something that's supported well over multiple Datasets by `xr.align` or `xr.sel` with indexes (so customizing names isn't really necessary).
I can almost imagine some scenarios where something like a `tf.name_scope` would be helpful for controlling variable suffixes/prefixes, but it also seems fairly easy to work around with `fn(ds[...].rename_vars(...), merge=False).rename_vars(...)`.
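As a rough sketch of that kind of workaround (all dataset and variable names below are made up), prefixing variables from two sources and aligning them on a shared index might look like:

```python
import numpy as np
import xarray as xr

# Two hypothetical sources exposing the same variable name, indexed by variant id.
kg = xr.Dataset(
    {"variant_af": ("variants", np.array([0.10, 0.20, 0.30]))},
    coords={"variants": ["rs1", "rs2", "rs3"]},
)
ukb = xr.Dataset(
    {"variant_af": ("variants", np.array([0.15, 0.25]))},
    coords={"variants": ["rs2", "rs3"]},
)

# Rename outside of the sgkit functions rather than customizing names inside them,
# then align both datasets on the shared "variants" index and combine.
kg = kg.rename_vars({"variant_af": "kg_variant_af"})
ukb = ukb.rename_vars({"variant_af": "ukb_variant_af"})
kg_aligned, ukb_aligned = xr.align(kg, ukb, join="inner")
combined = xr.merge([kg_aligned, ukb_aligned])  # kg_variant_af + ukb_variant_af for rs2, rs3
```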
Anyways, I think it would be good to know where we all stand on that before making changes.
> One other thing I'd like to bring up here is that ideas were floated in the past to add arguments to control the names of the variables that get created.
I don't really have an opinion. Let's build some stuff and think about making rules later, would be my take.
Related to this, what happens if a value in the input dataset is re-calculated? Will this be common? For example, there are multiple methods for calculating expected heterozygosity; will every method produce a unique variable name, or will they all result in the same variable name to ease downstream use?
If a user re-calculates `'heterozygosity_expected'` (with `merge=True`) on a dataset that already contains a value of that name, then they should probably see a warning, something like:

```
MergeWarning: The following values in the input dataset will be replaced in the output: 'heterozygosity_expected'
```
The warnings can then be promoted to errors in 'production' pipelines to ensure that there is no variable clobbering.
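For example, assuming sgkit defined a `MergeWarning` class (the name and the helper below are hypothetical), promoting it to an error would just be standard `warnings` filtering:

```python
import warnings

class MergeWarning(UserWarning):
    """Hypothetical warning emitted when merge=True would overwrite existing variables."""

def warn_on_clobber(existing_vars, new_vars):
    # Inside a function, warn if any to-be-merged variable already exists.
    clobbered = sorted(set(existing_vars) & set(new_vars))
    if clobbered:
        warnings.warn(
            f"The following values in the input dataset will be replaced in the output: {clobbered}",
            MergeWarning,
        )

# In a 'production' pipeline, promote the warning to an error:
warnings.simplefilter("error", MergeWarning)
try:
    warn_on_clobber({"heterozygosity_expected"}, {"heterozygosity_expected"})
except MergeWarning as e:
    print("refusing to clobber:", e)
```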
> One other thing I'd like to bring up here is that ideas were floated in the past to add arguments to control the names of the variables that get created.
A simple-minded approach to this would be to include the arguments `map_inputs` and/or `map_outputs`, which each take a dict mapping the 'default' variable names to custom names.
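A rough sketch of how such arguments could be applied via renaming (the `map_inputs`/`map_outputs` semantics and the helper name are assumptions, not an agreed API):

```python
from typing import Callable, Dict, Optional
import xarray as xr

def with_name_maps(
    fn: Callable[[xr.Dataset], xr.Dataset],
    ds: xr.Dataset,
    map_inputs: Optional[Dict[str, str]] = None,   # {default name: custom name}
    map_outputs: Optional[Dict[str, str]] = None,  # {default name: custom name}
) -> xr.Dataset:
    """Run ``fn`` translating custom variable names to the defaults on the way in
    and back to custom names on the way out."""
    if map_inputs:
        ds = ds.rename({custom: default for default, custom in map_inputs.items()})
    out = fn(ds)
    if map_outputs:
        out = out.rename(map_outputs)
    return out
```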
Today I've been trying to create a PR for #43 and I'm in a little bit of a pickle. We currently have only a handful of computation functions, but already a bunch of inconsistencies:

- some functions return a `Dataset` (`gwas_linear_regression`) whilst others return a `DataArray` (`count_alleles`)
- some functions allow specifying variable names (`hardy_weinberg_test`), others don't allow that at all (`count_alleles`), and others require some of them to be provided (`gwas_linear_regression`)

I would argue that a consistent API is something we should strive for (and we currently have only ~4 functions). I can see that there is already an argument made for the 1st issue here - and I fully support always returning `Dataset`. Regarding the 2nd issue: do we always list all variables used (i/o maps), do we list some and when, etc.? I would argue we should try to find a way to pass that information as part of the dataset (in attributes/schema or wrapper)? Alternatively we could also allow overriding all variable names for all functions. What do you think?
> Alternatively we could also allow overriding all variable names for all functions. What do you think?
TBH, what I'd personally like most as a user is to be able to override the input variable names each function reads, with output variables keeping their default names (renaming those afterwards is up to me). This would mean I can use whatever naming conventions I'd like for any variables in any context, and it would be ok if some of the input/output variables are conditional.
This contradicts the idea of a global namespace for variables (like I mentioned in https://github.com/pystatgen/sgkit/issues/87#issuecomment-672231814) but it would be most flexible. I see it as the polar opposite to any kind of "schemification" that would enforce the global names.
> Regarding the 2nd issue: do we always list all variables used (i/o maps), do we list some and when, etc.?
I think if we go the schemification route, it may make sense to do away with any control the user has over variable names? The two extremes I see are `fn(ds: TypedDataset) -> SomeOtherTypedDataset` and `fn(ds: Dataset, input_vars: Dict) -> Dataset  # where output is mutually exclusive w/ input`.
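Sketching those two extremes as signatures only (every class, alias, and parameter name here is illustrative, not a proposal):

```python
from typing import Dict, NewType
import xarray as xr

# Extreme 1: fully "schemified" -- variable names are fixed by (hypothetical)
# typed dataset aliases, and the user has no control over them.
CallGenotypeDataset = NewType("CallGenotypeDataset", xr.Dataset)
AlleleCountDataset = NewType("AlleleCountDataset", xr.Dataset)

def count_alleles_typed(ds: CallGenotypeDataset) -> AlleleCountDataset:
    ...

# Extreme 2: no global names -- the caller maps the variable names a function
# should read, and the output contains only the variables the function created.
def count_alleles_mapped(ds: xr.Dataset, input_vars: Dict[str, str]) -> xr.Dataset:
    ...
```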
I completely agree we should find something more consistent!
Thanks for raising the inconsistencies @ravwojdyla. I think where it is required to specify variable names, that should be clear from required parameters in the function (e.g. `covariates` for `gwas_linear_regression()`), and if they are optional then the Python parameter would be optional (e.g. `genotype_counts` for `hardy_weinberg_test()`). So that seems OK to me at the moment.
As a new user I would expect to be able to run a pipeline with all the defaults and have it just work. So that implies there is some kind of global namespace.
I like Eric's idea of being able to override all input variables, but no output variables (the user is responsible for that).
And +1 to making `count_alleles` return a `Dataset`.
Can this be closed with the merge of #217?
Not every function will fit the `fn(ds: Dataset, ...) -> Dataset` signature, but for the large majority that do, we have so far been adopting a convention that only returns newly created variables. Another option would be to always try to write those variables into the input dataset. For lack of a better phrase, I'll call the former "ex situ" updates and the latter "in situ" updates. Here are some pros/cons of each:

**Ex situ updates**

- merging results back in is left to the caller: `ds.merge(fn(ds))` is clunky if you wanted the variables merged in the first place
- results from several functions can be combined explicitly, e.g. `ds = xr.merge([ fn1(ds), fn2(ds) ])`

**In situ updates**

- chaining operations works naturally, e.g. `ds.pipe(count_alleles).pipe(allele_frequency)`
My more opinionated view of this is that the ex situ updates are best in larger, more sophisticated workflows. I say that because I think those workflows will be very much dominated by Xarray code with calls to individual sgkit functions interspersed within it, so the need for chaining sgkit operations is not very high and the need to transform/interrogate results before merging them in is higher.
It would be great to find some way to get all of those advantages above though. Perhaps there is a way to do something like this more elegantly:
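(Purely as a hypothetical illustration of the kind of helper that might tie the two styles together -- the decorator below and its behaviour are assumptions, not anything that has been agreed:)

```python
import functools
import xarray as xr

def dataset_function(fn):
    """Hypothetical decorator: the wrapped function computes and returns only the
    new variables, and the caller chooses whether they get merged with the input."""
    @functools.wraps(fn)
    def wrapper(ds: xr.Dataset, *args, merge: bool = True, **kwargs) -> xr.Dataset:
        new_vars = fn(ds, *args, **kwargs)
        # Chaining still works (each call returns a Dataset the next call can use),
        # while the "ex situ" behaviour remains available via merge=False.
        return ds.merge(new_vars) if merge else new_vars
    return wrapper
```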
A potential issue with this is that functions may rely on variables from other functions (see https://github.com/pystatgen/sgkit/pull/102#issue-465519140) and add them when necessary, which would mean the set of output variables isn't always the same. I don't think this is actually an issue with Dask as long as the definition of the variables is fast. It would be for most functions and Dask will remove any redundancies, so presumably we wouldn't have to necessarily return or add them (i.e. this affects both the in situ and ex situ strategies). Using numpy variables definitely wouldn't work that way but I'm not sure that use case is as practical outside of testing anyhow.
Some of @ravwojdyla's work to "schemify" function results would make it easy to know what variables a function produces (https://github.com/pystatgen/sgkit/issues/43#issuecomment-669478348). There could be some overlap here with how that works out too.