Open tomwhite opened 4 years ago
One possibility could be to:
create_*_dataset
function, make a check_*_dataset
function that checks the dimensions and data types for all fields added in create_*_dataset
, only if they're present
sg.count_alleles
would still work if a field like variant/id
isn't present, since it's not neededcall/genotype
isn't present, the check_genotype_call_dataset
function would not catch it and instead a KeyError would be thrown by Xarray when referenced (with the name of the missing variable)A solution I like a little more would be to:
check_dataset
function that can check all datasets and only looks to assert dtypes/dimensions on "managed" variables (also ignoring those that aren't present), or ones that we know should have a particular interpretationcheck_dataset
to support variable name remapping through some kind of optional dictA a third possibility that I think I like the most at the moment would be to:
get_variable(ds: Dataset, var: Hashable)
function that also has a sense of what "managed" variables there are that we are being opinionated about. It would then check dtypes/dims (and possibly other things in the future) based on the name of the variable and default to returning it from ds
directly if not managedcheck_*
call at the beginning of each internal method, only a preferred way to access variables that offers greater safetyDo any of those jump out at you?
Thanks for outlining the possibilities @eric-czech. All of these sound like they are lightweight checks of variable names, dtypes, shapes etc - so they could be performed as invariant checks internally before (or after) each operation.
I was thinking more about checks that are not so lightweight, and which might be better exposed as functions that the user could run. E.g. checking that probabilities sum to one. Are there any others?
Ah sorry, here are a few of the more expensive checks I think would be common:
variant/contig
are sortedvariant/id
are unique (or null)variant/pos
are all not nullvariant/contig
, variant/pos
, and variant/alleles
is unique
call/genotype
do not exceed alleles
dimension introduced most likely by 'variant/alleles` -- that correspondence is somewhat brittle and I was proposing we rely on it in things like this: https://github.com/pystatgen/sgkit/pull/36/files#diff-e2edcf5f78ccf0e17093a6cbe4afeda6R47ploidy
ploidy
] or np.nan
-- I can imagine a scenario where our -1 sentinel could leak through into the dosage sumsI'll add any more I run into, but I think most of those would be quite useful.
We've discussed the need to check that values in dataset arrays conform to certain invariants before or after running methods on them (e.g. https://github.com/pystatgen/sgkit/pull/36#discussion_r452950827). These might be non-trivial in performance time (i.e. not just checking metadata), so would be opt-in by users.
This issue is to add such a function. @eric-czech do you have a suggestion for a function we could start with here?