related-sciences / gwas-analysis

GWAS data analysis experiments
Apache License 2.0
24 stars 6 forks source link

Fix #33. Handle missing values in PackGeneticBits #34

Closed ravwojdyla closed 4 years ago

ravwojdyla commented 4 years ago
eric-czech commented 4 years ago

Thanks @ravwojdyla! I'll take a look.

eric-czech commented 4 years ago

All looks good to me though I think the comment on missing value handling in general is something we should make a top priority. Don't see any reason to make it a blocker on these changes though.

hammer commented 4 years ago

I think the comment on missing value handling in general is something we should make a top priority

Start a discussion on SMAD Stat Gen?

ravwojdyla commented 4 years ago

@eric-czech thanks for all the context, it's great to know! As I was reading different parts of the code I noticed the use of masks and sentinel values and I was a bit confused. It makes a bit more sense now.

Regarding the different options, I am +1 to @hammer suggestion to open a topic on SMAD about this. And my opinion about this problem:

Do we always store data like this as uint knowing full well that the missing values will encode to some nonsense value, possibly within the domain of non-missing values? It sounds dangerous but I'd argue that the danger introduced by needing a sentinel strategy for every single data type is similarly dangerous.

Definitely sounds dangerous, and like in this PR might lead to different values of an array following a serde cycle (which one would expect to be deterministic and return exactly the same array), it's not hard to image that it might lead to confusion if there is bug somewhere further down the pipeline that would for example only exhibit itself on array after serde (which actually was the case in my 1kGP pipeline).

eric-czech commented 4 years ago

I am +1 to @hammer suggestion to open a topic on SMAD about this we ended up with a performant underlying data structure (not necessarily typed/safe) and a well typed/safe API to safeguard the user from making mistakes or understanding the details like encoding of missing values

Want to propose that then in a SMAD topic? I'm happy to write it up too at some point, but it would be great to get your thoughts on past experiences.

pure numpy arrays with sentient values for missing data and try to keep it away from the user unless they reach out for it (and even then missing values should be properly tie to its container type)

These seem like contradictory statements if I'm not missing something, since the use of sentinels makes the proper container type (I'm assuming we're talking about dtype) impossible. How do you see solutions like this working with quantized floats, booleans or arrays with indexes instead of counts? Or with counts like this that should be unsigned?

Also, I'm not sure I follow what you mean on how sentinel values keep missingness away from the user. For example in NEP-26, which summarizes all the different proposals for handling missing data, that's essentially the only solution that isn't trying to hide the details of missingness from the user to at least some degree:

if you use the “-99” approach then it’s a pitfall you have to remember to check for explicitly on literally every calculation you ever do

Definitely sounds dangerous

It is though exactly how numpy and Astropy masked data structures work. Note to self to look into this more: https://docs.astropy.org/en/stable/table/masking.html

and like in this PR might lead to different values of an array following a serde cycle (which one would expect to be deterministic and return exactly the same array)

If we use anything other than sentinels, the code would always return the same thing on a serde cycle if we fixed my mistake in letting values ever be negative and demoting uint8 to int8 before running through an encoder that assumed uints.

the cost of dec[dec == 3] = -1 is likely pretty low in comparison to the IO costs

Agreed but for any math on the arrays, like converting calls to allele counts, the sentinel values will change (-1 for calls would become -2 in the counts for biallelic diploid data) so we'd have to also do this at the end of every operation if we wanted to maintain the same sentinel value throughout a pipeline including IO.

ravwojdyla commented 4 years ago

@eric-czech sure, I can submit that proposal to SMAD.

pure numpy arrays with sentient values for missing data and try to keep it away from the user unless they reach out for it (and even then missing values should be properly tie to its container type)

These seem like contradictory statements if I'm not missing something, since the use of sentinels makes the proper container type (I'm assuming we're talking about dtype) impossible. How do you see solutions like this working with quantized floats, booleans or arrays with indexes instead of counts? Or with counts like this that should be unsigned?

Sorry I wasn't clear there, the numpy arrays with sentient values would be encapsulated inside a type safe container/type that would understands the notion of missing/sentient values for that specific type. For example:

# FYI1: I'm skipping integration with xarray/datasets here to simplify the example
# FYI2: also I haven't spend much time thinking about this API, just want to illustrate the idea:

T = TypeVar("T")

class GTData(Generic[T]):
    # in most cases users should not access the underlying data and instead use
    # typed API, nevertheless we should allow users to access this data for
    # cases not covered by the API
    def _unsafe_get_date(self) -> "array":  # this would return numpy, dask etc array)
        ...

    @classmethod
    def _unsafe_from_data(cls, array) -> "GTData[T]":
        ...

# This mixin could also be a mixin, it doesn't matter for the sake of the example:
class GTMissingData(GTData[T]):
    missing_value: T
    ...

class AlleleCountGT(GTMissingData[T]):
    missing_value: np.int8 = -1
    ...

class CallGT(GTMissingData[T]):
    missing_value = -2
    ...

The high level API would work on these typed containers and provide ready to use functions, whilst still allow to access the low level unsafe storage for special cases, in which case it's the user's duty to operate on numpy arrays in a safe way (but this should not happen often). The implementation of typed methods would have to take into the account missing values and handle them, but that would be hidden away from the user, and we could even have a common helper methods on GTMissingData to work on data with missing values which would essentially delegate where applicable.

The more I think about it the more I think maybe it would make sense to start with sentinel values, and if we see that masks are needed or more optimal in some cases it would be easy to add in the example above (even as a mixin), potentially even having it lazily computed if needed.

eric-czech commented 4 years ago

The more I think about it the more I think maybe it would make sense to start with sentinel values

Why though? I'm following what you're proposing but I can't see any advantages to it over the alternatives in NEP-26. A common missing sentinel like np.NA makes sense to me (the best solution if it existed IMO), and a separate boolean mask makes sense (particularly since it's already supported in numpy), but making a different missing sentinel for our data structures means we can't use any generic, nan-aware array ops shared by both CallGT and AlleleCountGT.

ravwojdyla commented 4 years ago

@eric-czech I definitely agree - a common missing sentinel like NA would be best if existed. I guess what I am leaning more towards is to start somewhere, preferable with a simpler solution first, see where it breaks or underperform and optimize later. And here I want to point out that the missing_value could be np.nan, and thus we could leverage the nan-aware ops for float based arrays. I think a lot of these decisions can be iron out as we try to implement different use cases, and things will likely change along the way.