mitchelloharawild / distributional

Vectorised distributions for R
https://pkg.mitchelloharawild.com/distributional
GNU General Public License v3.0
94 stars 15 forks source link

median() and quantile() will error if passed na.rm parameter #72

Closed mjskay closed 2 years ago

mjskay commented 2 years ago

Unlike mean.distribution(), currently median.distribution() and quantile.distribution() will give errors if explicitly passed na.rm:

> x = dist_normal()
> mean(x, na.rm = TRUE)
[1] 0
> median(x, na.rm = TRUE)
Error in stats::qnorm(p, x[["mu"]], x[["sigma"]], ...) : 
  unused argument (na.rm = TRUE)
> quantile(x, 0.5, na.rm = TRUE)
Error in stats::qnorm(p, x[["mu"]], x[["sigma"]], ...) : 
  unused argument (na.rm = TRUE)

This makes the API a bit more cumbersome to use in ggdist, as I will have to add a special case for median when used with numerics versus when used with distributions. Could the distribution implementation of these functions ignore na.rm? In the case of median at least, the base generic includes na.rm in its signature so it should probably support it.

(also apologies for all the issues: I've been kicking the tires of {distributional} again as I prepare to increase its support and use in ggdist)

mitchelloharawild commented 2 years ago

I've improved handling of na.rm in median() - and found as you indicate that median.distribution() doesn't use distribution specific median methods :facepalm:.

I'm not so sure about quantile() methods, at least on an implementation front. Ignoring na.rm for non-sample distributions would require adding the argument to each method. Currently quantile(<dist_sample>) has na.rm=TRUE by default, however I think this should be changed for R consistency and NA awareness.

Any suggestions on how to handle NAs for quantile() functions?

(also apologies for all the issues: I've been kicking the tires of {distributional} again as I prepare to increase its support and use in ggdist)

Love them! :+1:

mjskay commented 2 years ago

I've improved handling of na.rm in median() - and found as you indicate that median.distribution() doesn't use distribution specific median methods 🤦.

Sweet thanks!

I'm not so sure about quantile() methods, at least on an implementation front. Ignoring na.rm for non-sample distributions would require adding the argument to each method. Currently quantile() has na.rm=TRUE by default, however I think this should be changed for R consistency and NA awareness.

Any suggestions on how to handle NAs for quantile() functions?

Hmm yeah. For me, support for na.rm in quantile is less important than median because I already have to special-case it a bit with distributional objects anyway, so there isn't a reduction in complexity in my code by supporting it (unlike with median, where having na.rm support helps a lot). So if the resolution is not to support it in quantile() I won't be torn up about it :).

One question I did have is: is it is even possible to create dist_sample() objects that contain NAs? I wasn't able to directly:

> dist_sample(list(c(1:10, NA)))
<distribution[1]>
[1] sample[10]
Warning message:
Missing sampled values have been removed from the sample distribution. 

In which case, maybe na.rm is a meaningless parameter even for dist_sample? In which case, you could consider having quantile.distribution() take na.rm but not do anything with it just for compatibility's sake.

mitchelloharawild commented 2 years ago

One question I did have is: is it is even possible to create dist_sample() objects that contain NAs? I wasn't able to directly

Not at the moment, but I was considering keeping NA values in sample distributions. Do you think preserving NAs in input and output for sample distributions is important, or remove them upfront with a warning as is currently done. Pinging @njtierney for thoughts on missing values in distributions.

mjskay commented 2 years ago

One possible argument in favor of keeping them is for cases where samples are correlated (e.g. in output from MCMC). Then you don't want to drop NAs because this will change the indices, and the index for each draw actually has some meaning (e.g. what MCMC iteration that draw came from). If you plan to support arithmetic on dist_samples by applying the operations elementwise to the underlying arrays, dropping NAs could then cause incorrect results.

njtierney commented 2 years ago

My thoughts on this are that it would be awesome to have distributions that can have missing values as valid values that could be drawn. My thoughts went to the idea of inflating a distribution with NA values, in a similar way to how there might be 0 inflated, there could be NA inflated distributions...but I feel like there might be unintended consequences of this?

mitchelloharawild commented 2 years ago

Sounds convincing to me. I've changed dist_sample() to allow missing values.

mjskay commented 2 years ago

In that case, I'd say quantile should not error on other distributions when passed na.rm, otherwise (unless I'm mistaken) something like quantile(c(dist_normal(), dist_sample(list(c(1:10, NA)))), 0.5, na.rm = TRUE) would fail?

mitchelloharawild commented 2 years ago

Yes, I agree that na.rm should be handled. Sadly I think this requires all distribution functions to have a na.rm argument that is ignored.

mjskay commented 2 years ago

Heh. You know it's a mature R API when you're forced to pass annoying parameters around

mitchelloharawild commented 2 years ago

Thanks for the suggestion - turns out in most situations it doesn't make sense to pass ... through to p/d/q/r functions (and in most methods I'm not doing it anyway). So there were only a few methods that needed updating for this.

mjskay commented 2 years ago

Thanks for the fix! On the latest version (0.3.0) this is working for me except on dist_mixture(), which sometimes calls down to cdf() and then fails. E.g.:

library(distributional)
x = dist_mixture(dist_normal(0), dist_normal(1), weights = c(0.5, 0.5))
quantile(x, 0.5, na.rm = TRUE)
#> Error: 1 components of `...` were not used.
#> 
#> We detected these problematic arguments:
#> * `na.rm`
#> 
#> Did you misspecify an argument?
#> Error: 1 components of `...` were not used.
#> 
#> We detected these problematic arguments:
#> * `na.rm`
#> 
#> Did you misspecify an argument?

Created on 2022-01-11 by the reprex package (v2.0.1)

Should cdf() also now support na.rm, given the support for NAs in dist_sample()?

mitchelloharawild commented 2 years ago

Hmm, this can occur with all uses of cdf(), and a few other functions like likelihood(). Yes cdf() should support na.rm, and it does. However it does by ignoring na.rm in ... for most methods, so ellipsis isn't pleased. Simple solution would be to drop ellipsis from the generics - if people want to check dots for usage, they can wrap their own function around the generic.

mitchelloharawild commented 2 years ago

Fixed in 646e9df8923feaad1c399f245f16392c94cc76fe

library(distributional)
x = dist_mixture(dist_normal(0), dist_normal(1), weights = c(0.5, 0.5))
quantile(x, 0.5, na.rm = TRUE)
#> [1] 0.5
cdf(dist_normal(), 1, na.rm=FALSE)
#> [1] 0.8413447

Created on 2022-01-12 by the reprex package (v2.0.0)

mjskay commented 2 years ago

thanks for the quick fix!