stephenslab / ebnm

R package to fit Empirical Bayes Normal Means model.
https://stephenslab.github.io/ebnm/
12 stars 9 forks source link

Add methods, accessor functions #85

Closed willwerscheid closed 1 year ago

willwerscheid commented 1 year ago

Address the following editor's comments:

Available methods for "ebnm" are:

methods(class = "ebnm")

[1] plot print

see '?methods' for accessing help and source code

This suggests that the set of available methods should be expanded. E.g., a summary or logLik method as well as accessor functions for example for the posterior summaries would seem appropriate. It might also seem preferable to be able to inspect the fitted functions G and compare them across different prior families.

We should add: summary (similar to what we already have for print) call logLik g fitted (for posterior means)? lfsr sampler

do we also need something for posterior sds?

stephens999 commented 1 year ago

maybe loglik instead of logLik since camel case is not our standard?

stephens999 commented 1 year ago

isn't ebnm$loglik , ebnm$posterior good enough for accessing these members?

willwerscheid commented 1 year ago

logLik is a stats function that is a method for lm, glm, multinom, etc, and is what people are used to I think

willwerscheid commented 1 year ago

I agree with you that accessors are pretty redundant here, but the editor (among other people I think) seems to have a stylistic preference for them, and there is the advantage that they show up in help(package = "ebnm"). It's not much work to include them so we might as well.

pcarbo commented 1 year ago

I agree with the editor that following R conventions makes the interface more accessible for users (e.g., coef, logLik, predict, fiitted). Some other standard S3 methods one could provide include: nobs, family, resid, residuals, confint, print, summary, and of course "plot".

willwerscheid commented 1 year ago

thanks @pcarbo . nobs would be straightforward. we have done print and plot already. predict might be interesting... just fix g from a previous object and run on new data...? so in sum I am thinking to implement:

"familiar" methods: summary logLik fitted (= posterior means) nobs predict

I am on the fence about the accessors. they are redundant as @stephens999 mentioned but there are other issues. lfsr doesn't make sense for some prior families and is not returned by default. sampler would be very awkward in that it returns a function so the syntax would look like sampler(ebnm.res)(nsamp = 10). not great. g is not very clear and should probably not be the main object of interest anyway. I think we should do more work in the paper to de-emphasize the importance of g and highlight posterior summaries instead.

pcarbo commented 1 year ago

Methods are essentially accessors.

In this case I don't see much added value in accessors for non-standard quantities such as lfsr.

Your "print" method should be replaced by a "summary" method, then you would implement "print.summary.ebnm". See for example this. Then you don't need a print method.

willwerscheid commented 1 year ago

@pcarbo here is a crack at what the print method will look like. What do you think?

> summary(pn.res)
Call:
ebnm_point_normal(x = x, s = s)

EBNM model was fitted to 1000 observations with _heteroskedastic_ standard errors.

The fitted prior is of class _normalmix_.
It includes a point mass at 0 with component weight equal to 0.56.

The likelihood for the model is -3091.58.
2 degrees of freedom were used to estimate the model.

Available posterior summaries: _mean, sd_.
A posterior sampler for the model _is not available_.
pcarbo commented 1 year ago

@willwerscheid Can you commit the changes to a dev branch and I'll have a look?

willwerscheid commented 1 year ago

@pcarbo The changes are in the main branch on my fork

pcarbo commented 1 year ago

I added "confint" to the list of S3 methods above.

pcarbo commented 1 year ago

I ran this:

> set.seed(1)
> theta <- c(rep(0,100),rexp(100))
> s <- 1
> x <- theta + rnorm(200,0,s)
> fit <- ebnm_point_normal(x,s)
> out <- summary(fit)
> str(out)
List of 9
 $ nobs               : int 200
 $ heteroskedastic    : logi FALSE
 $ prior_class        : chr "normalmix"
 $ pointmass_location : num 0
 $ pointmass_weight   : num 0.659
 $ log_likelihood     :Class 'logLik' : -355 (df=2)
 $ posterior_summaries: chr [1:2] "mean" "sd"
 $ sampler_included   : logi FALSE
 $ call               : language ebnm_point_normal(x = x, s = s)
 - attr(*, "class")= chr [1:2] "summary.ebnm" "list"
> print(out)
Call:
ebnm_point_normal(x = x, s = s)
EBNM model was fitted to 200 observations with _homoskedastic_ standard errors.
The fitted prior is of class _normalmix_.
It includes a point mass at 0 with component weight equal to 0.66.
2 degrees of freedom were used to estimate the model.
The likelihood is -355.47.
Available posterior summaries: mean, sd.
A posterior sampler is _not_ available.

Very nice so far.

My only comment is that it would be good to give code showing how to extract the posterior summaries mentioned.

willwerscheid commented 1 year ago

Thanks @pcarbo . I added instructions like you suggested. Here are the methods I have now:

summary and print, which you've already seen plot has been changed to handle multiple ebnm results; still plots x vs. \hat{\theta} (posterior means) nobs is length(x) logLik extracts the log likelihood fitted returns posterior means; there is a parameter means_only that can be set to FALSE to return a data frame with posterior means, sds, and lfsrs predict fixes g from an EBNM fit and uses it to get estimates for a new data set (x, s). It returns an ebnm object samp is a convenience function to get a sample using the posterior sampler confint constructs confidence intervals for \theta_i using the posterior sampler

Questions:

  1. Does the way I implemented fitted make sense? Is there a more traditional way to return sds along with the point estimates? I thought about using coef and vcov but the parameters here are really the parameters defining g.
  2. What about confint? confint is usually to get confidence intervals for the parameters returned by coef so I am kind of hijacking it for a difference purpose... does that seem ok?
  3. By default, should I return the entire ebnm object for predict or would it best to just return the posterior data frame (means, sds, lfsrs)? There is a parameter output so you would always have the choice but what is the more expected output?
pcarbo commented 1 year ago

A few comments:

In the normal-means problem, arguably there is no distinction between the parameter estimation and the prediction, which is why coef and fitted can do the same thing.

willwerscheid commented 1 year ago

Sounds good. I will make those changes, including changing samp to simulate. We can implement resid as x - \hat{\theta}. It is a little weird to think of those quantities as "residuals" but I could see why someone might be interested in them.

For confint I am just outputting the (1 - level) / 2 and 1 - (1 - level) / 2 quantiles, which is how lm/glm does it. We could do HPD -- it would be a bit more work but it would probably be more useful. What about implementing confint as HPD and then also implementing the S3 method quantile for people who want the first type of CI?

How about we return the 2/3-column dataframe for fitted, posterior means for coef, and posterior variances for vcov?

pcarbo commented 1 year ago

How about we return the 2/3-column dataframe for fitted, posterior means for coef, and posterior variances for vcov?

Sure.

For confint I am just outputting the (1 - level) / 2 and 1 - (1 - level) / 2 quantiles…

Could this be the same as HPD? Maybe you can check? This is my code for computing HPD intervals:

# Compute the highest posterior density (HPD) interval from a vector
# of random draws from the distribution. See Chen & Shao (1999) for
# background on HPD intervals.
hpd <- function (x, conf.level) {
  n <- length(x)
  m <- round(n*(1 - conf.level))
  x <- sort(x)
  y <- x[seq(n-m+1,n)] - x[seq(1,m)]
  i <- which.min(y)
  return(c(x[i],x[n-m+i]))
}
willwerscheid commented 1 year ago

yes that is exactly how I was thinking the code would work. thanks -- i will reuse this

willwerscheid commented 1 year ago

never mind, ignore deleted comment

willwerscheid commented 1 year ago

@pcarbo I think that I'd like to add an accessor function for "fitted_g" since there are S3 methods for every other field in the "ebnm" object. What would be the best way to do this? Should it be a S3 method or not? And are there naming conventions? For example would get_prior or prior be better (or just g)? I could not find any existing S3 methods that made sense for this. rstanarm has a prior_summary S3 method but "summary" might be a bit misleading.

pcarbo commented 1 year ago

I don't know of any convention. I'm not sure an S3 method is needed. If you output g in the summary function, perhaps that is sufficient?