Stan 2.33: Move IO munging to external package, refactors

WardBrian commented 1 year ago

Submission Checklist

[x] Run unit tests
[x] Declare copyright holder and open-source license: see below

Summary

IO changes for 2.33 (tuple support).

use https://github.com/WardBrian/stanio (tbd: will this move under stan-dev?)
- support writing tuples to json
- support reading/extracting tuples from draws array
breaking changes:
- internals of InferenceMetadata changed (not technically part of our public API, but it was mentioned some places in our docs)
- VB's .stan_variable function now returns the varational draws, not the mean approximation (same as #652)
- MLE and VB always return a np.ndarray from .stan_variable, never just a Python float

Copyright and Licensing

Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company):

Simons Foundation

By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:

Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)

codecov-commenter commented 1 year ago

Codecov Report

Merging #681 (7945536) into develop (107a347) will decrease coverage by 0.60%. The diff coverage is n/a.

@@             Coverage Diff             @@
##           develop     #681      +/-   ##
===========================================
- Coverage    80.60%   80.01%   -0.60%     
===========================================
  Files           72       72              
  Lines        11292    10951     -341     
===========================================
- Hits          9102     8762     -340     
+ Misses        2190     2189       -1

see 40 files with indirect coverage changes

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

WardBrian commented 1 year ago

@mitzimorris - I think this is ready to review

It is probably hard to review without simultaneously "reviewing" stanio

WardBrian commented 1 year ago

The breaking change is also the different return type of stan_variable for optimize and variational

mitzimorris commented 1 year ago

I took a look at the FBProphet code, and it doesn't use stan_variable - although it runs optimization, it's pulling things back as a np array. ~~I'm willing to go with a breaking change.~~ the other place to check is ArviZ. checking now

WardBrian commented 1 year ago

I don't think arviz touches anything other than CmdStanMCMC

mitzimorris commented 1 year ago

regarding moving stanio under stan-dev - should we create repo stanio with subdirs according to language - python, r, c++?

WardBrian commented 1 year ago

I don't know enough about the R ecosystem to know if they need this - I think rvars/posterior cover a lot of the existing use-cases, but of course tuples change things.

C++ I think it would be very difficult to write something like this, since the return type of the extracting functions depends on the data

mitzimorris commented 1 year ago

I think it's OK to make this a repo under standev just for the python io used by CmdStanPy and BridgeStan and future Python packages.

tillahoffmann commented 1 year ago

Would it make sense to add a dtype argument somewhere in the data loading, e.g., as an extra argument to CmdStanModel.{sample,optimize,...} to specify the dtype returned by stan_variable? This could be useful for reducing the memory footprint of larger models. For example, to evaluate WAIC for the election88 dataset of CBS News polls, the log_lik variable has num_draws_sampling * chains * 11565 elements. Using cmdstanpy defaults that amounts to ~ 740 MB or ~ 1.5 GB if we include the logits for the binary outcome. Not enormous, but my laptop wasn't too happy when I compared several models.

This may not be the right thread, but thought I'd mention it in case it affects the data munging.

WardBrian commented 1 year ago

@tillahoffmann I think that's worth its own issue. It is sort of orthogonal to this issue, since the transformations we apply here are agnostic to what the dtype of the draws are. If we changed what we were reading it from the files, these should propagate nicely

stan-dev / cmdstanpy