coding sprint plan 2020

goldingn commented 4 years ago

I'm just starting out on a two-month (with some interruptions) coding sprint on greta. There are a bunch of bugfixes, half-finished (and some half-baked) features, and some required refactoring that I have been putting off because 2019 was a pretty busy year. I'm hoping to make some headway on some of these major issues and features I've been promising for a while.

I'll keep chipping away at things on this list, which I will modify as I go. The list is just here for me to keep track of things to do, and for others to follow along and comment if there are features they are particularly interested in. The items are listed neither in order of priority nor in the order in which I intend to work on them. I'll release new versions as I go, but I'm not sure yet which features and fixes each release will contain.

During this work, I'll also be trying to keep on top of other issues that come up, and of the forum, which I've been neglecting over the Christmas break. Now's a good time to ask questions over there :)

1. Bugfixes & misc

This is a subset of the open issues that I'm particularly keen to fix, and have a plan for. This doesn't mean the other open issues aren't important as well, and I'll try to get to some of those too.

[ ] 1.1 Fix thinning (to do after 2.3) #318
[ ] 1.2 Speed up subsetting #309
[x] 1.3 Array-scalar dispatch #298
[ ] 1.4 Automated decentring of distributions (requires delayed graph definition or extended representations interface, also applies to greta.gp for Gaussian GPs) #47
[x] 1.5 Fix/exclude linting issues
[x] 1.6 Enable memory-safe (batched) prediction in calculate() #236
[x] 1.7 Make the output of mcmc() have its own class, inheriting from coda's mcmc class, and provide methods for post-hoc windowing, thinning etc.
[x] 1.8 Fix silently erroring primitive functions #317
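
As a rough illustration of what "memory-safe (batched) prediction" in 1.6 means, here is a minimal sketch in Python. It is not greta's implementation or API: `predict_in_batches`, `predict` and `batch_size` are hypothetical names chosen for this example. The idea is only that posterior draws are pushed through the prediction step a slice at a time, so peak memory depends on the batch size rather than on the total number of draws.

```python
from typing import Callable, List, Sequence


def predict_in_batches(
    draws: Sequence[float],
    predict: Callable[[Sequence[float]], List[float]],
    batch_size: int = 100,
) -> List[float]:
    """Apply `predict` to `draws` one batch at a time and concatenate the results.

    All names here are hypothetical; this only illustrates the batching idea.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    results: List[float] = []
    for start in range(0, len(draws), batch_size):
        # Only this slice of the draws is handed to the prediction step.
        batch = draws[start:start + batch_size]
        results.extend(predict(batch))
    return results
```

The result is the same as calling `predict` once on all draws, provided `predict` treats each draw independently.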

2. Compatibility

A number of things have changed in the interfaces to TF and TFP (particularly in moving to TF 2.0). greta currently still works with the compatibility functions, but some refactoring is needed to fully support these.

[x] 2.1 Refactor distributions code to be closer to TFP (also enables greta.hmm to use the TFP hmm functionality)
[ ] 2.2 Refactor internals to use TF functions, not graphs, for full compatibility with TF 2.0
[ ] 2.3 Use TFP sampler adaptation (with progress bar updating)

3. Marginalisation

There's an incomplete branch feature/marginalise that implements an interface for marginalisation of discrete random variables in a greta model. There's some work towards marginalisation of a priori multivariate normal variables via the Laplace approximation too. This all needs polishing up, thorough testing, documenting and releasing.

[ ] 3.1 Implement marginalisation interface #157
[ ] 3.2 Check discrete marginalisation
[ ] 3.3 Check and fix Laplace marginalisation
[ ] 3.4 Plan general-purpose variational marginalisation
[ ] 3.5 Write documentation that explains what this feature is, since it will be unfamiliar to most users of MCMC software.
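
For readers unfamiliar with the idea, marginalising a discrete variable means summing it out of the joint density so that only continuous parameters remain to be sampled. A minimal sketch in Python, assuming a finite mixture of normals with a shared standard deviation (the model and every function name below are illustrative assumptions, not the interface on the feature/marginalise branch):

```python
import math
from typing import Sequence


def normal_logpdf(y: float, mu: float, sigma: float) -> float:
    """Log density of a Normal(mu, sigma) distribution at y."""
    z = (y - mu) / sigma
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * z * z


def log_sum_exp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v))) over a non-empty sequence."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def marginal_logpdf(
    y: float, weights: Sequence[float], mus: Sequence[float], sigma: float
) -> float:
    """log p(y) with the discrete component label k summed out:

    p(y) = sum_k weights[k] * Normal(y | mus[k], sigma)
    Weights are assumed to be strictly positive and to sum to one.
    """
    terms = [
        math.log(w) + normal_logpdf(y, mu, sigma)
        for w, mu in zip(weights, mus)
    ]
    return log_sum_exp(terms)
```

The resulting log density no longer mentions the discrete label, so it can be used with gradient-based samplers that only handle continuous parameters.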

4. Sampling discrete variables

This has been on the to-do list for a long time. It will require a bit of refactoring and redesigning internals, but there's nothing about sampling of discrete random variables that should be particularly tricky to implement.

[ ] 4.1 Implement discrete variables
[ ] 4.2 Implement discrete-only samplers
[ ] 4.3 Implement Gibbs sampling between discrete and continuous parameter spaces.

5. Simulation

Random independent sampling from a model object, optionally conditional on fixed values or posterior samples, is a much-requested feature that needs a surprising amount of engineering in the background, and careful thinking about an intuitive interface. There's some existing work that just needs implementing, polishing up, testing and documenting, along with some examples of posterior predictive checks etc.

[x] 5.1 Implement simulation interface via calculate(), following the discussion and proposed interface here (#342)

6. Continuous integration & TF versions

greta versions are now being tied to specific releases of TF and TFP. I was trying for a while not to do this, because I believe it's best practice not to be overly prescriptive about dependencies. However, both TF and TFP are evolving fast and regularly introduce breaking changes. It would be good to catch these changes early with CI testing on the nightly releases of those dependencies.

[ ] 6.1 Set up a continuous integration grid that reflects the specific dependencies of all currently available and recent releases of greta
[ ] 6.2 Add a table (greta version vs. required TF and TFP versions) of badges to the readme with the test results for these versions
[ ] 6.3 Add an entry for the dev version of greta against both stable and nightly versions of TF and TFP

jeffreypullin commented 4 years ago

Hi Nick,

What an impressive and exciting list of features/fixes!

I just wanted to let you know that I have some time on my hands at the moment (at least until the start of March when I start my MSci at Melbourne) and would really like to contribute a bit more to greta.

Let me know if there is anything you would particularly like me to work on - happy to discuss!

Cheers, Jeffrey

goldingn commented 4 years ago

Hi Jeffrey, that would be great! Good timing.

A couple of things spring to mind:

I've done most of 1.7 above, implemented on a branch called greta_mcmc_list_class. That makes it possible to define custom methods for printing, plotting, and summarising the output of mcmc() (but otherwise falling back on coda's methods for mcmc.list objects). You mentioned a while ago that you thought it would be useful to have a print method, and possibly a plot method that provided more information to assess convergence of the model (e.g. R-hat statistics, effective sample size etc.), similarly to the output in rstan. If you were interested in putting something like that together as e.g. a print.greta_mcmc_list() function, that would be great.
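
To make the R-hat idea concrete, here is a minimal sketch of the classic (non-split) Gelman-Rubin statistic for a single parameter, written in Python purely for illustration. It is an assumption about what such a print method might report, not greta or coda code, and it is simpler than the rank-normalised split R-hat used by recent Stan tooling. The function name `rhat` is hypothetical.

```python
import math
from statistics import mean, variance
from typing import Sequence


def rhat(chains: Sequence[Sequence[float]]) -> float:
    """Classic potential scale reduction factor for one parameter.

    `chains` holds m >= 2 chains, each with the same number n >= 2 of draws.
    Within-chain variance must be non-zero.
    """
    m = len(chains)
    if m < 2:
        raise ValueError("need at least two chains")
    n = len(chains[0])
    if n < 2 or any(len(c) != n for c in chains):
        raise ValueError("chains must have equal length of at least 2")
    chain_means = [mean(c) for c in chains]
    # W: average of the within-chain sample variances.
    w = mean([variance(c) for c in chains])
    # B: between-chain variance, scaled by the chain length.
    b = n * variance(chain_means)
    # Pooled estimate of the posterior variance.
    var_plus = (n - 1) / n * w + b / n
    return math.sqrt(var_plus / w)
```

Values close to 1 suggest the chains agree with each other; values well above 1 indicate that the chains have not mixed.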

I think I'll focus next on tasks 2 and 5, which is something you looked at previously. With the way I'm planning to do that, it should be possible to use the TFP distribution objects more naturally, calling on their IID sampling methods as well as their log densities etc. There may well be some distributions (e.g. greta's 'mixture' and 'joint' distributions) that will need IID sampling algorithms coded up - we should keep in touch about that in the relevant issues.

Nick

goldingn commented 4 years ago

FYI, the branch mentioned above is now merged into master

jeffreypullin commented 4 years ago

Cool, I'll work first on the print, summary etc. methods then - I should be able to get to it sometime after Thursday.

goldingn commented 4 years ago

The simulation interface turned out to be a huge job, but it's merged into master now!

I don't know of any other statistical modelling software that lets you define the generative model once, and then enables IID sampling from the prior, sampling from the posterior, sampling of data conditional on the posterior (or on fixed values for parameters) and posterior prediction to new data. So I'm pretty pleased :)

lionel68 commented 4 years ago

I'm also interested in developing methods for greta_mcmc_list_class. @jeffreypullin, how far are you down this road? Would you need a hand?

jeffreypullin commented 4 years ago

Hi @lionel68,

Please feel free to take over the implementation - it's turned out that I haven't had as much time to work on greta as I had hoped...

I made some initial attempts but I think they are now out of date due to the simulation update etc.

I'd be happy to share ideas or otherwise collaborate if that would be helpful.

lionel68 commented 4 years ago

Concerning point 1.7 from above (class and methods for the mcmc object), the Stan people are putting together a package to standardize the output of Bayesian models: https://mc-stan.org/posterior/. Would it be an option to align the output of mcmc() with that format, to tap into the large resources that these guys are developing?
