Update case studies to use new language syntax

avehtari commented 1 year ago

With Stan 2.33+, several old language syntax features produce errors. It would be good to update all the case studies to use the latest syntax. Many case studies are in external repos, and the authors have submitted only the rendered html and a short md-part for the case study contents page. For those, only the html needs to be updated in users/documentation/case-studies/.

It would be good to contact the original authors and ask them if they are willing to update their repos and submit a new html. If the authors decline or don't respond, we may consider updating just the syntax in the html.

To start the process, I'm listing all the case studies here so we can track which have been fixed. Also tagging some authors who were easily found via GitHub ID autocomplete: @mitzimorris, @WardBrian, @bob-carpenter, @charlesm93, @bbbales2, @imadmali

[x] Bayesian Structural Equation Modeling using blavaan: Feng Ji, Xingyao Xiao, Aybolek Amanmyradova, Sophia Rabe-Hesketh
[x] Multilevel regression modeling with CmdStanPy and plotnine: Mitzi Morris
[x] HoloML in Stan: Low-photon Image Reconstruction: Brian Ward, Bob Carpenter, and David Barmherzig
[ ] Bayesian Latent Class Models and Handling of Label Switching: Feng Ji, Aybolek Amanmyradova, Sophia Rabe-Hesketh
[x] Bayesian model of planetary motion: exploring ideas for a modeling workflow: Charles Margossian and Andrew Gelman
[x] HMM Interface Example: Ben Bales
[ ] Spatial models for plant neighborhood dynamics in Stan: Cristina Barber, Andrii Zaiats, Cara Applestein and T. Trevor Caughlin
[ ] Predicting Engine Failure with Hierarchical Gaussian Process: Hyunji Moon, Jungin Choi
[x] Upgrading to the new ODE interface: Ben Bales, Sebastian Weber
[ ] Bayesian Workflow for disease transmission modeling in Stan: Leo Grinsztajn, Elizaveta Semenova, Charles C. Margossian, and Julien Riou
[ ] Reduce Sum Example: parallelization of a single chain across multiple cores: Ben Bales
[x] Stan Notebooks in the Cloud: Mitzi Morris
[ ] Model-based Inference for Causal Effects in Completely Randomized Experiments: JoonHo Lee, Avi Feller and Sophia Rabe-Hesketh
[ ] Tagging Basketball Events with HMM in Stan: Imad Ali
[x] Model building and expansion for golf putting: Andrew Gelman
[ ] A Dyadic Item Response Theory Model: Stan Case Study: Nicholas Sim, Brian Gin, Anders Skrondal and Sophia Rabe-Hesketh (note: source link points to fork of example-models)
[x] Multilevel Linear Models using Rstanarm: JoonHo Lee, Nicholas Sim, Feng Ji, and Sophia Rabe-Hesketh
[ ] Predator-Prey Population Dynamics: the Lotka-Volterra model in Stan: Bob Carpenter
[ ] Nearest neighbor Gaussian process (NNGP) models in Stan: Lu Zhang
[x] Extreme value analysis and user defined probability functions in Stan: Aki Vehtari
[ ] Modelling Loss Curves in Insurance with RStan: Mick Cooney
[ ] Splines in Stan: Milad Kharratzadeh
[x] Spatial Models in Stan: Intrinsic Auto-Regressive Models for Areal Data: Mitzi Morris
[x] The QR Decomposition for Regression Models: Michael Betancourt
[ ] Robust RStan Workflow: Michael Betancourt
[ ] Robust PyStan Workflow: Michael Betancourt (also uses PyStan 2 which is no longer supported)
[x] Typical Sets and the Curse of Dimensionality: Bob Carpenter
[ ] Diagnosing Biased Inference with Divergences: Michael Betancourt
[ ] Identifying Bayesian Mixture Models: Michael Betancourt
[x] How the Shape of a Weakly Informative Prior Affects Inferences: Michael Betancourt
[x] Exact Sparse CAR Models in Stan: Max Joseph
[ ] A Primer on Bayesian Multilevel Modeling using PyStan: Chris Fonnesbeck (also: rendered HTML was deleted?)
[ ] The Impact of Reparameterization on Point Estimates: Bob Carpenter
[ ] Hierarchical Two-Parameter Logistic Item Response Model: Daniel C. Furr
[ ] Rating Scale and Generalized Rating Scale Models with Latent Regression: Daniel C. Furr
[ ] Partial Credit and Generalized Partial Credit Models with Latent Regression: Daniel C. Furr
[ ] Rasch and Two-Parameter Logistic Item Response Models with Latent Regression: Daniel C. Furr
[ ] Two-Parameter Logistic Item Response Model: Daniel C. Furr, Seung Yeon Lee, Joon-Ho Lee, and Sophia Rabe-Hesketh
[ ] Cognitive Diagnosis Model: DINA model with independent attributes: Seung Yeon Lee
[ ] Pooling with Hierarchical Models for Repeated Binary Trials: Bob Carpenter
[ ] Multiple Species-Site Occupancy Model: Bob Carpenter
[ ] Soil Carbon Modeling with RStan: Bob Carpenter

avehtari commented 1 year ago

Tagging more authors @betanalpha, @danielcfurr, @hyunjimoon, @education-stan, @Cristinabarber, @joonho112, @LuZhangstat, @kaybenleroll, @milkha, @mbjoseph, @fonnesbeck

mitzimorris commented 1 year ago

in the interim, we could insert a paragraph at the top of the old case studies saying that the code is using the old syntax and instructing the reader to run the stanc canonicalizer on the code themselves.

exercises to the reader are less work than exercises to the author.

hyunjimoon commented 1 year ago

Just an idea, but wouldn't it be handy if ChatGPT could auto-translate old case studies from old syntax to new syntax (as in going from Python 2.7 to Python 3.10)? Python's https://docs.python.org/3/library/2to3.html seems to have hand-coded this translation.

mitzimorris commented 1 year ago

we don't need chatGPT.

please get the latest release of Stan, and then do (something like this)

> /path/to/cmdstan/bin/stanc --print-canonical my_file.stan > new.tmp
> diff -y -W 180 my_file.stan new.tmp
> mv new.tmp my_file.stan

that diff command will show files side-by-side - it's an easy way to check that stanc did the right thing and only the right thing.

update: for some reason the above procedure is adding an extra newline to files. @WardBrian does the canonicalizer always add a newline proactively to its output in case the input was missing one?
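A minimal sketch of how the three commands above might be applied to a whole directory of models. The function name `canonicalize_dir`, the `STANC` variable, and the `.new.tmp` suffix are hypothetical; the only stanc option used is the `--print-canonical` flag quoted above. The final `mv` is deliberately left to the reviewer so nothing is overwritten before the diff has been read.

```shell
#!/usr/bin/env bash
# Sketch only: STANC is an assumed variable pointing at the stanc binary.
STANC="${STANC:-/path/to/cmdstan/bin/stanc}"

# Canonicalize every .stan file in a directory and show what would change.
canonicalize_dir() {
  local dir="$1" f tmp
  for f in "$dir"/*.stan; do
    [ -e "$f" ] || continue                 # glob matched nothing
    tmp="$f.new.tmp"
    if ! "$STANC" --print-canonical "$f" > "$tmp"; then
      echo "stanc failed on $f; leaving it untouched" >&2
      rm -f "$tmp"
      continue
    fi
    if cmp -s "$f" "$tmp"; then
      rm -f "$tmp"                          # output identical to input
    else
      diff -y -W 180 "$f" "$tmp" || true    # side-by-side view for review
      echo "review the diff, then run: mv $tmp $f"
    fi
  done
}
```

Usage would be something like `canonicalize_dir path/to/models`. Files that stanc cannot parse are reported on stderr and skipped rather than replaced with empty output.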

jgabry commented 1 year ago

> in the interim, we could insert a paragraph at the top of the old case studies saying that the code is using the old syntax and instructing the reader to run the stanc canonicalizer on the code themselves.

Yeah this sounds like a good idea until these are updated.

> exercises to the reader are less work than exercises to the author.

Exercises to the author require doing once and all readers benefit. Exercises to the reader require doing N_readers times. So the latter requires a lot more work overall, just less work for the author. Or am I misunderstanding what you meant?

> that diff command will show files side-by-side - it's an easy way to check that stanc did the right thing and only the right thing.

Nice!

WardBrian commented 1 year ago

I manually went through the ones which were unclear and figured out whether they needed updating. That brings the total up to 11/42 being good to go: either because they used the new syntax, didn't use any of the old syntax, or (in a few cases) contained no actual Stan code in the text of the case study.

It's also worth noting that any case study which stored its code in the example-models repo had its code automatically updated a while back. If any of those case studies are using something like writeLines(readLines("model.stan")), then the only work that actually needs to be done is re-knitting. More than a few seem to store the code in a string or text block in the markdown, however.
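A rough sketch of how one might triage Rmd sources along these lines. `classify_rmd` is a hypothetical helper, and both grep patterns are assumptions about how Stan code typically shows up in an Rmd file (block keywords such as `parameters {` for inline code, a `readLines(...)` call on a `.stan` file for external code). It is a heuristic and will misclassify some files.

```shell
#!/usr/bin/env bash
# Heuristic triage of one Rmd file; prints inline, external, or unknown.
classify_rmd() {
  local f="$1"
  # Stan block keywords followed by "{" suggest code embedded in the Rmd itself.
  if grep -Eq '(^|[^[:alnum:]_.])(data|parameters|model)[[:space:]]*\{' "$f"; then
    echo "inline"      # Stan code needs editing in the Rmd
  # A readLines(...) call on a .stan file suggests the code lives outside the Rmd.
  elif grep -Eq 'readLines\(.*\.stan' "$f"; then
    echo "external"    # likely only needs re-knitting
  else
    echo "unknown"     # check by hand
  fi
}
```

For example, `for f in */*.Rmd; do echo "$f: $(classify_rmd "$f")"; done` would list each file with its guess. Anything reported as unknown, and ideally everything else too, still deserves a manual look.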

bob-carpenter commented 1 year ago

@hyunjimoon : It's going from the old Stan syntax to the new Stan syntax. ChatGPT(4) is pretty good at Python, but it's very bad at Stan.

bob-carpenter commented 1 year ago

If we keep our User's Guide, Reference Manual, and Functions Reference up to date, I don't think breaking the old case studies should block any of our updates. Specifically, I'm OK putting a warning up and then fixing them as we can. Another alternative is moving the ones that aren't updated to a "deprecated case study" location and flagging them up front.

I can update the five of my case studies that weren't built with the new Stan syntax:

Predator-Prey Population Dynamics: the Lotka-Volterra model in Stan: Bob Carpenter
Pooling with Hierarchical Models for Repeated Binary Trials: Bob Carpenter
The Impact of Reparameterization on Point Estimates: Bob Carpenter
Multiple Species-Site Occupancy Model: Bob Carpenter
Soil Carbon Modeling with RStan: Bob Carpenter

jgabry commented 1 year ago

> If we keep our User's Guide, Reference Manual, and Functions Reference up to date, I don't think breaking the old case studies should block any of our updates. Specifically, I'm OK putting a warning up and then fixing them as we can. Another alternative is moving the ones that aren't updated to a "deprecated case study" location and flagging them up front.

I agree that we shouldn't hold up Stan releases just because they break case studies. A warning about it would be good. Right now the website says:

The case studies on this page are intended to reflect best practices in Bayesian methodology and Stan programming

which is a bit unfortunate since best practices would include code that doesn't error.

What if we change the note at the top to say this?

The case studies on this page are intended to reflect best practices in Bayesian methodology and Stan programming. We aim to keep them current with the latest version of the Stan language, but there may be times when case studies need updating to reflect the latest Stan features and syntax.

That could probably be worded better, but something along those lines?

bob-carpenter commented 1 year ago

That wording sounds good. Did we want to point people to the Stan code updater in stanc3?

jgabry commented 1 year ago

> Did we want to point people to the Stan code updater in stanc3?

The only reason I'd hesitate to do that is that on Slack @WardBrian mentioned that in future versions (2.34 and beyond) we won't be able to parse and fix the old code anymore. But maybe that's not a reason to avoid mentioning it. Once we get to future breaking changes, it will be those changes that need fixing, not the array syntax, so I guess the auto-formatter/canonicalizer will work just fine for whatever syntax needs changing at that point.

jgabry commented 1 year ago

I opened PR https://github.com/stan-dev/stan-dev.github.io/pull/191 to add the disclaimer at the top of the case studies page. I didn't mention the auto-formatter/canonicalizer but I can update it to mention it if we want that. (It is accessed differently in the different interfaces, so we'd have to decide whether to just mention it exists or actually demo how to use it in the different interfaces.)

jgabry commented 1 year ago

Is the process for updating the ones in example-models repo the following?

edit any Stan code in the Rmd file (separate Stan files are already up to date)
regenerate the html
submit the Rmd PR to the example-models repo
submit the html PR to the website repo

(I just did this for the HMM interface example case study, but I can update my PRs if this process isn't right)

WardBrian commented 1 year ago

Yep, sounds right to me. I have just updated the new ODE and golf case studies like this

stan-dev / stan-dev.github.io

Update case studies to use new language syntax #189