Closed sylee30 closed 7 years ago
Required that I load:
I tried to run this and got
> render("dina_independent.Rmd")
processing file: dina_independent.Rmd
[knitr chunk-processing progress bar elided; rendering reached 51% before quitting]
label: sim_plot_indep (with options)
List of 4
$ cache : logi TRUE
$ tidy : logi TRUE
$ fig.height: num 8
$ fig.cap : chr "Discrepancies between estimated and generating parameters for the simulation. Points indicate the difference between the poster"| __truncated__
Quitting from lines 269-296 (dina_independent.Rmd)
Error in `[.data.frame`(sim_summary, wanted_pars, c("mean", "2.5%", "97.5%")) :
undefined columns selected
We need to make sure things run from a clean start of R.
Overall this looks great.
I couldn't run it, but I'll comment on the generated HTML.
Feel free to link your name to a web page or other resource
I find all the bold face for package names etc. distracting, but that's up to you
It needs to be updated to RStan 2.14.1 --- there was a bug in 2.12 that led to only slightly biased fits, but a bug nonetheless
I'm not keen on defining arithmetic over booleans, like your xi (which I always confuse with zeta, but that's my problem)
it helps readability if you put spaces around the vbar in densities, \, | \,
we generally don't try to decorate vectors by making them bold, caps, etc. it just doesn't scale to higher order structures and again makes the formulas distracting because the bold jumps out so much
I'd just make references to the manual by chapter---there's a whole chapter on marginalizing latent discrete parameters plus the specialized mixture chapter; the page numbers change too quickly
I like Pr[...] with square brackets to indicate it's a functional, but feel free to leave it; Andrew doesn't distinguish these even for expectations or KL-divergence
target += isn't just for mixtures --- it's just a way to increment the underlying log density accumulator
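A minimal sketch (not from the case study) of the two equivalent ways to add a term to the accumulator:

```stan
model {
  // sampling statement: drops constant terms from the log density
  y ~ normal(mu, sigma);

  // equivalent explicit increment, which keeps all the constants:
  // target += normal_lpdf(y | mu, sigma);
}
```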
"fully Bayesian" shouldn't be in quotes! In fact, you don't even need to say "fully Bayesian" --- we just want to generate draws from the posterior, which will give us full Bayes for inference
you say "given in dina_nostructure.stan" but you might want to mention that's in the repository; does it correspond to something you print into the case study at some point?
we use two-space indents (never tabs) and always put a space before {
and always put spaces after commas and around arithmetic operators. There's a style guide at the end of the manual if you'd like to follow it. We're going through and refactoring all of our examples soon to bring them up to current practices. Sorry the examples are so bad now.
we now have compound declare-define, so this
vector[C] log_nu;
log_nu = log(nu);
can become
vector[C] log_nu = log(nu);
It might be more numerically stable, and require less code duplication, to report the probabilities at the end on the log scale; then a lot of that code in the middle can be turned into a function
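As a hedged sketch of what that might look like (the function name and the slip/guess parameterization here are assumptions, not the case study's actual code), the per-item probability can be wrapped in a function and kept on the log scale:

```stan
functions {
  // log probability of a correct response under DINA-style
  // slip/guess parameters, given the ideal-response indicator xi
  real log_item_prob(real slip, real guess, int xi) {
    return xi == 1 ? log1m(slip) : log(guess);
  }
}
```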
The declaration real pi should be brought into the loop and used with compound declare-define; there's no benefit to declaring scalars up top
There's a lot of duplicated work being done because the inner loop recomputes pi in a way that doesn't depend on the outer loop j. So it'd be more efficient to store an I x C matrix of values on the outside and then reuse it as you loop over J
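A sketch of that refactoring (variable names are assumed, not taken from the case study): compute the item-by-class probabilities once, then reuse them inside the loop over persons.

```stan
transformed parameters {
  matrix[I, C] pi;  // success probability for item i under class c
  for (i in 1:I)
    for (c in 1:C)
      // xi[i, c] is the ideal-response indicator; slip/guess assumed
      pi[i, c] = xi[i, c] == 1 ? 1 - slip[i] : guess[i];
}
```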
It doesn't look like you ever fit the model in section 1.4 (no structure for \nu_c)
You might want to mention that you have an implicit uniform distribution on the simplex variable nu.
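One way to make that explicit (optional, since Stan already treats a simplex-constrained parameter with no sampling statement as uniform over the simplex):

```stan
model {
  // uniform over the simplex; equivalent to leaving nu without a prior
  nu ~ dirichlet(rep_vector(1, C));
}
```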
I'm glad you used a modest number of iterations --- looks like a plenty high effective sample size
We tend to plot 50% intervals for posterior calibration checks because they're more robust. Each end of a 95% interval is a 2.5% tail quantile, and the effective sample size in the tails is much lower than for the bulk of the posterior, so you need a lot more draws to characterize a 95% interval accurately. Also, with 50% intervals, about 50% of the true values should fall in the intervals (the count of hits should be distributed binomial(N, 0.5), where N is the number of parameters, so you could even set up a hypothesis test, but I wouldn't go that far for this case study). So it's not that ideally they should all contain zero; rather, about 95% of them should, but that's very low variance and thus not as powerful a calibration test as 50% intervals. You got 3 out of 40 missing at 95%, where the expected number of misses is 2. So you see it's hard to calibrate here. You can leave it if you want, but just make sure to say that you expect binomial(40, 0.95) of the intervals to contain the true value and you got 37.
For the attribute mastery at the end, I think you want to compare before taking means. But means will commute, so maybe it's OK doing it your way. We tend to set this up as event probabilities, with something like Pr[theta[1] > theta[2] | y], where y is the data and theta[1] and theta[2] are the two parameters to compare. You set that up with an indicator variable
generated quantities {
  int theta1_gt_theta2 = theta[1] > theta[2];
}
and then the posterior mean is the event probabilities (or as the probability theorists like to say, event probabilities are expectations of indicator functions).
I like the sensitivity and specificity measures, but they reduce something probabilistic to a hard decision; we could also evaluate the log probabilities directly, but that's hard to interpret without information-theoretic measures that aren't worth sidetracking into for this case study
You can use probs = c(0.05, 0.5, 0.95) to report just the median and 90% intervals. It cleans up the output a lot. We'll be changing the defaults to do that soon.
You can also compare Stan's MLE, which you get from optimizing. But it'll be a penalized MLE if you use priors. You can run Stan without priors as long as the posterior is proper; even that's not required for a penalized MLE.
I think your diagnosis of the outlier is appropriate
Anyway, if you can get it set up so I can run it end to end, I'll leave it up to you how much of the rest you want to update, then we can add it to the site. Sorry it took so long to get to this---the last two weeks have been rather busy here at Columbia.
@sylee30 : Is this ready to be added? Before we can merge something, we need a clear copyright statement along with an open-source license for the contributed code.
Sorry for the late response. I will send a request soon. Thanks!
On Wed, Apr 12, 2017 at 10:36 AM, Bob Carpenter notifications@github.com wrote:
@sylee30 https://github.com/sylee30 : Is this ready to be added? Before we can merge something, we need a clear copyright statement along with an open-source license for the contributed code.
DINA case study has been revised and should be working now.
I found where Michael stashed the case studies---under the path users/documentation; the link was OK, but that one didn't get moved.
On Jun 27, 2017, at 6:49 AM, Seung Yeon Lee notifications@github.com wrote:
DINA case study has been revised and should be working now.
The source for the DINA model case study