rBatt / trawl

Analysis of scientific trawl surveys of bottom-dwelling marine organisms

Is the model right? Verification, not validation #103

Closed rBatt closed 8 years ago

rBatt commented 8 years ago

I want to know if the current Stan model "works". With the old JAGS model, I assumed it worked from the perspective that it was correctly programmed and implemented the conceptual model. I.e., I assumed the model was verified. What I was doing was model validation, by simulating varying degrees of real-world problems and seeing how well the model performed.

I am having trouble verifying the Stan model. I've been trying to repurpose my old simulations to see if it can recover parameters, and the results are very disappointing. I never actually verified the JAGS model, so I'm wondering if maybe this could be a problem there, too.

Currently I'm thinking that the model structure is just really tough to fit. We're trying to fit a quadratic curve in logit space; that's a lot of bending. Further, if you think of each point along the inverse-logit curve as a proportion, you could ask, "How many binary observations does it take to adequately estimate a proportion?" Then, whatever your answer to that is, you have to multiply it by the number of proportions along the curve that you would need to know to pin the curve down.

For a proportion, we can define the number of samples required to estimate it within a desired margin of error:

n = (z_star / (2M))^2 - 4

That is for when the true probability is at 0.5, which is the worst-case scenario. A more general form might choose to move the 0.5 out of the exponent, and instead do a p*(1-p).

In R code:

```r
# Samples needed to estimate a proportion p to within margin of error M at
# confidence level z_star; the trailing -4 looks like a plus-four-style
# adjustment carried over from the formula above
sampSize <- function(M=0.1, z_star=1.96, p=0.5){ (z_star/M)^2 * p*(1-p) - 4 }
```
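As a quick sanity check on the function (assuming the form above), the defaults reproduce the ~100 figure:

```r
sampSize()          # ~92 samples at the worst case, p = 0.5
sampSize(M = 0.05)  # ~380: halving the margin of error roughly quadruples n
sampSize(p = 0.25)  # ~68: fewer samples needed away from p = 0.5
```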

That's a simplistic equation, and I've seen variations on it that I don't feel like comparing rigorously. It also only applies to an intercept-only model. It suggests that a sample size of ~100 is needed per categorical/intercept term. Let's be naïve and assume it's 100 per predictor/parameter (excluding residual variance). Let's also ignore detectability and hierarchy (because we are doing fine on the detectability sub-model for the most part, and because we gain data proportionally as we add hierarchical species parameters ... it's actually a net gain in samples per parameter because of the prior sharing). Under those assumptions, we'd need at least 300 samples for the b0 + b1*x + b2*x^2 model. That also assumes the covariates are orthogonal, which they obviously are not.

Beyond the sample size issue, I'm wondering how challenging it is to fit that quadratic coefficient. Maybe tougher than I thought. In particular, there's a really sharp transition in the response curves when the signs of some of these parameters switch. And I've been simulating the quadratic term as being really close to 0, which may itself be causing problems.
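To see how flat the curve gets as the quadratic term approaches 0, here is a minimal sketch in base R (the b0/b1 values are hypothetical, not the simulation's):

```r
# Quadratic-in-logit occupancy curve: psi = plogis(b0 + b1*x + b2*x^2).
# Sweep the quadratic coefficient b2 through a sign change around 0.
x   <- seq(-3, 3, length.out = 200)
b0  <- 0; b1 <- 0.5
b2s <- c(-0.5, -0.05, 0, 0.05, 0.5)
psi <- sapply(b2s, function(b2) plogis(b0 + b1*x + b2*x^2))
matplot(x, psi, type = "l", lty = 1, ylab = "Pr(occupancy)",
        main = "Response curves as the quadratic coefficient crosses 0")
legend("topleft", legend = paste0("b2 = ", b2s), col = 1:5, lty = 1, bty = "n")
```

The curves with b2 = ±0.05 are nearly indistinguishable from the b2 = 0 line, which is exactly the regime the simulations have been in.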


How to Verify (Steps to Try)

First Steps (easiest)

If fiddling with simulation terms doesn't work

If any of these works, we just need to constrain the parameter space ... the model simply did not find the optimum.

If Model Fails with Adjusted Priors

If there are other solid suggestions, I'll take them into consideration. But I don't want to pursue arbitrary suggestions when the time it takes to execute any one of them could be large. I'm looking for a solution that won't require much modification of the core model. My hunch is that the problem is in the data, not the model.

rBatt commented 8 years ago

Dropping the quadratic coefficient and using constant predictors doesn't seem to solve the problem.

rBatt commented 8 years ago

Changed the linear coefficient to have a parent mean of 0.75 with an sd of 0.01 among species. Put in a super-informative prior for the mean and the standard deviation of the hierarchical distribution of alpha[2,] (the linear coefficient).

Previously the model was getting the intercept term in alpha (alpha[1,]) almost exactly correct, but was failing on the coefficient. With the super-informative prior, it massively overestimates the intercept but gets the linear coefficient correct. This is bizarre behavior. My hunch is that by focusing on alpha, I'll figure out what's wrong.

I'm going to check and make sure the data input format is what I think it is, and a couple basic things like that, as I think of them. Otherwise, I think this is pointing to the Stan model probably being wrong.

What's weird is that the parameter it's getting wrong is carried around in a matrix with the parameter that's being fit correctly. This means the problem is probably not in the complicated likelihood expressions (a good thing), but somewhere else. Likely a programming error, but I'm not sure.

rBatt commented 8 years ago

I may have found the problem. Did anyone else know this?

```r
sort(as.character(1:15))
# [1] "1"  "10" "11" "12" "13" "14" "15" "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"
```

I think this caused some jumbling in the ordering of the covariates relative to the response variable. This is definitely a huge bug, and I think it's been around the whole time I was testing the Stan models. It would explain why the intercept works but any continuous covariate doesn't: the intercept doesn't change among the strata or years (both of those columns were converted to characters).

Still working through full implications of this bug. I don't know for sure if it will solve the problem, but it sure explains a lot.

To fix it: I can pad the start of all the character strings with 0's to make them the same length; I could try factors (what's the default sorting/ordering there? ugh ... probably even less predictable); or I can try coercing to numeric (each option is sketched below). The problem is that there's no guarantee what the input classes of the columns are, which is why the function I wrote to format the data first converted everything to character (anything can be converted to character).
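A sketch of those three options on a toy set of IDs (illustrative only; the real fix belongs in msomData.R):

```r
ids <- as.character(c(1, 2, 10, 11, 15))

# Option 1: zero-pad so that character sort order matches numeric order
pad <- strrep("0", max(nchar(ids)) - nchar(ids))
sort(paste0(pad, ids))
# [1] "01" "02" "10" "11" "15"

# Option 2: factors inherit the alphabetical ordering of their character
# levels, so they reproduce the same problem
levels(factor(ids))
# [1] "1"  "10" "11" "15" "2"

# Option 3: coerce to numeric before sorting; fails for genuinely
# non-numeric IDs (as.numeric() returns NA with a warning)
sort(as.numeric(ids))
# [1]  1  2 10 11 15
```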

This explains so much. Hopefully it solves the problem.


Bug is in here: https://github.com/rBatt/trawl/blob/stan/trawlDiversity/R/msomData.R#L109-L114

mtingley commented 8 years ago

Ha. Ouch, that hurts. Classic coding error.

Factors will sort alphabetically, so a factor built from numbers may have the same default sorting problem.
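For example:

```r
f <- factor(as.character(c(2, 10, 1)))
levels(f)      # "1" "10" "2"  -- alphabetical, not numeric
as.integer(f)  # 3 2 1         -- the underlying codes follow that ordering
```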

Why do you have numbers stored as character in the first place? Never mind, the answer is irrelevant; your solutions seem reasonable.

rBatt commented 8 years ago

They aren't originally stored as characters, but I like to try to standardize class when I can.

Particularly for merging. Also, data.table is a super-efficient data structure, and as a result it doesn't do a lot of automatic coercion of class types. So if you try to assign NA to an element in a column that holds numbers, you can sometimes get an error, because NA is type "logical" and the column is numeric. Instead, you need to assign a typed NA: NA_real_ for numeric columns, NA_character_ for character columns, and so on.
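A minimal illustration of the typed-NA point (behavior can vary a bit across data.table versions):

```r
library(data.table)
DT <- data.table(id = c("a", "b"), val = c(1.5, 2.5))

# Plain NA is type logical; assigning it into a typed column may warn or
# error depending on the data.table version. The type-stable forms:
DT[1, id  := NA_character_]  # character column gets a character NA
DT[2, val := NA_real_]       # numeric column gets a numeric NA
```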

On the other hand, I've often been bitten by assuming that class coercions will be consistent among functions. E.g., it can really suck when you're trying to do a merge and the column somehow ends up as a factor in one data object and an integer vector in the other. This happens very often with things like "year", which is a simple integer repeated only a handful of times. R can end up converting the 4-digit year into its factor level codes (1, 2, ...) and then finding no matches (1996, 1997, ... never match 1, 2, ...). Even worse is when day of year starts in the fall or winter (day 300, 301, ... 365, 1, 2, ...) and gets matched against factor levels (1, 2, ... 66, 67, ...), and the damn thing merges January 1, 1997 (day 1, factor level 67) with October 26, 1996 (day 300, factor level 1).
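The underlying trap is the factor-to-integer conversion:

```r
years <- factor(c(1996, 1997, 1998))
as.integer(years)                # 1 2 3           -- the level codes
as.integer(as.character(years))  # 1996 1997 1998  -- the actual years
```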

To avoid those problems, I like to use character. It's great for "matching" things. I will often sort characters too, but I don't particularly assume a sort order, just consistent sorting. The problem here was that I didn't actually request a sort, and I forgot that data.table's merge will sort on the by= columns. Oops. Because the species occurrence data weren't processed by the exact same procedure as the covariates, one got this treatment and the other didn't.

Bigger picture, this is one reason why it's good to carry things together in the same object, like a data.table or data.frame, as opposed to a whole bunch of individually named objects. E.g., one thing that helped me track this bug down was that I had converted my Stan code to use matrix algebra everywhere. This is for speed, but also for simplicity of expression (and generality ...). So when I saw that the intercept parameter was being fit properly but the slope wasn't, I knew something was up, because a coding error that affected one should, in most places, have affected the other. Conversely, it is too memory-inefficient to code these sparse data as data.tables (containing covariates, responses, and all their "ID variables") instead of separate arrays containing each set of covariates and responses (arrays don't have to encode the "ID variables" as often, because that information is contained in the definition of the dimensions). It was this attempt to save memory that separated the objects and led to the differential treatment by sorting.

After fiddling a bit this morning, I don't think adjusting the contents of the ID variables is the way to go (i.e., don't pad with 0's); I think that just exposes me to similar surprises. Instead, I think it's important to explicitly reference and check the ordering. A general scheme I like to follow: if I have a data.frame or matrix with columns "year" then "species", I refer to dat[,"year"], not dat[,1]. So the plan is to rely explicitly on dimension names in places, and/or to check and intentionally sort objects more frequently.

By far the easiest solution is to ensure that the data.table passed into the function is already sorted by the ID-variable columns.
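With data.table that presort can be made explicit and verifiable (illustrative column names):

```r
library(data.table)
dat <- data.table(stratum = c("10", "2", "1"), year = c(2011L, 2010L, 2012L))

# setkey() sorts by reference and records the key, so downstream code can
# check the ordering it assumes instead of silently trusting it
setkey(dat, stratum, year)
key(dat)     # "stratum" "year"
haskey(dat)  # TRUE -- a cheap guard inside the data-formatting function
```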


Along those lines, I don't think this mistake affected my original testing of the EBS data: the data.tables with real data in them were already sorted, and the stratum names tended to have the same number of digits anyway.

rBatt commented 8 years ago

OK, I decided that it's a good idea to ensure class matching instead of converting to an arbitrary class. But this is proving to be tricky.

See question: http://stackoverflow.com/q/34091811/2343633

rBatt commented 8 years ago

That conversion issue isn't trivial, check out the attention that question got pretty quickly! The developer of data.table asked me to make it a feature request on GitHub: https://github.com/Rdatatable/data.table/issues/1460

I got an answer on SO that looks like it'll work for now. Inelegant, but concise and probably pretty effective.
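I won't reproduce the answer here, but the general shape of a class-matching helper might look like this (a sketch, with a hypothetical helper name; not the accepted SO answer):

```r
# Coerce x to the class of a template vector (hypothetical helper).
# methods::as() covers the common atomic classes.
matchClass <- function(x, template) {
  methods::as(x, class(template)[1])
}

matchClass(c("1", "2", "10"), 1L)  # integer: 1 2 10
matchClass(1:3, "a")               # character: "1" "2" "3"
```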

rBatt commented 8 years ago

Also, watch out for this: http://stackoverflow.com/questions/34093056/asx-double-and-as-doublex-are-inconsistent

rBatt commented 8 years ago

This "weirdness" led to this query to R devel: https://stat.ethz.ch/pipermail/r-devel/2015-December/072079.html

rBatt commented 8 years ago

My original attempt at a solution might prove to be the way to go after all, because integers are automatically promoted to double/numeric when needed. Again, see the SO question and answers.
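Integer-to-double promotion in base R, for reference:

```r
x <- 1:3     # integer vector
class(x)     # "integer"
x[2] <- 2.5  # assigning a double silently promotes the whole vector
class(x)     # "numeric"
1L + 0.5     # 1.5 -- arithmetic promotes too
```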

rBatt commented 8 years ago

Just to visually illustrate what I think the problem is:

(figure: issue103_diagnosis)

Although I'm still trying to get the sorting figured out (sorted, ha).

rBatt commented 8 years ago

Figures from verification (in fixing commit, then deleted after):

(figure: psi_theta_est_vs_true)

(figure: alpha_est_boxplots_bluedottrue)

The blue dots in the boxplots are true values.

Simulation settings:

```r
set.seed(1337)
sim_out <- sim_occ(
    ns=12, grid.w=5, grid.h=7, grid.t=3, h.slope=0.5, w.sd=0.5, n0s=5,
    detect.mus=c(0, 0.5, 1), n.ss=9,
    alpha_mu=c(0.75, 0.25, -5), alpha_sd=c(0.5, 0.015, 0.01),
    format.msom="jags"
)
```

Analysis settings:

```r
model_file <- "trawl/trawlDiversity/inst/stan/msomStatic.stan"

sim_msom <- rstan::stan(
    file=model_file,
    data=staticData,
    control=list(stepsize=0.01, adapt_delta=0.95, max_treedepth=15),
    chains=4, iter=30, seed=1337, cores=4, verbose=F
)
```

Note that the simulation was pretty small, and only 30 chain iterations were used.

rBatt commented 8 years ago

https://groups.google.com/forum/?utm_source=digest&utm_medium=email#!topic/stan-users/xMDiFDQB0SU