Closed marcdotson closed 3 years ago
@jeff-dotson FWIW, I've tested your heterogeneous and homogeneous pathologies for 400-, 1000-, and 2000-member ensembles for ANA only. @RogerOverNOut interestingly, it looks like the weights, now appended to `ensemble_fit` (new model output loaded to the shared Drive folder), continue to heavily weight the final ensemble member, even though I'm now randomly drawing 400 or 1000 ways to induce the pathology on the betas from the 2000 total in the simulated data. In other words, if there was any "signal" in the final ensemble member previously, that is no longer the case.
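As a rough illustration of the random draw described above, here is a minimal Python sketch (the project itself appears to be R/Stan, so this is a stand-in, not the actual code). The function name `draw_member_indices` and the seed are hypothetical; the only thing taken from the thread is drawing 400 or 1000 pathology draws without replacement from the 2000 available.

```python
import random


def draw_member_indices(n_total, n_members, seed=None):
    """Draw n_members distinct indices from range(n_total), without replacement."""
    rng = random.Random(seed)
    return rng.sample(range(n_total), n_members)


# Hypothetical usage: pick which of the 2000 simulated pathology draws
# define the members of a 400- or 1000-member ensemble.
idx_400 = draw_member_indices(2000, 400, seed=42)
idx_1000 = draw_member_indices(2000, 1000, seed=42)
```

Because the order of the sampled indices is random as well, whichever draw ends up as the final member has nothing special about it by construction.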
Here's the homogeneous, 400-member ensemble results:
Model | LOO | Hit Rate | Hit Prob |
---|---|---|---|
HMNL | -2861 | 0.460 | 0.397 |
Ensemble | -2922 | 0.458 | 0.381 |
Here's the homogeneous, 1000-member ensemble results:
Model | LOO | Hit Rate | Hit Prob |
---|---|---|---|
HMNL | -2861 | 0.460 | 0.397 |
Ensemble | -2913 | 0.458 | 0.383 |
Here's the homogeneous, 2000-member ensemble results:
Model | LOO | Hit Rate | Hit Prob |
---|---|---|---|
HMNL | -2861 | 0.460 | 0.397 |
Ensemble | -2928 | 0.461 | 0.380 |
Here's the heterogeneous, 400-member ensemble results:
Model | LOO | Hit Rate | Hit Prob |
---|---|---|---|
HMNL | -2985 | 0.479 | 0.379 |
Ensemble | -3028 | 0.474 | 0.365 |
Here's the heterogeneous, 2000-member ensemble results:
Model | LOO | Hit Rate | Hit Prob |
---|---|---|---|
HMNL | -2985 | 0.479 | 0.379 |
Ensemble | -3031 | 0.472 | 0.365 |
Beyond a possible issue with the ensemble weights, I don't know if this says much about heterogeneous vs. homogeneous pathologies or even the ensemble size, but I thought I'd place it here as further evidence for what we know already: This ensemble does just as well as HMNL for ANA but will really shine when we get to joint pathologies and then real data -- where the data is genuinely pathological with respect to the HMNL. I should have something to share by Friday.
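For anyone reading the Hit Rate and Hit Prob columns in the tables above, here is a minimal Python sketch of one common way to compute them for holdout choice tasks. This is an assumption about the definitions (the thread doesn't spell them out): hit rate is the share of tasks where the highest-probability alternative matches the observed choice, and hit probability is the average predicted probability of the observed choice. The function name and the toy data are made up for illustration.

```python
def hit_metrics(pred_probs, chosen):
    """Return (hit_rate, hit_prob) for a set of holdout choice tasks.

    pred_probs: one list of predicted choice probabilities per task.
    chosen: index of the alternative actually chosen in each task.
    """
    hits = 0
    prob_sum = 0.0
    for probs, y in zip(pred_probs, chosen):
        # Predicted choice is the alternative with the highest probability.
        predicted = max(range(len(probs)), key=lambda j: probs[j])
        hits += int(predicted == y)
        # Probability the model assigned to what was actually chosen.
        prob_sum += probs[y]
    n = len(chosen)
    return hits / n, prob_sum / n


# Toy data: four tasks with three alternatives each.
pred_probs = [
    [0.60, 0.30, 0.10],
    [0.20, 0.50, 0.30],
    [0.10, 0.20, 0.70],
    [0.40, 0.35, 0.25],
]
chosen = [0, 2, 2, 1]
hit_rate, hit_prob = hit_metrics(pred_probs, chosen)
```

Under these definitions a model can gain on one metric and lose on the other, since hit rate only looks at the top-ranked alternative while hit probability uses the full predicted distribution.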
To illustrate the problem with the current ensemble weights, here is a quick plot of the weights by ensemble member for each of the above models.
[Plots not preserved: weights by ensemble member for the homogeneous 400-, 1000-, and 2000-member ensembles and the heterogeneous 400-, 1000-, and 2000-member ensembles.]
Ideas to address the ensemble weights problem:
@jeff-dotson @RogerOverNOut, as promised, using the ANA-only ensembles, here are the results with equal weights and with the last member dropped and the remaining weights renormalized, compared with the loo weights that overweight the last member:
Here's the homogeneous, 400-member ensemble results:
Model | LOO | Hit Rate | Hit Prob |
---|---|---|---|
HMNL | -2861 | 0.460 | 0.397 |
Ensemble | -2922 | 0.458 | 0.381 |
Ensemble (Equal Weights) | -2931 | 0.458 | 0.380 |
Ensemble (Renormalized) | -2903 | 0.458 | 0.379 |
Here's the homogeneous, 1000-member ensemble results:
Model | LOO | Hit Rate | Hit Prob |
---|---|---|---|
HMNL | -2861 | 0.460 | 0.397 |
Ensemble | -2913 | 0.458 | 0.383 |
Ensemble (Equal Weights) | -2931 | 0.457 | 0.380 |
Ensemble (Renormalized) | -2670 | 0.458 | 0.347 |
Here's the homogeneous, 2000-member ensemble results:
Model | LOO | Hit Rate | Hit Prob |
---|---|---|---|
HMNL | -2861 | 0.460 | 0.397 |
Ensemble | -2928 | 0.461 | 0.380 |
Ensemble (Equal Weights) | -2930 | 0.454 | 0.380 |
Ensemble (Renormalized) | -2899 | 0.462 | 0.376 |
Here's the heterogeneous, 400-member ensemble results:
Model | LOO | Hit Rate | Hit Prob |
---|---|---|---|
HMNL | -2985 | 0.479 | 0.379 |
Ensemble | -3028 | 0.474 | 0.365 |
Ensemble (Equal Weights) | -3032 | 0.475 | 0.364 |
Ensemble (Renormalized) | -2976 | 0.475 | 0.360 |
Here's the heterogeneous, 1000-member ensemble results:
Model | LOO | Hit Rate | Hit Prob |
---|---|---|---|
HMNL | -2985 | 0.479 | 0.379 |
Ensemble | -3016 | 0.458 | 0.369 |
Ensemble (Equal Weights) | -3033 | 0.472 | 0.364 |
Ensemble (Renormalized) | -2621 | 0.476 | 0.319 |
Here's the heterogeneous, 2000-member ensemble results:
Model | LOO | Hit Rate | Hit Prob |
---|---|---|---|
HMNL | -2985 | 0.479 | 0.379 |
Ensemble | -3031 | 0.472 | 0.365 |
Ensemble (Equal Weights) | -3033 | 0.472 | 0.364 |
Ensemble (Renormalized) | -3016 | 0.472 | 0.363 |
So it doesn't do much, does it? I mean, the LOO fit changes most. We should still investigate, but I'm going to move on to getting the joint ensemble working knowing we can use equal weights or renormalize and essentially get the same results.
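The two stop-gap weighting schemes compared in the tables above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the project's code: the function names are hypothetical, the example weights are invented to mimic one dominant final member, and "renormalized" is assumed to mean setting the last member's weight to zero and rescaling the rest to sum to one.

```python
def equal_weights(n_members):
    """Give every ensemble member the same weight."""
    return [1.0 / n_members] * n_members


def drop_last_and_renormalize(weights):
    """Zero out the final member's weight and rescale the rest to sum to 1."""
    kept = weights[:-1]
    total = sum(kept)
    if total <= 0:
        raise ValueError("remaining weights must sum to a positive value")
    return [w / total for w in kept] + [0.0]


def combine(member_probs, weights):
    """Weighted average of each member's predicted choice probabilities."""
    n_alts = len(member_probs[0])
    return [
        sum(w * probs[j] for w, probs in zip(weights, member_probs))
        for j in range(n_alts)
    ]


# Invented weights where the final member dominates.
loo_like = [0.05, 0.05, 0.10, 0.80]
renormalized = drop_last_and_renormalize(loo_like)
uniform = equal_weights(len(loo_like))

# One choice task, three alternatives, four members.
member_probs = [
    [0.5, 0.3, 0.2],
    [0.4, 0.4, 0.2],
    [0.6, 0.2, 0.2],
    [0.1, 0.1, 0.8],
]
pred_equal = combine(member_probs, uniform)
pred_renorm = combine(member_probs, renormalized)
```

If the members' predictions are similar to one another, any of these weightings will give nearly the same combined prediction, which would be consistent with the hit rates barely moving.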
@jeff-dotson @RogerOverNOut this is a temporary stop-gap, but here are some results for ANA and screening jointly with equal weights, where the members that hit the ELBO error have been dropped:
Here's the heterogeneous, 200-member (actually 175-member after dropping) ensemble results:
Model | LOO | Hit Rate | Hit Prob |
---|---|---|---|
HMNL | -2658 | 0.417 | 0.354 |
Ensemble (Equal Weights) | -828430 | 0.406 | 0.351 |
Here's the heterogeneous, 400-member (actually 400-member, none dropped) ensemble results:
Model | LOO | Hit Rate | Hit Prob |
---|---|---|---|
HMNL | -2658 | 0.417 | 0.354 |
Ensemble (Equal Weights) | -807416 | 0.404 | 0.352 |
Here's the heterogeneous, 1000-member (actually 625-member after dropping) ensemble results:
Model | LOO | Hit Rate | Hit Prob |
---|---|---|---|
HMNL | -2658 | 0.417 | 0.354 |
Ensemble (Equal Weights) | -811617 | 0.399 | 0.349 |
So, uh, not great.
With the screening pathology, I consistently get `Error in sampler$call_sampler(c(args, dotlist)): stan::variational::advi::calc_ELBO:` or `stan::variational::normal_meanfield::calc_grad:` followed by "The number of dropped evaluations has reached its maximum amount (100). Your model may be either severely ill-conditioned or misspecified."
This behavior isn't present for the ANA pathology.
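The stop-gap of dropping members that fail and then using equal weights over the survivors could look roughly like this. It is a minimal Python sketch with made-up names: `fit_member` stands in for whatever fits a single ensemble member (in R this would presumably be a `tryCatch()` around the variational fit), and `fake_fit` only imitates the failure so the example runs.

```python
def fit_ensemble(member_specs, fit_member):
    """Fit every member, keeping the successes and recording the failures."""
    fits = []
    dropped = []
    for i, spec in enumerate(member_specs):
        try:
            fits.append(fit_member(spec))
        except RuntimeError as err:
            # Record which member failed and why, then move on.
            dropped.append((i, str(err)))
    return fits, dropped


def fake_fit(spec):
    """Hypothetical stand-in for a real fit; fails for flagged members."""
    if spec["fails"]:
        raise RuntimeError("The number of dropped evaluations has reached its maximum amount (100).")
    return {"member": spec["id"]}


specs = [
    {"id": 0, "fails": False},
    {"id": 1, "fails": True},
    {"id": 2, "fails": False},
    {"id": 3, "fails": True},
]
fits, dropped = fit_ensemble(specs, fake_fit)

# Equal weights over the members that survived.
weights = [1.0 / len(fits)] * len(fits)
```

Keeping the list of dropped members around makes it easy to report the effective ensemble size (for example 175 of 200) and to check whether the failures cluster on particular pathology draws.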
A few things to point out from Automatic Variational Inference in Stan:
I'm going to try and put together a joint ensemble with ANA and respondent quality for the time being and then I'll return to screening as needed.
@jeff-dotson @RogerOverNOut putting aside screening for a moment, here are the results for simulated data with both ANA and respondent quality estimated with a heterogeneous, 1000-member ensemble:
Model | LOO | Hit Rate | Hit Prob |
---|---|---|---|
HMNL | -2958 | 0.504 | 0.386 |
Ensemble | -3022 | 0.496 | 0.369 |
Ensemble (Equal Weights) | -3027 | 0.497 | 0.368 |
The weighting indicates variety in the ensemble members (none of that final-model up-weighting):
But clearly we're missing something -- perhaps more variation in the clever randomization?
I'm still working to get this running on a real dataset. The same problems we saw when screening is present are there for real data, so I'm just letting it run with actual sampling instead of VB for now.
Inducing more variation in the clever randomization by:
After making sure ANA applies at the attribute level and fixing the number of attribute levels being hard-coded in `clever_randomization()`, along with inducing some more variation by randomizing the number of attributes to which ANA applies, here are the results for simulated data with both ANA and respondent quality estimated with a heterogeneous, 1000-member ensemble:
Model | LOO | Hit Rate | Hit Prob |
---|---|---|---|
HMNL | -2953 | 0.488 | 0.379 |
Ensemble | -3065 | 0.503 | 0.358 |
Ensemble (Equal Weights) | -3073 | 0.50 | 0.356 |
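To make the attribute-level randomization concrete, here is a minimal Python sketch of drawing one ANA pattern. It is not the actual `clever_randomization()` implementation: the function name `draw_ana_mask`, the example level counts, and the choice to always leave at least one attribute attended are all assumptions made for illustration. The number of levels per attribute is passed in rather than hard-coded, and the number of non-attended attributes is itself random.

```python
import random


def draw_ana_mask(levels_per_attribute, rng):
    """Return a 0/1 mask over all level columns for one ensemble member.

    A 0 marks every level of an attribute that is not attended to, so
    non-attendance is applied to whole attributes, not single levels.
    """
    n_attributes = len(levels_per_attribute)
    # Randomize how many attributes are ignored (assumption: never all of them).
    n_ignored = rng.randint(0, n_attributes - 1)
    ignored = set(rng.sample(range(n_attributes), n_ignored))
    mask = []
    for a, n_levels in enumerate(levels_per_attribute):
        mask.extend([0 if a in ignored else 1] * n_levels)
    return mask


# Hypothetical design: four attributes with 3, 4, 2, and 5 level columns.
levels = [3, 4, 2, 5]
rng = random.Random(1)
mask = draw_ana_mask(levels, rng)
```

Each ensemble member would get its own mask, which is then used to constrain the corresponding betas for that member.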
The final-model up-weighting is back:
The model that was running this on a real dataset crashed. However, some of my issues may be compiler-specific.
Okay, I found another mistake in the code. We have never seen results that actually include respondent quality as a pathology. Running the above again and starting on a detailed code review and documentation.
Sorry, not a great code maintainer yet.
Latest results for simulated data with both ANA and respondent quality estimated with a heterogeneous, 1000-member ensemble (in this instance using sequential, full posterior sampling, each model with a single chain and thinned draws):
Model | LOO | Hit Rate | Hit Prob |
---|---|---|---|
HMNL | -2953 | 0.488 | 0.379 |
Ensemble | -2234 | 0.333 | 0.344 |
Ensemble (Equal Weights) | -2378 | 0.329 | 0.342 |
LOO appears to be doing its part:
Finally, results for real data where we account for both ANA and respondent quality. Again, it's a 1000-member ensemble using multiple weeks' worth of sequential, full posterior sampling with single chains and thinned draws:
Model | LOO | Hit Rate | Hit Prob |
---|---|---|---|
HMNL | -2756 | 0.403 | 0.348 |
Ensemble (LOO Weights) | -871 | 0.259 | 0.244 |
Ensemble (Equal Weights) | -1263 | 0.245 | 0.247 |
All right, @jeff-dotson @RogerOverNOut that didn't take as long as I'd feared. The results are consistent with what we've seen -- and with real data that improvement in LOO with the ensemble weights is huge. Again, we aren't seeing it translated into predictive fit improvement, which again necessitates looking at other meta learners.
Oh, and, ruh roh:
I've updated the ensemble fit object in the shared folder, FWIW.
Closing out this issue with PR #67.
All previous work has been merged together with quality-of-life improvements added, as detailed in PR #61.
The `joint-pathologies` branch is to, as it suggests, get the conjoint ensemble running on the pathologies jointly (ANA and screening to start with).