Running the ensemble for two or more pathologies jointly

marcdotson commented 3 years ago

All previous work has been merged together with quality-of-life improvements added, as detailed in PR #61.

The joint-pathologies branch is, as its name suggests, for getting the conjoint ensemble running on the pathologies jointly (ANA and screening to start with).

marcdotson commented 3 years ago

@jeff-dotson FWIW, I've tested your heterogeneous and homogeneous pathologies for 400-, 1000-, and 2000-member ensembles for ANA only. @RogerOverNOut interestingly, the weights, now appended to ensemble_fit (new model output loaded to the shared Drive folder), continue to heavily weight the final ensemble member, even though I'm now randomly drawing 400 or 1000 ways to induce the pathology on the betas from the 2000 total in the simulated data. In other words, any "signal" the final ensemble member might have carried previously is now gone.

Here's the homogeneous, 400-member ensemble results:

Model	LOO	Hit Rate	Hit Prob
HMNL	-2861	0.460	0.397
Ensemble	-2922	0.458	0.381

Here's the homogeneous, 1000-member ensemble results:

Model	LOO	Hit Rate	Hit Prob
HMNL	-2861	0.460	0.397
Ensemble	-2913	0.458	0.383

Here's the homogeneous, 2000-member ensemble results:

Model	LOO	Hit Rate	Hit Prob
HMNL	-2861	0.460	0.397
Ensemble	-2928	0.461	0.380

Here's the heterogeneous, 400-member ensemble results:

Model	LOO	Hit Rate	Hit Prob
HMNL	-2985	0.479	0.379
Ensemble	-3028	0.474	0.365

Here's the heterogeneous, 2000-member ensemble results:

Model	LOO	Hit Rate	Hit Prob
HMNL	-2985	0.479	0.379
Ensemble	-3031	0.472	0.365

Beyond a possible issue with the ensemble weights, I don't know if this says much about heterogeneous vs. homogeneous pathologies or even the ensemble size, but I thought I'd place it here as further evidence for what we know already: This ensemble does just as well as HMNL for ANA but will really shine when we get to joint pathologies and then real data -- where the data is genuinely pathological with respect to the HMNL. I should have something to share by Friday.
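
For anyone reading the Hit Rate and Hit Prob columns cold, here is a minimal sketch of how those two holdout metrics are commonly computed. The thread does not define them, so this assumes the usual meanings: hit rate is the share of holdout tasks where the alternative with the highest predicted probability is the one chosen, and hit probability is the mean predicted probability of the chosen alternative. The function name `hit_metrics` and the toy numbers are made up for illustration and are not from the repo.

```python
def hit_metrics(probs, chosen):
    """Return (hit_rate, hit_prob) for a set of holdout choice tasks.

    probs[t][a] is the predicted probability of alternative a in task t.
    chosen[t] is the index of the alternative actually chosen in task t.
    """
    hits = 0
    prob_sum = 0.0
    for p, c in zip(probs, chosen):
        # Index of the alternative with the highest predicted probability.
        predicted = max(range(len(p)), key=p.__getitem__)
        hits += predicted == c
        prob_sum += p[c]
    n_tasks = len(chosen)
    return hits / n_tasks, prob_sum / n_tasks


# Toy example: four tasks, three alternatives each (made-up numbers).
probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.2, 0.7],
    [0.5, 0.3, 0.2],
]
chosen = [0, 2, 2, 1]
hit_rate, hit_prob = hit_metrics(probs, chosen)
print(hit_rate, hit_prob)
```

Under these definitions a model can gain on one metric and lose on the other, since hit rate only looks at the top-ranked alternative while hit probability rewards calibrated probabilities.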

marcdotson commented 3 years ago

To illustrate the problem with the current ensemble weights, here is a quick plot of the weights by ensemble member for the above models.

[Plots of weight by ensemble member, one per ensemble: homogeneous 400-, 1000-, and 2000-member; heterogeneous 400-, 1000-, and 2000-member.]

marcdotson commented 3 years ago

Ideas to address the ensemble weights problem:

[ ] Investigate why loo would assign most weight to the last member.
[x] Manually set equal weights.
[ ] Use a Dirichlet prior over the ensemble members.

marcdotson commented 3 years ago

@jeff-dotson @RogerOverNOut, as promised, using the ANA-only ensembles, here are the results with equal weights and with the last member dropped and the remaining weights renormalized, compared with the loo weights that overweight the last member:

Here's the homogeneous, 400-member ensemble results:

Model	LOO	Hit Rate	Hit Prob
HMNL	-2861	0.460	0.397
Ensemble	-2922	0.458	0.381
Ensemble (Equal Weights)	-2931	0.458	0.380
Ensemble (Renormalized)	-2903	0.458	0.379

Here's the homogeneous, 1000-member ensemble results:

Model	LOO	Hit Rate	Hit Prob
HMNL	-2861	0.460	0.397
Ensemble	-2913	0.458	0.383
Ensemble (Equal Weights)	-2931	0.457	0.380
Ensemble (Renormalized)	-2670	0.458	0.347

Here's the homogeneous, 2000-member ensemble results:

Model	LOO	Hit Rate	Hit Prob
HMNL	-2861	0.460	0.397
Ensemble	-2928	0.461	0.380
Ensemble (Equal Weights)	-2930	0.454	0.380
Ensemble (Renormalized)	-2899	0.462	0.376

Here's the heterogeneous, 400-member ensemble results:

Model	LOO	Hit Rate	Hit Prob
HMNL	-2985	0.479	0.379
Ensemble	-3028	0.474	0.365
Ensemble (Equal Weights)	-3032	0.475	0.364
Ensemble (Renormalized)	-2976	0.475	0.360

Here's the heterogeneous, 1000-member ensemble results:

Model	LOO	Hit Rate	Hit Prob
HMNL	-2985	0.479	0.379
Ensemble	-3016	0.458	0.369
Ensemble (Equal Weights)	-3033	0.472	0.364
Ensemble (Renormalized)	-2621	0.476	0.319

Here's the heterogeneous, 2000-member ensemble results:

Model	LOO	Hit Rate	Hit Prob
HMNL	-2985	0.479	0.379
Ensemble	-3031	0.472	0.365
Ensemble (Equal Weights)	-3033	0.472	0.364
Ensemble (Renormalized)	-3016	0.472	0.363

So it doesn't do much, does it? I mean, the LOO fit changes most. We should still investigate, but I'm going to move on to getting the joint ensemble working knowing we can use equal weights or renormalize and essentially get the same results.
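
For reference, the two stop-gap weightings compared above can be sketched as follows. This is an illustration in plain Python rather than the project's actual code; the function names, the four-member weight vector, and the member probabilities are all invented to mimic the "last member gets most of the weight" pattern.

```python
def equal_weights(n_members):
    """Give every ensemble member the same weight."""
    return [1.0 / n_members] * n_members


def drop_last_and_renormalize(weights):
    """Discard the final member's weight and rescale the rest to sum to 1."""
    kept = list(weights[:-1])
    total = sum(kept)
    if total <= 0:
        raise ValueError("remaining weights sum to zero; cannot renormalize")
    return [w / total for w in kept]


def combine(member_probs, weights):
    """Weighted average of the members' predicted choice probabilities
    for one choice task; member_probs[k][a] is member k, alternative a."""
    if len(member_probs) != len(weights):
        raise ValueError("need exactly one weight per member")
    n_alts = len(member_probs[0])
    return [
        sum(w * p[a] for w, p in zip(weights, member_probs))
        for a in range(n_alts)
    ]


# Made-up weights with most of the mass on the last member.
loo_like = [0.05, 0.05, 0.10, 0.80]
member_probs = [[0.5, 0.5], [0.6, 0.4], [0.7, 0.3], [0.2, 0.8]]

renormalized = drop_last_and_renormalize(loo_like)
print(combine(member_probs, loo_like))
print(combine(member_probs, equal_weights(4)))
# Dropping the last weight also means dropping the last member's predictions.
print(combine(member_probs[:-1], renormalized))
```

Note that the renormalized variant is an ensemble of one fewer member, so its predictions have to be combined without the last member as well.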

marcdotson commented 3 years ago

@jeff-dotson @RogerOverNOut this is a temporary stop-gap, but here are some results for ANA and screening jointly with equal weights, with the members that hit the ELBO error dropped:

Here's the heterogeneous, 200-member (actually 175-member after dropping) ensemble results:

Model	LOO	Hit Rate	Hit Prob
HMNL	-2658	0.417	0.354
Ensemble (Equal Weights)	-828430	0.406	0.351

Here's the heterogeneous, 400-member (actually 400-member, none dropped) ensemble results:

Model	LOO	Hit Rate	Hit Prob
HMNL	-2658	0.417	0.354
Ensemble (Equal Weights)	-807416	0.404	0.352

Here's the heterogeneous, 1000-member (actually 625-member after dropping) ensemble results:

Model	LOO	Hit Rate	Hit Prob
HMNL	-2658	0.417	0.354
Ensemble (Equal Weights)	-811617	0.399	0.349

So, uh, not great.
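
The stop-gap used here (drop any member whose fit errors out, then weight the survivors equally) could look roughly like this. It is a generic Python sketch, not the repo's R code: `fit_members`, `fake_fit`, and the spec dictionaries are hypothetical stand-ins, and the `RuntimeError` simply plays the role of the ELBO error.

```python
def fit_members(fit_one, member_specs):
    """Fit each ensemble member, skipping any whose fit raises RuntimeError.

    Returns the successful fits, equal weights over those fits, and a list
    of (index, message) pairs for the members that were dropped.
    """
    fits, dropped = [], []
    for i, spec in enumerate(member_specs):
        try:
            fits.append(fit_one(spec))
        except RuntimeError as err:
            dropped.append((i, str(err)))
    weights = [1.0 / len(fits)] * len(fits) if fits else []
    return fits, weights, dropped


def fake_fit(spec):
    """Stand-in for a real model fit; fails for some specs on purpose."""
    if spec["id"] % 4 == 0:
        raise RuntimeError("dropped evaluations reached maximum")
    return {"id": spec["id"]}


specs = [{"id": i} for i in range(8)]
fits, weights, dropped = fit_members(fake_fit, specs)
print(len(fits), len(dropped))
```

Keeping the list of dropped members around makes it easy to check whether the failures cluster on particular ways of inducing the pathology.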

marcdotson commented 3 years ago

With the screening pathology, I consistently get:

Error in sampler$call_sampler(c(args, dotlist)): stan::variational::advi::calc_ELBO: OR stan::variational::normal_meanfield::calc_grad: The number of dropped evaluations has reached its maximum amount (100). Your model may be either severely ill-conditioned or misspecified.

This behavior isn't present for the ANA pathology.

A few things to point out from Automatic Variational Inference in Stan:

The evidence lower bound (ELBO) is a proxy to the Kullback-Leibler (KL) divergence, since the KL divergence often doesn’t have an analytic form, and we want to find a good approximating density for the posterior to conduct variational inference. Maximizing the ELBO minimizes the KL divergence and relies on the joint model rather than the posterior itself.
Whatever randomization we are inducing with screening in particular must be causing problems for the transformation-based approach to finding an approximation to the posterior. The ELBO is maximized (and the KL divergence minimized) in the standardized space with a fixed standard Gaussian approximation. Why would screening in particular cause this?
Reducing the size of the negative number used to induce screening (from -1000 to -10) appears to have an impact. Too large a number was causing infinite Pareto K diagnostic errors, which I assume was connected with the ELBO evaluation dropping in the transformed space. However, even with the smaller number and fewer Pareto K diagnostic errors, the problem persists for screening.
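
The first point above can be written as a single identity. The notation is the standard one for variational inference and is not taken from the thread: y is the data, θ the parameters, and q(θ) the approximating density.

```latex
\log p(y)
  \;=\;
  \underbrace{\mathbb{E}_{q(\theta)}\!\left[\log p(y, \theta)\right]
            - \mathbb{E}_{q(\theta)}\!\left[\log q(\theta)\right]}_{\mathrm{ELBO}(q)}
  \;+\;
  \mathrm{KL}\!\left(q(\theta)\,\|\,p(\theta \mid y)\right)
```

Because log p(y) does not depend on q, raising the ELBO lowers the KL divergence by the same amount, and the ELBO itself only needs the joint density p(y, θ), not the posterior.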

I'm going to try and put together a joint ensemble with ANA and respondent quality for the time being and then I'll return to screening as needed.
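
On the size of the screening constant: a small sketch of why -1000 vs. -10 could matter numerically. This assumes screening is induced by adding the constant to the utility of any alternative containing a screened-out level, which is my reading and is not shown in the thread; the utilities and function names are made up.

```python
import math


def log_choice_probs(utilities):
    """Multinomial logit log-probabilities via a stable log-sum-exp."""
    m = max(utilities)
    log_norm = m + math.log(sum(math.exp(u - m) for u in utilities))
    return [u - log_norm for u in utilities]


def apply_screening(utilities, screened, penalty):
    """Add a (negative) penalty to the utility of each screened-out alternative."""
    return [u + penalty if s else u for u, s in zip(utilities, screened)]


base = [0.5, 0.2, 0.1]            # made-up utilities for three alternatives
screened = [False, False, True]   # the third alternative is screened out

for penalty in (-1000.0, -10.0):
    log_probs = log_choice_probs(apply_screening(base, screened, penalty))
    # Log-probability that the screened-out alternative is chosen.
    print(penalty, log_probs[2])
```

If an observed choice ever lands on an alternative the member treats as screened out, that single observation contributes a log-likelihood on the order of the penalty itself, which is one plausible route to extreme pointwise values. Whether that is what is tripping the ELBO evaluation here is not something this sketch can show.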

marcdotson commented 3 years ago

@jeff-dotson @RogerOverNOut putting aside screening for a moment, here are the results for simulated data with both ANA and respondent quality estimated with a heterogeneous, 1000-member ensemble:

Model	LOO	Hit Rate	Hit Prob
HMNL	-2958	0.504	0.386
Ensemble	-3022	0.496	0.369
Ensemble (Equal Weights)	-3027	0.497	0.368

The weighting indicates variety in the ensemble members (none of that final-model up-weighting):

But clearly we're missing something -- perhaps more variation in the clever randomization?

I'm still working to get this running on a real dataset. The same problems we saw with screening are present for real data, so I'm just letting it run with actual sampling instead of VB for now.

marcdotson commented 3 years ago

Inducing more variation in the clever randomization by:

[x] Modifying clever randomization to have ANA operate at the attribute-level instead of level-level, which requires identifying the number of attribute levels and generalizing the function for real data (probably why I was having problems running an ensemble on real data to begin with).
[x] Look at inducing more variation in clever randomization for both ANA and respondent quality.
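
To make the attribute-level idea concrete, here is a rough sketch of drawing one ANA pattern where whole attributes are ignored and the number of ignored attributes is itself random. This is not the repo's clever_randomization(); `ana_mask` and its argument are hypothetical, and `levels_per_attribute` is assumed to hold the number of estimated coefficients per attribute (which depends on the coding scheme).

```python
import random


def ana_mask(levels_per_attribute, rng):
    """Return a 0/1 mask over the coefficient vector.

    All coefficients belonging to an ignored attribute are set to 0
    together; at least one attribute is always attended to.
    """
    n_attributes = len(levels_per_attribute)
    # Randomize how many attributes are ignored (0 up to all but one).
    n_ignored = rng.randint(0, n_attributes - 1)
    ignored = set(rng.sample(range(n_attributes), n_ignored))
    mask = []
    for attribute, n_levels in enumerate(levels_per_attribute):
        mask.extend([0 if attribute in ignored else 1] * n_levels)
    return mask


rng = random.Random(42)
levels_per_attribute = [3, 2, 4]  # made-up design: three attributes
for _ in range(3):
    print(ana_mask(levels_per_attribute, rng))
```

Passing the level counts in as an argument, rather than hard-coding them, is what lets the same function run on a real dataset with a different design.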

marcdotson commented 3 years ago

After making sure ANA applies at the attribute level and removing the hard-coded number of attribute levels in clever_randomization(), along with inducing some more variation by randomizing the number of attributes to which ANA applies, here are the results for simulated data with both ANA and respondent quality estimated with a heterogeneous, 1000-member ensemble:

Model	LOO	Hit Rate	Hit Prob
HMNL	-2953	0.488	0.379
Ensemble	-3065	0.503	0.358
Ensemble (Equal Weights)	-3073	0.50	0.356

The final-model up-weighting is back:

The model that was running this on a real dataset crashed. However, some of my issues may be compiler-specific.

marcdotson commented 3 years ago

Okay, I found another mistake in the code. We have never seen results that actually include respondent quality as a pathology. Running the above again and starting on a detailed code review and documentation.

Sorry, not a great code maintainer yet.

marcdotson commented 3 years ago

Latest results for simulated data with both ANA and respondent quality estimated with a heterogeneous, 1000-member ensemble (in this instance using sequential, full posterior sampling, each model with a single chain and thinned draws):

Model	LOO	Hit Rate	Hit Prob
HMNL	-2953	0.488	0.379
Ensemble	-2234	0.333	0.344
Ensemble (Equal Weights)	-2378	0.329	0.342

LOO appears to be doing its part:

marcdotson commented 3 years ago

Finally, results for real data where we account for both ANA and respondent quality. Again, it's a 1000-member ensemble using multiple weeks' worth of sequential, full posterior sampling with single chains and thinned draws:

Model	LOO	Hit Rate	Hit Prob
HMNL	-2756	0.403	0.348
Ensemble (LOO Weights)	-871	0.259	0.244
Ensemble (Equal Weights)	-1263	0.245	0.247

All right, @jeff-dotson @RogerOverNOut that didn't take as long as I'd feared. The results are consistent with what we've seen -- and with real data that improvement in LOO with the ensemble weights is huge. We still aren't seeing it translate into better predictive fit, which again points to looking at other meta learners.

Oh, and, ruh roh:

I've updated the ensemble fit object in the shared folder, FWIW.

marcdotson commented 3 years ago

Closing out this issue with PR #67.

marcdotson / conjoint-ensembles

Running the ensemble for two or more pathologies jointly #64