marcdotson / conjoint-ensembles

Using clever randomization and ensembling strategies to accommodate multiple data pathologies in conjoint studies.

Tune running the conjoint ensemble #48

Closed · marcdotson closed this issue 3 years ago

marcdotson commented 3 years ago

Mine, working on the ensemble-tuning branch.

Initial results for a 100-member ensemble using Beta for predictive fit:

| Model    | LOO   | Hit Rate | Hit Prob |
|----------|-------|----------|----------|
| HMNL     | -2732 | 0.566    | 0.446    |
| Ensemble | -13.4 | 0.567    | 0.402    |

marcdotson commented 3 years ago

Using the posterior means and scales from a full model as hyper-prior values (everything but the Omega_shape hyperparameter) in the ensemble has no impact on predictive fit as it is currently computed.
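
For reference, here is a minimal sketch of the kind of setup being described, in R with rstan. The file name hmnl.stan, the stan_data list, the tau scale parameter, and the hyper-prior entries Gamma_mean, Gamma_scale, tau_mean, and tau_scale are all hypothetical placeholders, not the repo's actual names:

```r
library(rstan)

# Fit the full (non-ensemble) HMNL once and summarize its upper-level posterior.
hmnl <- stan_model("hmnl.stan")                 # hypothetical file name
fit_full <- sampling(hmnl, data = stan_data)    # stan_data assumed to exist
draws <- rstan::extract(fit_full, pars = c("Gamma", "tau"))

# Pass posterior means and scales as hyper-prior values to each ensemble member,
# leaving Omega_shape at its default rather than informing it from the full model.
stan_data$Gamma_mean  <- apply(draws$Gamma, c(2, 3), mean)
stan_data$Gamma_scale <- apply(draws$Gamma, c(2, 3), sd)
stan_data$tau_mean    <- apply(draws$tau, 2, mean)
stan_data$tau_scale   <- apply(draws$tau, 2, sd)
```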

marcdotson commented 3 years ago

@jeff-dotson and @RogerOverNOut: Increasing the number of members in the ensemble, even from just 100 to 200 members, clearly has an impact on predictive fit as it is currently computed.

Results for a 200-member ensemble using Beta for predictive fit:

| Model    | LOO   | Hit Rate | Hit Prob |
|----------|-------|----------|----------|
| HMNL     | -2732 | 0.581    | 0.456    |
| Ensemble | -12.7 | 0.595    | 0.414    |

While I'm still working out how to easily run, and save output for, ensembles with more than 200 members, I think this is a clear indication of what we expected and hoped for.

RogerOverNOut commented 3 years ago

Cool!

marcdotson commented 3 years ago

@jeff-dotson and @RogerOverNOut, if you have any other possible solutions to this problem, I'm all ears. I'm trying to get past this computational issue so I don't bottleneck the project:

Running ensembles where the number of members/models > 400 runs into two memory issues:

  1. Running and storing the output of the conjoint ensemble itself; currently the model fit objects are all held in a single list.
  2. Extracting the ensemble draws used to produce ensemble weights and model/predictive fit.

The computational time at even 400 members/models has become prohibitive, even after modifying virtual memory allocations and extracting draws in parallel, which makes investigating the optimal number of members/models in the ensemble untenable. How can we address this computational bottleneck?

Depending on how many members/models we want in the ensemble and what parameter summaries we need, this could quickly become a big data problem. In that case, we would probably need to use a database to store model output so we can extract what we need in batches or even run the models on the database.
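
Short of a full database, one option would be to write each member's extracted draws to disk as it finishes and read them back in batches. A minimal sketch, assuming a hypothetical ensemble_member.stan file, a member_data list of per-member data, and an output/draws/ directory; none of these are the repo's actual names:

```r
library(rstan)

ensemble <- stan_model("ensemble_member.stan")  # hypothetical file name
dir.create("output/draws", recursive = TRUE, showWarnings = FALSE)
n_members <- 400

for (k in seq_len(n_members)) {
  # Fit member k, keep only the parameters needed downstream (here, hypothetically,
  # Gamma and log_lik), and drop the full stanfit object instead of listing it.
  fit_k <- vb(ensemble, data = member_data[[k]])
  draws_k <- rstan::extract(fit_k, pars = c("Gamma", "log_lik"))
  saveRDS(draws_k, file.path("output/draws", sprintf("member_%03d.rds", k)))
  rm(fit_k, draws_k)
  gc()
}

# Later, read the saved draws back in batches to compute ensemble weights
# and model/predictive fit without ever holding all members in memory.
batch_files <- file.path("output/draws", sprintf("member_%03d.rds", 1:100))
batch <- lapply(batch_files, readRDS)
```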

marcdotson commented 3 years ago

Let's see how the results change for the new predictive_fit_ensemble() without worrying about the Pareto k diagnostic warning and just saving out 100 draws. Results for a 200-member ensemble without tol_rel_obj = 0.0001 but with output_samples = 100, using Gamma means for predictive fit:

| Model    | LOO   | Hit Rate | Hit Prob |
|----------|-------|----------|----------|
| HMNL     | -2703 | 0.581    | 0.457    |
| Ensemble | -13.4 | 0.593    | 0.416    |

Okay, this aligns with what we saw with a complete set of draws. To check whether trying to fix the Pareto k warning matters, here are results for a 200-member ensemble with tol_rel_obj = 0.0001 and output_samples = 100, using Gamma means for predictive fit:

| Model    | LOO   | Hit Rate | Hit Prob |
|----------|-------|----------|----------|
| HMNL     | -2703 | 0.581    | 0.457    |
| Ensemble | -12.8 | 0.594    | 0.415    |

Same results. Looks like we don't need to slow down the estimation with tol_rel_obj = 0.0001! Let's take a step back and check once more with respect to the number of draws. Results for a 200-member ensemble with tol_rel_obj = 0.0001 and output_samples = 1000, using Gamma means for predictive fit:

| Model    | LOO   | Hit Rate | Hit Prob |
|----------|-------|----------|----------|
| HMNL     | -2703 | 0.581    | 0.457    |
| Ensemble | -12.8 | 0.592    | 0.416    |

Consistency! Means are sufficient. Needless big data problem averted.
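
For reference, the settings that came out of these checks amount to a member-level call like the following sketch (ensemble_member.stan and member_data are hypothetical placeholders): leave tol_rel_obj at its default, save only 100 approximate posterior draws, and reduce Gamma to its posterior means.

```r
library(rstan)

ensemble <- stan_model("ensemble_member.stan")  # hypothetical file name

# Default tol_rel_obj (no 0.0001 override), only 100 approximate posterior draws.
fit_k <- vb(ensemble, data = member_data[[k]], output_samples = 100)

# Reduce Gamma to posterior means; per the checks above, means are sufficient
# for computing predictive fit.
Gamma_mean_k <- apply(rstan::extract(fit_k, pars = "Gamma")$Gamma, c(2, 3), mean)
```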

marcdotson commented 3 years ago

By doing fit/extract in one step (only extracting Gamma and log_lik), we can finally get our first look at the results of a 400-member ensemble. Results for a 400-member ensemble without tol_rel_obj = 0.0001 but with output_samples = 100, using Gamma means for predictive fit:

| Model    | LOO   | Hit Rate | Hit Prob |
|----------|-------|----------|----------|
| HMNL     | -2703 | 0.581    | 0.457    |
| Ensemble | -12.8 | 0.581    | 0.418    |

Clearly we need to look at larger ensembles.
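
A sketch of what fit/extract in one step could look like, using the same hypothetical names as the earlier sketches; the point is that only the Gamma means and log_lik draws are kept per member, and the stanfit object is discarded right away:

```r
library(rstan)

ensemble <- stan_model("ensemble_member.stan")  # hypothetical file name

# Fit one member and immediately reduce it to the two pieces kept downstream:
# the posterior mean of Gamma (predictive fit) and log_lik (ensemble weights).
fit_extract_member <- function(data_k) {
  fit_k <- vb(ensemble, data = data_k, output_samples = 100)
  draws <- rstan::extract(fit_k, pars = c("Gamma", "log_lik"))
  list(
    Gamma_mean = apply(draws$Gamma, c(2, 3), mean),
    log_lik    = draws$log_lik
  )
}

member_summaries <- lapply(member_data, fit_extract_member)
```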

marcdotson commented 3 years ago

I've uploaded the set of ensemble_draws. I would take a look at what I've done in ensemble-tuning with your predictive_fit_ensemble() since I may have commented something out that was actually important.

Also, here are results for a 100-member ensemble using Gamma for predictive fit:

| Model    | LOO   | Hit Rate | Hit Prob |
|----------|-------|----------|----------|
| HMNL     | -2732 | 0.565    | 0.446    |
| Ensemble | -12.8 | 0.568    | 0.402    |

Consistent!

RogerOverNOut commented 3 years ago

Hi Guys,

Good news: when I used the sim_ana_400 data, the ensemble fit changes to

Hit rate: 0.5827, Hit prob: 0.4176

I'll tidy everything up after dinner and push the updated code.

-Roger

marcdotson commented 3 years ago

Hooray! I've updated all the tables above with @RogerOverNOut's modified function. We are good to go.

marcdotson commented 3 years ago

The parallelized form of fit/extract/average is so much faster, even in a test run with just 10 ensemble members.

Changes submitted with PR #56.
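
For context, parallelizing the one-step fit/extract could be as simple as swapping lapply() for parallel::mclapply(), as in this sketch (fit_extract_member() is the hypothetical helper from the earlier sketch, not the code merged in PR #56):

```r
library(parallel)

# Map the one-step fit/extract across ensemble members on multiple cores.
member_summaries <- mclapply(
  member_data,
  fit_extract_member,
  mc.cores = max(1, detectCores() - 1)
)
```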