marcdotson / modeling-heterogeneity

Exploring covariates and models for preference heterogeneity.
MIT License

Initial model runs #9

Open marcdotson opened 5 years ago

marcdotson commented 5 years ago

The initial model runs are a comparison of standard covariates vs. geolocation covariates. Some concerns from @paulstat6:

Here is the final data set. I am concerned, but not really all that surprised, that we have only observed 150 of them at a dealership. Let me know if that is concerning to you as well. That is one of the issues with observed data: you don’t always see the behavior you want to monitor. The counts are appended at the end. There are large outliers for those who look like they work at a dealership. We might even turn the counts into binary variables that just say whether they were observed visiting in the past 6 months or not.

And some feedback from @adam-n-smith:

Agree — it may be best to start with an indicator of whether they visited any dealership. Doesn’t look like we’ll have enough data to measure the effects of visits at particular dealerships. Is there other information that we could use in the geolocation data? For example, do we know a home/work address to measure how long their commute is? Or even looking at total miles driven in some window of time?
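
A minimal sketch of the visit indicator discussed above, assuming the visit counts arrive as one row per respondent with one count per dealership. The function name and the example counts are hypothetical and are not taken from the project data.

```python
from typing import List, Sequence


def visited_any_dealership(visit_counts: Sequence[Sequence[int]]) -> List[int]:
    """Return 1 if a respondent was observed at any dealership, else 0.

    Each inner sequence holds one respondent's visit counts per dealership
    over the observation window (hypothetical layout).
    """
    return [int(any(count > 0 for count in row)) for row in visit_counts]


# Illustrative counts only: three respondents, three dealerships.
counts = [
    [0, 0, 0],    # never observed at a dealership
    [2, 0, 1],    # occasional visitor
    [0, 120, 0],  # large outlier, possibly a dealership employee
]
indicator = visited_any_dealership(counts)
print(indicator)  # [0, 1, 1]
```

Collapsing to an indicator keeps the outlier from dominating the covariate, at the cost of discarding how often someone visited.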

I'll start some initial model runs and document things in preparation for ART Forum.

marcdotson commented 5 years ago

Notes from @adam-n-smith's and my discussion.

Stuff @marcdotson is looking at, working with the design:

@paulstat6, if you could answer these questions about geographic information:

paulstat6 commented 5 years ago

Dynata (http://www.dynata.com/)

Edward 'Paul' Johnson, Director, Product Analytics

O:+1.801.379.4017 M:+1.801.380.2642 F:+1.801.379.5073

Dynata | 3300 N Ashton Blvd, Suite 350 | Lehi, Utah, 84043, United States of America

The information contained in this e-mail message is intended for the use of the recipient(s) named above and is privileged and confidential. If you are not the intended recipient, you are formally notified that you have received this message in error and that any review, dissemination, distribution, or copying of the message is strictly prohibited. If you have received this communication in error, please notify us immediately by e-mail and delete the original message.

From: Marc Dotson notifications@github.com
Sent: Tuesday, May 21, 2019 10:03 AM
To: marcdotson/modeling-heterogeneity modeling-heterogeneity@noreply.github.com
Cc: Edward Johnson Edward.Johnson@Dynata.com; Mention mention@noreply.github.com
Subject: Re: [marcdotson/modeling-heterogeneity] Initial model runs (#9)


marcdotson commented 5 years ago

@adam-n-smith @paulstat6 here are some initial results. Note that I've included the actual prices in order to reduce the dimensions of the design matrix. I've scaled down the price by 10,000. You can see what I've done in Code in the initial-model-runs branch.
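
To illustrate why entering actual prices shrinks the design matrix, here is a rough sketch comparing dummy-coded price levels with a single scaled price column. It assumes "scaled down by 10,000" means dividing by 10,000. The price levels and helper names are hypothetical, and this is not the code in the initial-model-runs branch.

```python
from typing import List, Sequence

PRICE_SCALE = 10_000  # assumed divisor, based on the description in the thread


def dummy_code(prices: Sequence[int], levels: Sequence[int]) -> List[List[int]]:
    """Code price as K-1 indicator columns, with the first level as reference."""
    return [[int(p == lvl) for lvl in levels[1:]] for p in prices]


def scale_price(prices: Sequence[int], scale: int = PRICE_SCALE) -> List[List[float]]:
    """Code price as one continuous column on a scale closer to the other attributes."""
    return [[p / scale] for p in prices]


# Hypothetical price levels for four alternatives.
prices = [20_000, 25_000, 30_000, 35_000]
levels = sorted(set(prices))

print(len(dummy_code(prices, levels)[0]))  # 3 columns for 4 price levels
print(scale_price(prices))                 # [[2.0], [2.5], [3.0], [3.5]]
```

The continuous coding imposes a linear price effect, which is the trade-off for estimating one price coefficient instead of several.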

The three models are:

Here are the initial results:

  model           lmd    dic   waic
  <chr>         <dbl>  <dbl>  <dbl>
1 Intercept    -5218. 19889. 29507.
2 Geolocation  -5187. 19539. 44644.
3 Demographics -5252. 19984. 26938.

In terms of LMD and DIC, the Geolocation model is performing marginally better. WAIC is wacky, but that may be an artifact of how it's computed (I'm using the loo package, which doesn't always perform well when computing WAIC for hierarchical models).
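
For reference, here is a sketch of the textbook WAIC calculation from a matrix of pointwise log-likelihood draws. It is not the loo package's implementation, and the function names are hypothetical. What counts as one "observation" (a single choice task or a whole respondent) is left to the caller, and that choice can matter a lot in a hierarchical model.

```python
import math
from statistics import variance
from typing import List, Sequence


def log_mean_exp(values: Sequence[float]) -> float:
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def waic(log_lik: Sequence[Sequence[float]]) -> float:
    """WAIC on the deviance scale.

    log_lik[s][i] is the log-likelihood of observation i at posterior draw s.
    At least two draws are needed for the variance term.
    """
    n_obs = len(log_lik[0])
    lppd = 0.0    # log pointwise predictive density
    p_waic = 0.0  # effective number of parameters
    for i in range(n_obs):
        column: List[float] = [draw[i] for draw in log_lik]
        lppd += log_mean_exp(column)
        p_waic += variance(column)  # sample variance across draws
    return -2.0 * (lppd - p_waic)
```

A very large `p_waic` for some observations is one common sign that the estimate is unreliable, which would be worth checking before reading much into the WAIC column.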

Here's what I'm doing next:

adam-n-smith commented 5 years ago

@marcdotson @paulstat6 Good plan. I think it would also be good to fit a model with both demographics and geolocation.

I'm interested in seeing parameter estimates for the upper-level model to see where these geolocation variables actually pick up traction.

marcdotson commented 5 years ago

@adam-n-smith @paulstat6 here are the results from some much-longer model runs over the weekend (100k instead of 20k iterations). The fit statistics are appended to the previous table, with the addition of hit rate and hit probability as hold-out sample predictive fit statistics.

There are two additional models:

Here are the results:

   model                         lmd    dic   waic     hr     hp
   <chr>                       <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 Intercept                  -5218. 19889. 29507. NA     NA    
 2 Geolocation                -5187. 19539. 44644. NA     NA    
 3 Demographics               -5252. 19984. 26938. NA     NA    
 4 Intercept 100k             -4662. 17487. 23747. NA     NA    
 5 Geolocation 100k           -4661. 17628. 19841. NA     NA    
 6 Demographics 100k          -4629. 17698. 23054. NA     NA    
 7 Intercept 100k w/HO        -4270. 16061. 18296.  0.355  0.289
 8 Geolocation 100k w/HO      -4185. 15958. 19887.  0.364  0.289
 9 Demographics 100k w/HO     -4140. 15640. 25540.  0.396  0.299
10 More Geolocation 100k w/HO -4254. 15933. 20128.  0.426  0.294
11 Geo-Demos 100k w/HO        -4173. 15900. 18768.  0.4    0.302
12 More Geo-Demos 100k w/HO   -4124. 15686. 19527.  0.46   0.301

The improvement in fit for the single geolocation covariate washes out when we give the model more time to converge, but including the (cleaned) raw geolocation covariates appears to give a big improvement in predictive fit, especially the hit rate. This is compounded by the benefit of including both geolocation and demographic covariates.
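
As a point of reference for the hr and hp columns, here is a sketch of one common way to compute hit rate and hit probability from predicted choice probabilities. The definitions, function names, and numbers are assumptions for illustration and may differ from what the project code does.

```python
from typing import Sequence


def hit_rate(probs: Sequence[Sequence[float]], choices: Sequence[int]) -> float:
    """Share of tasks where the most probable alternative is the chosen one."""
    hits = 0
    for p, chosen in zip(probs, choices):
        # Ties go to the lowest-indexed alternative.
        predicted = max(range(len(p)), key=p.__getitem__)
        hits += int(predicted == chosen)
    return hits / len(choices)


def hit_probability(probs: Sequence[Sequence[float]], choices: Sequence[int]) -> float:
    """Average predicted probability assigned to the chosen alternative."""
    return sum(p[chosen] for p, chosen in zip(probs, choices)) / len(choices)


# Illustrative predictions for four hold-out tasks with three alternatives each.
probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.3, 0.3, 0.4],
    [0.1, 0.2, 0.7],
]
choices = [0, 2, 2, 1]  # index of the alternative actually chosen

print(hit_rate(probs, choices))                   # 0.5
print(round(hit_probability(probs, choices), 3))  # 0.375
```

Hit rate only rewards getting the top alternative right, while hit probability also reflects how confident the model was, which may be why the two columns move differently across models.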

Where does the geolocation actually make an impact? I'm glad you asked, it's ... [PLACEHOLDER FOR COOL VISUALIZATIONS.]

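Some questions I have:

* If this is an either/or, should we just worry about presenting about geolocation vs. demographics/something else?
* What do we want to compare the geolocation model to? We pitched this as a comparison to stated preferences, not demographics, so should we be running a model with some form of Q2.7? Are there any other covariates we should be using?
* What is the appropriate measure of fit? Here I'm using actual hold-out respondents, so we're seeing how well the model can generalize to new data rather than the typical hold-out task. I argue what we're using is the right measure of fit, since it puts weight on the upper-level (which we're drawing the betas from for the hold-out respondents) instead of the lower-level model when we use hold-out tasks (where we largely ignore the upper-level and use the individual-level part-worths).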
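
To make the hold-out-respondent idea concrete, here is a rough sketch of predicting choices for a new respondent by drawing part-worths from the upper-level model given that respondent's covariates. It assumes a multinomial logit lower level and, for simplicity, a diagonal upper-level covariance and fixed point estimates of the upper-level parameters. A full treatment would also average over their posterior draws. All names and numbers are hypothetical.

```python
import math
import random
from typing import List, Sequence


def draw_beta(z: Sequence[float], gamma: Sequence[Sequence[float]],
              sd: Sequence[float], rng: random.Random) -> List[float]:
    """Draw part-worths: beta_k ~ Normal(sum_j z[j] * gamma[j][k], sd[k])."""
    n_params = len(gamma[0])
    means = [sum(z[j] * gamma[j][k] for j in range(len(z))) for k in range(n_params)]
    return [rng.gauss(means[k], sd[k]) for k in range(n_params)]


def mnl_probs(design: Sequence[Sequence[float]], beta: Sequence[float]) -> List[float]:
    """Multinomial logit choice probabilities for one task."""
    utils = [sum(x * b for x, b in zip(row, beta)) for row in design]
    m = max(utils)  # subtract the max for numerical stability
    expu = [math.exp(u - m) for u in utils]
    total = sum(expu)
    return [e / total for e in expu]


def holdout_choice_probs(z, design, gamma, sd, n_draws: int = 500, seed: int = 1) -> List[float]:
    """Average choice probabilities over upper-level draws of beta."""
    rng = random.Random(seed)
    avg = [0.0] * len(design)
    for _ in range(n_draws):
        probs = mnl_probs(design, draw_beta(z, gamma, sd, rng))
        avg = [a + p / n_draws for a, p in zip(avg, probs)]
    return avg


# Hypothetical setup: covariates are (intercept, visited-dealership indicator),
# parameters are (scaled price, brand dummy).
gamma = [[-1.0, 0.5], [0.8, 0.0]]
design = [[2.0, 1.0], [2.5, 0.0], [0.0, 0.0]]  # last row is an outside option
probs_new = holdout_choice_probs([1.0, 1.0], design, gamma, sd=[0.5, 0.5])
```

Because the part-worths come only from the covariates, any gain in fit for these respondents has to come through the upper-level model, which is the argument for this measure.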

paulstat6 commented 5 years ago

Here are my thoughts:


From: Marc Dotson notifications@github.com
Sent: Monday, June 3, 2019 4:39 PM
Subject: Re: [marcdotson/modeling-heterogeneity] Initial model runs (#9)

Some questions I have:

* If this is an either/or, should we just worry about presenting about geolocation vs. demographics/something else?
* What do we want to compare the geolocation model to? We pitched this as a comparison to stated preferences, not demographics, so should we be running a model with some form of Q2.7? Are there any other covariates we should be using?
* What is the appropriate measure of fit? Here I'm using actual hold-out respondents, so we're seeing how well the model can generalize to new data rather than the typical hold-out task. I argue what we're using is the right measure of fit, since it puts weight on the upper-level (which we're drawing the betas from for the hold-out respondents) instead of the lower-level model when we use hold-out tasks (where we largely ignore the upper-level and use the individual-level part-worths).

paulstat6 commented 5 years ago

Here is the outline for my slides:

Should be about 10 slides. I will try to get it to you this week, but it might be Monday.

marcdotson commented 5 years ago

Here's the latest table. The preliminary results clearly show that the geolocation data improves the model's predictive ability, which is our stated purpose. We'll get a better read on this once we have the actual validation data.

# A tibble: 14 x 6
   model                                              lmd    dic   waic     hr     hp
   <chr>                                            <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 Intercept                                       -5218. 19889. 29507. NA     NA    
 2 Geolocation                                     -5187. 19539. 44644. NA     NA    
 3 Demographics                                    -5252. 19984. 26938. NA     NA    
 4 Intercept 100k                                  -4662. 17487. 23747. NA     NA    
 5 Geolocation 100k                                -4661. 17628. 19841. NA     NA    
 6 Demographics 100k                               -4629. 17698. 23054. NA     NA    
 7 Intercept 100k w/HO                             -4270. 16061. 18296.  0.355  0.289
 8 Geolocation 100k w/HO                           -4185. 15958. 19887.  0.364  0.289
 9 Demographics 100k w/HO                          -4140. 15640. 25540.  0.396  0.299
10 More Geolocation 100k w/HO                      -4254. 15933. 20128.  0.426  0.294
11 Geo-Demos 100k w/HO                             -4173. 15900. 18768.  0.4    0.302
12 More Geo-Demos 100k w/HO                        -4124. 15686. 19527.  0.46   0.301
13 Brands 100k w/HO                                -4051. 14777. 32896.  0.476  0.316
14 Geolocation, Brands, and Demographics 100k w/HO -3941. 14629. 35764.  0.504  0.322
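
One quick way to read the hold-out rows of the table is as gains over the intercept-only model. The sketch below copies the hr and hp values from rows 7 to 14 and ranks the models by hit-rate gain. The dictionary layout and variable names are just for illustration.

```python
# Hold-out fit statistics (hit rate, hit probability) from rows 7-14 of the table.
holdout_fits = {
    "Intercept": (0.355, 0.289),
    "Geolocation": (0.364, 0.289),
    "Demographics": (0.396, 0.299),
    "More Geolocation": (0.426, 0.294),
    "Geo-Demos": (0.400, 0.302),
    "More Geo-Demos": (0.460, 0.301),
    "Brands": (0.476, 0.316),
    "Geolocation, Brands, and Demographics": (0.504, 0.322),
}

baseline_hr, baseline_hp = holdout_fits["Intercept"]

# Gain of each model over the intercept-only baseline, rounded to table precision.
gains = {
    model: (round(hr - baseline_hr, 3), round(hp - baseline_hp, 3))
    for model, (hr, hp) in holdout_fits.items()
}

# Models ordered from largest to smallest hit-rate gain.
ranked = sorted(gains, key=lambda model: gains[model][0], reverse=True)
for model in ranked:
    hr_gain, hp_gain = gains[model]
    print(f"{model:<40} hr +{hr_gain:.3f}  hp +{hp_gain:.3f}")
```

These are differences between single runs, so they say nothing about uncertainty in the fit statistics themselves.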

I'm still working on figuring out the best way to visualize the differences. @adam-n-smith is also providing some general visualizations of the data. Please do get me a draft of your slides by Monday, @paulstat6, so we have some time to iterate on the presentation.

paulstat6 commented 5 years ago

You got it. Good to know it is helping.

marcdotson commented 5 years ago

Following up with comments from the presentation at ART Forum:

marcdotson commented 3 years ago

Merged initial model results with PR #12.