marcdotson commented 5 years ago

The initial model runs are a comparison of standard covariates vs. geolocation covariates. Some concerns from @paulstat6:

Here is the final data set. I am concerned, but not really all that surprised, that we have only observed 150 of them at a dealership. Let me know if that concerns you as well. One of the issues with observed data is that you don’t always see the behavior you want to monitor. The counts are appended at the end. There are large outliers for respondents who look like they work at a dealership. We might even turn the counts into binary variables that just say whether they were observed visiting in the past 6 months or not.

And some feedback from @adam-n-smith:

Agree — it may be best to start with an indicator of whether they visited any dealership. Doesn’t look like we’ll have enough data to measure the effects of visits at particular dealerships. Is there other information that we could use in the geolocation data? For example, do we know a home/work address to measure how long their commute is? Or even looking at total miles driven in some window of time?

I'll start some initial model runs and begin documenting things in preparation for ART Forum.

marcdotson commented 5 years ago

Notes from @adam-n-smith's and my discussion.

Stuff @marcdotson is looking at, working with the design:

Can we linearize price?
Any other possible reduction to the design matrix?

@paulstat6, if you could answer these questions about geographic information:

How did you construct the variables we have from the data you gathered?
What time period does the geolocation cover?
Aren't there multiple brands at a given dealership?

paulstat6 commented 5 years ago

How did you construct the variables we have from the data you gathered? We have points of interest marked by brand. When a person visited a point of interest (either by SafeGraph measurement or by internal footprint measurement), we count it toward that brand.
What time period does the geolocation cover? Six months, from 11/1/2018 to 4/30/2019.
Aren't there multiple brands at a given dealership? Sometimes there are, in which case the visit is counted toward each brand.
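
As a rough illustration of the counting rule described above, here is a minimal sketch in Python. It is not the code used to build the data set: the record layout, the `POI_BRANDS` lookup, and `count_brand_visits` are hypothetical names chosen for the example.

```python
from collections import Counter, defaultdict

# Hypothetical lookup: point of interest -> brands sold there.
POI_BRANDS = {
    "poi_1": ["Ford"],
    "poi_2": ["Toyota", "Lexus"],  # multi-brand dealership
}

def count_brand_visits(visits, poi_brands):
    """Count visits per respondent and brand.

    `visits` is an iterable of (respondent_id, poi_id) pairs observed in the
    measurement window. A visit to a multi-brand point of interest counts
    toward each of its brands.
    """
    counts = defaultdict(Counter)
    for respondent_id, poi_id in visits:
        for brand in poi_brands.get(poi_id, []):
            counts[respondent_id][brand] += 1
    return counts

# Hypothetical visit records for two respondents.
visits = [("r1", "poi_1"), ("r1", "poi_2"), ("r1", "poi_2"), ("r2", "poi_2")]
brand_counts = count_brand_visits(visits, POI_BRANDS)
print(dict(brand_counts["r1"]))
```

Under this rule the brand-level counts for a respondent can sum to more than their number of dealership visits, which is worth keeping in mind when reading the raw counts.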

[Dynata]http://www.dynata.com/

Edward 'Paul' Johnson Director, Product Analytics

O:+1.801.379.4017 M:+1.801.380.2642 F:+1.801.379.5073

dynata.comhttp://www.dynata.com

Dynata | 3300 N Ashton Blvd, Suite 350 | Lehi, Utah, 84043, United States of America

The information contained in this e-mail message is intended for the use of the recipient(s) named above and is privileged and confidential. If you are not the intended recipient, you are formally notified that you have received this message in error and that any review, dissemination, distribution, or copying of the message is strictly prohibited. If you have received this communication in error, please notify us immediately by e-mail and delete the original message. From: Marc Dotson notifications@github.com Sent: Tuesday, May 21, 2019 10:03 AM To: marcdotson/modeling-heterogeneity modeling-heterogeneity@noreply.github.com Cc: Edward Johnson Edward.Johnson@Dynata.com; Mention mention@noreply.github.com Subject: Re: [marcdotson/modeling-heterogeneity] Initial model runs (#9)

marcdotson commented 5 years ago

@adam-n-smith @paulstat6 here are some initial results. Note that I've included the actual prices (i.e., linearized price) in order to reduce the dimensions of the design matrix, with price divided by 10,000. You can see what I've done in Code in the initial-model-runs branch.
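
To make the dimension reduction concrete, here is a minimal sketch in Python, assuming price was previously dummy-coded by level. The price levels and the function names are hypothetical; the actual design is in the initial-model-runs branch.

```python
# Hypothetical price levels in dollars; the actual design levels are not
# shown in this thread.
PRICE_LEVELS = [15000, 25000, 35000, 45000]

def dummy_code_price(price, levels=PRICE_LEVELS):
    """Dummy-code price with the first level as reference: len(levels) - 1 columns."""
    return [1.0 if price == level else 0.0 for level in levels[1:]]

def linear_price(price, scale=10_000):
    """Code price as a single continuous column, divided by `scale`."""
    return [price / scale]

alternative_price = 35000
print(dummy_code_price(alternative_price))  # one column per non-reference level
print(linear_price(alternative_price))      # a single column
```

With K price levels, the linear coding replaces K - 1 columns with one, and dividing by 10,000 keeps the price coefficient on a scale comparable to the other part-worths.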

The three models are:

Intercept: Intercept-only.
Geolocation: Intercept and a single covariate -- whether or not you've visited a dealership.
Demographics: I've included a subset, specifically, Q4.1, Q4.4:Q4.6, Q4.9, Q4.10, Q4.12r1:Q4.12r4.

Here are the initial results:

  model           lmd    dic   waic
  <chr>         <dbl>  <dbl>  <dbl>
1 Intercept    -5218. 19889. 29507.
2 Geolocation  -5187. 19539. 44644.
3 Demographics -5252. 19984. 26938.

In terms of LMD and DIC, the Geolocation model performs marginally better. WAIC is erratic (the Geolocation model has by far the worst WAIC despite the best LMD and DIC), but that may be an artifact of the computation: I'm using the loo package, which doesn't always perform well for computing the WAIC for hierarchical models.
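
For reference, here is a minimal sketch in Python of the standard WAIC definition on the deviance scale, computed from a draws-by-observations matrix of pointwise log-likelihoods. It is not the loo implementation or the code behind the table, and the toy input is made up. Which unit counts as an "observation" (a choice task or a respondent) is an assumption the thread does not settle, and it can change the result for hierarchical models.

```python
import math
from statistics import variance

def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def waic(log_lik):
    """WAIC on the deviance scale from pointwise log-likelihoods.

    `log_lik[s][i]` is the log-likelihood of observation i under posterior
    draw s. At least two draws are required.
    """
    n_obs = len(log_lik[0])
    lppd = 0.0    # log pointwise predictive density
    p_waic = 0.0  # effective number of parameters
    for i in range(n_obs):
        column = [draw[i] for draw in log_lik]
        lppd += log_mean_exp(column)
        p_waic += variance(column)  # sample variance across draws
    return -2.0 * (lppd - p_waic)

# Toy input: 3 draws, 4 observations, every choice predicted with probability 0.5.
toy_log_lik = [[math.log(0.5)] * 4 for _ in range(3)]
toy_waic = waic(toy_log_lik)
print(toy_waic)
```

Because the penalty term is a sum of per-observation variances across draws, a handful of observations with highly variable log-likelihoods can dominate the total, which is one way a WAIC value can look out of line with LMD and DIC.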

Here's what I'm doing next:

Running this same set of models longer, now with a random hold-out sample of respondents, so we can also compute predictive fit.
Running another geolocation model where we haven't collapsed the data into a single covariate (cleaning out the outliers that have visited > 5 times).
Producing some visualizations to explore how the model estimates differ, beyond model fit statistics.

adam-n-smith commented 5 years ago

@marcdotson @paulstat6 Good plan. I think it would also be good to fit a model with both demographics and geolocation.

I'm interested in seeing parameter estimates for the upper-level model to see where these geolocation variables actually pick up traction.

marcdotson commented 5 years ago

@adam-n-smith @paulstat6 here are the results from some much-longer model runs over the weekend (100k instead of 20k iterations). The fit statistics are appended to the previous table, with the addition of hit rate and hit probability as hold-out sample predictive fit statistics.
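
For readers unfamiliar with the two hold-out statistics, here is a minimal sketch in Python using their common definitions: hit rate is the share of hold-out choices where the alternative with the highest predicted probability was the one chosen, and hit probability is the average predicted probability of the chosen alternative. I am assuming these are the definitions behind the `hr` and `hp` columns; the function names and toy numbers are made up.

```python
def hit_rate(pred_probs, choices):
    """Share of tasks where the highest-probability alternative was chosen."""
    hits = 0
    for probs, chosen in zip(pred_probs, choices):
        predicted = max(range(len(probs)), key=probs.__getitem__)
        hits += predicted == chosen
    return hits / len(choices)

def hit_probability(pred_probs, choices):
    """Average predicted probability of the alternative actually chosen."""
    total = sum(probs[chosen] for probs, chosen in zip(pred_probs, choices))
    return total / len(choices)

# Toy example: four hold-out tasks with three alternatives each.
pred_probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.4, 0.35, 0.25],
    [0.1, 0.2, 0.7],
]
choices = [0, 2, 0, 2]  # index of the chosen alternative in each task

print(hit_rate(pred_probs, choices))
print(hit_probability(pred_probs, choices))
```

Hit rate only rewards getting the top alternative right, while hit probability also reflects how confident the prediction was, so the two can rank models differently.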

There are two additional models:

More Geolocation: A geolocation model where we haven't collapsed the data into a single covariate (but outliers that have visited > 5 times have been set to a max of 5 visits).
(More) Geolocation-Demographics: Models including both geolocation and demographic covariates, using either the single, collapsed geolocation covariate or the More Geolocation variant.
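
The two geolocation codings can be sketched as follows in Python. The function names and the example counts are hypothetical; only the "any visit" indicator and the cap at 5 visits come from this thread.

```python
def visited_any(visit_counts):
    """Binary covariate: 1 if any dealership visit was observed, else 0."""
    return int(any(count > 0 for count in visit_counts))

def cap_visits(visit_counts, max_visits=5):
    """Per-brand counts with outliers set to a maximum of `max_visits`."""
    return [min(count, max_visits) for count in visit_counts]

# Hypothetical per-brand visit counts for one respondent; 47 looks like
# someone who works at a dealership.
raw_counts = [0, 2, 47, 0]

print(visited_any(raw_counts))  # single collapsed covariate (Geolocation)
print(cap_visits(raw_counts))   # capped counts (More Geolocation)
```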

Here are the results:

   model                         lmd    dic   waic     hr     hp
   <chr>                       <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 Intercept                  -5218. 19889. 29507. NA     NA    
 2 Geolocation                -5187. 19539. 44644. NA     NA    
 3 Demographics               -5252. 19984. 26938. NA     NA    
 4 Intercept 100k             -4662. 17487. 23747. NA     NA    
 5 Geolocation 100k           -4661. 17628. 19841. NA     NA    
 6 Demographics 100k          -4629. 17698. 23054. NA     NA    
 7 Intercept 100k w/HO        -4270. 16061. 18296.  0.355  0.289
 8 Geolocation 100k w/HO      -4185. 15958. 19887.  0.364  0.289
 9 Demographics 100k w/HO     -4140. 15640. 25540.  0.396  0.299
10 More Geolocation 100k w/HO -4254. 15933. 20128.  0.426  0.294
11 Geo-Demos 100k w/HO        -4173. 15900. 18768.  0.4    0.302
12 More Geo-Demos 100k w/HO   -4124. 15686. 19527.  0.46   0.301

Improvement in fit for the single geolocation covariate washes out when we give the model more time to converge, but including the (cleaned) raw geolocation covariates appears to substantially improve predictive fit, especially the hit rate (0.426 for More Geolocation vs. 0.364 for Geolocation and 0.355 for Intercept). Including both geolocation and demographic covariates helps further (hit rate of 0.46 for More Geo-Demos).

Where does the geolocation actually make an impact? I'm glad you asked, it's ... [PLACEHOLDER FOR COOL VISUALIZATIONS.]

Some questions I have:

If this is an either/or, should we just worry about presenting geolocation vs. demographics/something else?
What do we want to compare the geolocation model to? We pitched this as a comparison to stated preferences, not demographics, so should we be running a model with some form of Q2.7? Are there any other covariates we should be using?
What is the appropriate measure of fit? Here I'm using actual hold-out respondents, so we're seeing how well the model can generalize to new data rather than the typical hold-out task. I argue what we're using is the right measure of fit, since it puts weight on the upper-level (which we're drawing the betas from for the hold-out respondents) instead of the lower-level model when we use hold-out tasks (where we largely ignore the upper-level and use the individual-level part-worths).
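
To illustrate why hold-out respondents put the weight on the upper level, here is a minimal sketch in Python of predicting choices for a new respondent: part-worths are drawn from the upper-level model given the respondent's covariates, and multinomial logit probabilities are averaged over those draws. This is a sketch under simplifying assumptions, not the code in the repo: it uses one fixed set of upper-level parameters rather than looping over posterior draws, assumes independent normal errors (a diagonal covariance), and all names and numbers are hypothetical.

```python
import math
import random

def mnl_probs(design, beta):
    """Multinomial logit choice probabilities for one task.

    `design` has one row of attribute codes per alternative.
    """
    utilities = [sum(x * b for x, b in zip(row, beta)) for row in design]
    top = max(utilities)  # subtract the max for numerical stability
    expu = [math.exp(u - top) for u in utilities]
    total = sum(expu)
    return [e / total for e in expu]

def draw_beta(gamma, z, sd, rng):
    """Draw part-worths from the upper level given covariates `z`.

    `gamma[j][k]` is the effect of covariate k on part-worth j. Errors are
    independent normals (an assumption; a full covariance matrix would need
    a multivariate normal draw).
    """
    means = [sum(g * zk for g, zk in zip(row, z)) for row in gamma]
    return [rng.gauss(m, s) for m, s in zip(means, sd)]

def predict_holdout(design, gamma, z, sd, n_draws=1000, seed=42):
    """Average choice probabilities over upper-level draws for a new respondent."""
    rng = random.Random(seed)
    totals = [0.0] * len(design)
    for _ in range(n_draws):
        probs = mnl_probs(design, draw_beta(gamma, z, sd, rng))
        totals = [t + p for t, p in zip(totals, probs)]
    return [t / n_draws for t in totals]

# Hypothetical setup: two part-worths (a brand dummy and price / 10,000) and
# two covariates (intercept and "visited a dealership").
gamma = [[0.5, 1.0], [-1.0, 0.0]]
sd = [1.0, 0.5]
z = [1.0, 1.0]  # hold-out respondent who visited a dealership
design = [[1.0, 3.5], [0.0, 2.5], [0.0, 0.0]]  # last row: outside option

holdout_probs = predict_holdout(design, gamma, z, sd)
print(holdout_probs)
```

Nothing in this prediction uses the hold-out respondent's own choices, so any gain in hit rate or hit probability has to come from what the covariates explain in the upper level.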

paulstat6 commented 5 years ago

Here are my thoughts:

If this is an either/or, should we just worry about presenting geolocation vs. demographics/something else? Not an either/or, in my opinion. I think that most of the applications we are dealing with would have both geolocation (passive) and demographics (stated).
What do we want to compare the geolocation model to? We pitched this as a comparison to stated preferences, not demographics, so should we be running a model with some form of Q2.7? Are there any other covariates we should be using (https://github.com/marcdotson/modeling-heterogeneity/blob/master/Data/Survey.md)? I agree. I think we should compare models that add stated brand preference, so it would be demographics + stated brand preference vs. demographics + geographic data vs. demographics + stated brand preference + geographic data. Some of the other stated preference data we could use would be:
Q1 and Q2.3 types of car – affect brand utilities?
Q2.1 and Q2.2 for used vs new car – affect the utilities on the miles?
Q2.6 for reasons for the car
Q2.7 brand – affect brand utilities?
Q2.8 price – affect price utility, maybe use it as an elbow point?
What is the appropriate measure of fit? Here I'm using actual hold-out respondents, so we're seeing how well the model can generalize to new data rather than the typical hold-out task. I argue what we're using is the right measure of fit, since it puts weight on the upper-level (which we're drawing the betas from for the hold-out respondents) instead of the lower-level model when we use hold-out tasks (where we largely ignore the upper-level and use the individual-level part-worths). I think that using actual holdout respondents works well. Eventually it will be against the automotive data from Polk, but until then actual holdout respondents work as a good proxy, in my opinion.

paulstat6 commented 5 years ago

Here is the outline for my slides:

Talk about pros and cons of passive versus stated data
- In particular mention that passive data is not clean and well ordered
Go over the two different types of geolocation passive data used
Go through three rounds of testing on geolocation passive data
- Breadth
- Raw Data Comparison
- Implementation
Talk about the competing interests of accuracy and feasibility

Should be about 10 slides. I will try to get it to you this week, but it might be Monday.

marcdotson commented 5 years ago

Here's the latest table. The preliminary result is clear: the geolocation data improves the model's predictive fit, which is our stated purpose, and we'll get a better read on it once we have the actual validation data.

# A tibble: 14 x 6
   model                                              lmd    dic   waic     hr     hp
   <chr>                                            <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 Intercept                                       -5218. 19889. 29507. NA     NA    
 2 Geolocation                                     -5187. 19539. 44644. NA     NA    
 3 Demographics                                    -5252. 19984. 26938. NA     NA    
 4 Intercept 100k                                  -4662. 17487. 23747. NA     NA    
 5 Geolocation 100k                                -4661. 17628. 19841. NA     NA    
 6 Demographics 100k                               -4629. 17698. 23054. NA     NA    
 7 Intercept 100k w/HO                             -4270. 16061. 18296.  0.355  0.289
 8 Geolocation 100k w/HO                           -4185. 15958. 19887.  0.364  0.289
 9 Demographics 100k w/HO                          -4140. 15640. 25540.  0.396  0.299
10 More Geolocation 100k w/HO                      -4254. 15933. 20128.  0.426  0.294
11 Geo-Demos 100k w/HO                             -4173. 15900. 18768.  0.4    0.302
12 More Geo-Demos 100k w/HO                        -4124. 15686. 19527.  0.46   0.301
13 Brands 100k w/HO                                -4051. 14777. 32896.  0.476  0.316
14 Geolocation, Brands, and Demographics 100k w/HO -3941. 14629. 35764.  0.504  0.322

I'm still working on figuring out the best way to visualize the differences. @adam-n-smith is also providing some general visualizations of the data. Please do get me a draft of your slides by Monday, @paulstat6, so we have some time to iterate on the presentation.

paulstat6 commented 5 years ago

You got it. Good to know it is helping.

marcdotson commented 5 years ago

Following up with comments from the presentation at ART Forum:

Recode the geolocation covariates to account for people who visit many different dealerships (i.e., shopping around).
Consider other ways we might use geolocation as a separate model of search.
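
One possible recode for the "shopping around" comment is sketched below in Python: count how many distinct brands a respondent was observed visiting and flag respondents above a threshold. The function names, the example counts, and the threshold of three brands are all assumptions for illustration, not something decided in this thread.

```python
def distinct_brands_visited(brand_counts):
    """Number of different brands with at least one observed visit."""
    return sum(1 for count in brand_counts.values() if count > 0)

def shopping_around(brand_counts, min_brands=3):
    """Hypothetical indicator: 1 if visits span at least `min_brands` brands."""
    return int(distinct_brands_visited(brand_counts) >= min_brands)

# Hypothetical per-brand visit counts for one respondent.
respondent = {"Ford": 2, "Toyota": 1, "Honda": 1, "Kia": 0}

print(distinct_brands_visited(respondent))
print(shopping_around(respondent))
```

Since a visit to a multi-brand dealership counts toward each of its brands, a single such visit would register as several brands under this recode, so it may need to be computed from dealerships rather than brand counts.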

marcdotson commented 3 years ago

Merged initial model results with PR #12.

marcdotson / modeling-heterogeneity

Initial model runs #9
