Open marcdotson opened 5 years ago
Notes from @adam-n-smith's and my discussion.
Stuff @marcdotson is looking at, working with the design:
@paulstat6, if you could answer these questions about geographic information:
> How did you construct the variables we have from the data you gathered?

We have points of interest marked by brand. When a person visited a point of interest (measured either by SafeGraph or by our internal footprint), we count the visit towards that brand.

> What time period does the geolocation cover?

Six months, from 11/1/2018 to 4/30/2019.

> Aren't there multiple brands at a given dealership?

Sometimes there are, in which case a visit is counted towards each brand.
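To make the counting rule concrete, here is a minimal sketch of how visit counts per brand could be built from a visit log. The data layout, the names `poi_brands`, `visits`, and `brand_visit_counts`, and the example records are all hypothetical; the thread does not show the actual pipeline. The only rules taken from the answers above are the 11/1/2018 to 4/30/2019 window and counting a multi-brand dealership visit towards each brand.

```python
from collections import Counter

# Hypothetical lookup: point of interest (POI) id -> brands sold there.
poi_brands = {
    "poi_1": ["Ford"],
    "poi_2": ["Toyota", "Lexus"],  # multi-brand dealership
}

# Hypothetical visit log: (respondent id, POI id, ISO date).
visits = [
    ("r1", "poi_1", "2018-11-15"),
    ("r1", "poi_2", "2019-02-03"),
    ("r2", "poi_2", "2019-05-01"),  # outside the window, so dropped
]

# Window stated in the thread; ISO date strings compare correctly as text.
START, END = "2018-11-01", "2019-04-30"


def brand_visit_counts(visits, poi_brands, start=START, end=END):
    """Count visits per respondent and brand inside the date window."""
    counts = {}
    for respondent, poi, date in visits:
        if not (start <= date <= end):
            continue
        # A visit to a multi-brand POI counts once towards each brand.
        for brand in poi_brands.get(poi, []):
            counts.setdefault(respondent, Counter())[brand] += 1
    return counts


counts = brand_visit_counts(visits, poi_brands)
print(counts)
```

Each respondent's counter can then be spread into one covariate column per brand for the upper-level model.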
Edward 'Paul' Johnson
Director, Product Analytics, Dynata (http://www.dynata.com/)
O: +1.801.379.4017 | M: +1.801.380.2642 | F: +1.801.379.5073
3300 N Ashton Blvd, Suite 350 | Lehi, Utah, 84043, United States of America
@adam-n-smith @paulstat6 here are some initial results. Note that I've included the actual prices in order to reduce the dimensions of the design matrix, and I've scaled price down by a factor of 10,000. You can see what I've done in Code in the initial-model-runs branch.
The three models are:
Here are the initial results:
model lmd dic waic
<chr> <dbl> <dbl> <dbl>
1 Intercept -5218. 19889. 29507.
2 Geolocation -5187. 19539. 44644.
3 Demographics -5252. 19984. 26938.
In terms of LMD and DIC, the Geolocation model is performing marginally better. WAIC is wacky, but that may be an artifact of how it's computed: I'm using the loo package, which doesn't always perform well when computing the WAIC for hierarchical models.
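For reference, here is a minimal sketch of the standard WAIC calculation from a matrix of pointwise log-likelihoods (rows are posterior draws, columns are observations). It is not the code behind the table; the function names and the toy `log_lik` values are made up for illustration. The penalty term is a sum of per-observation variances across draws, which is one reason WAIC can become unstable when individual observations have highly variable log-likelihoods.

```python
import math
from statistics import variance


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def waic(log_lik):
    """WAIC on the deviance scale: -2 * (lppd - p_waic).

    log_lik[s][i] is the log-likelihood of observation i under draw s.
    """
    n_obs = len(log_lik[0])
    lppd = 0.0    # log pointwise predictive density
    p_waic = 0.0  # effective number of parameters
    for i in range(n_obs):
        column = [draw[i] for draw in log_lik]
        lppd += log_mean_exp(column)
        p_waic += variance(column)  # sample variance across draws
    return -2.0 * (lppd - p_waic)


# Toy values only: three draws, three observations.
log_lik = [
    [-1.0, -2.0, -0.5],
    [-1.2, -1.8, -0.7],
    [-0.9, -2.1, -0.4],
]
print(round(waic(log_lik), 3))
```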
Here's what I'm doing next:
@marcdotson @paulstat6 Good plan. I think it would also be good to fit a model with both demographics and geolocation.
I'm interested in seeing parameter estimates for the upper-level model to see where these geolocation variables actually pick up traction.
@adam-n-smith @paulstat6 here are the results from some much-longer model runs over the weekend (100k instead of 20k iterations). The fit statistics are appended to the previous table, with the addition of hit rate and hit probability as hold-out sample predictive fit statistics.
There are two additional models:
Here are the results:
model lmd dic waic hr hp
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Intercept -5218. 19889. 29507. NA NA
2 Geolocation -5187. 19539. 44644. NA NA
3 Demographics -5252. 19984. 26938. NA NA
4 Intercept 100k -4662. 17487. 23747. NA NA
5 Geolocation 100k -4661. 17628. 19841. NA NA
6 Demographics 100k -4629. 17698. 23054. NA NA
7 Intercept 100k w/HO -4270. 16061. 18296. 0.355 0.289
8 Geolocation 100k w/HO -4185. 15958. 19887. 0.364 0.289
9 Demographics 100k w/HO -4140. 15640. 25540. 0.396 0.299
10 More Geolocation 100k w/HO -4254. 15933. 20128. 0.426 0.294
11 Geo-Demos 100k w/HO -4173. 15900. 18768. 0.4 0.302
12 More Geo-Demos 100k w/HO -4124. 15686. 19527. 0.46 0.301
The improvement in fit for the single geolocation covariate washes out when we give the model more time to converge, but including the (cleaned) raw geolocation covariates appears to substantially improve predictive fit, especially the hit rate. This is compounded by the benefit of including both geolocation and demographic covariates.
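For anyone unfamiliar with the two hold-out statistics, here is a minimal sketch using their usual definitions in choice modeling: hit rate is the share of hold-out choices where the alternative with the highest predicted probability was the one chosen, and hit probability is the average predicted probability of the chosen alternative. The thread doesn't spell out the exact formulas used, so treat these definitions, the function names, and the toy numbers as assumptions.

```python
def hit_rate(probs, choices):
    """Share of tasks where the most probable alternative was chosen."""
    hits = 0
    for p, y in zip(probs, choices):
        predicted = max(range(len(p)), key=p.__getitem__)  # argmax
        if predicted == y:
            hits += 1
    return hits / len(choices)


def hit_probability(probs, choices):
    """Mean predicted probability assigned to the chosen alternative."""
    return sum(p[y] for p, y in zip(probs, choices)) / len(choices)


# Toy example: predicted choice probabilities for four hold-out tasks
# with three alternatives each, plus the index of the chosen alternative.
probs = [
    [0.50, 0.30, 0.20],
    [0.20, 0.20, 0.60],
    [0.40, 0.35, 0.25],
    [0.10, 0.70, 0.20],
]
choices = [0, 2, 1, 1]

print(hit_rate(probs, choices), hit_probability(probs, choices))
```

Hit rate only rewards getting the top alternative right, while hit probability also reflects how confident the prediction was, which may be why the two columns in the table don't always move together.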
Where does the geolocation actually make an impact? I'm glad you asked, it's ... [PLACEHOLDER FOR COOL VISUALIZATIONS.]
Some questions I have:
Here are my thoughts:
> If this is an either/or, should we just worry about presenting geolocation vs. demographics/something else?

Not an either/or, in my opinion. I think that most of the applications we are dealing with would have both geolocation (passive) and demographics (stated).

> What do we want to compare the geolocation model to? We pitched this as a comparison to stated preferences, not demographics, so should we be running a model with some form of Q2.7? Are there any other covariates we should be using (https://github.com/marcdotson/modeling-heterogeneity/blob/master/Data/Survey.md)?

I agree. I think we should be comparing against adding stated brand preference to the model. So it would be demographics + stated brand preference vs. demographics + geographic data vs. demographics + stated brand preference + geographic data. Some of the other stated preference data we could use would be:

- Q1 and Q2.3, types of car: affect brand utilities?
- Q2.1 and Q2.2, used vs. new car: affect the utilities on the miles?
- Q2.6, reasons for the car
- Q2.7, brand: affect brand utilities?
- Q2.8, price: affect price utility, maybe use it as an elbow point?

> What is the appropriate measure of fit? Here I'm using actual hold-out respondents, so we're seeing how well the model can generalize to new data rather than the typical hold-out task. I argue what we're using is the right measure of fit, since it puts weight on the upper-level model (which we're drawing the betas from for the hold-out respondents) instead of the lower-level model, as happens when we use hold-out tasks (where we largely ignore the upper level and use the individual-level part-worths).

I think that using actual hold-out respondents works well. Eventually the comparison will be against the automotive data from Polk, but until then actual hold-out respondents work as a good proxy, in my opinion.
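To illustrate the hold-out respondent idea, here is a minimal sketch of predicting choice probabilities for a new respondent from the upper-level model alone: draw betas from a normal distribution centered on the covariate-based mean, push each draw through a multinomial logit, and average. This is a simplification and not the project's code. It assumes a diagonal covariance, uses a single fixed `Gamma` rather than integrating over posterior draws, and every name and number below is hypothetical.

```python
import math
import random


def mnl_probs(X, beta):
    """Multinomial logit probabilities for one choice task."""
    utils = [sum(x * b for x, b in zip(row, beta)) for row in X]
    m = max(utils)  # subtract the max for numerical stability
    expu = [math.exp(u - m) for u in utils]
    total = sum(expu)
    return [e / total for e in expu]


def upper_level_mean(z, Gamma):
    """Mean of beta given covariates z: mean[k] = sum_j z[j] * Gamma[j][k]."""
    return [sum(z[j] * Gamma[j][k] for j in range(len(z)))
            for k in range(len(Gamma[0]))]


def holdout_choice_probs(X, z, Gamma, sd, n_draws=2000, seed=42):
    """Average MNL probabilities over betas drawn from the upper level.

    Assumes independent normal part-worths (diagonal covariance, with
    standard deviations sd); this is an assumption made for brevity.
    """
    rng = random.Random(seed)
    mean = upper_level_mean(z, Gamma)
    avg = [0.0] * len(X)
    for _ in range(n_draws):
        beta = [rng.gauss(mu, s) for mu, s in zip(mean, sd)]
        p = mnl_probs(X, beta)
        avg = [a + pi / n_draws for a, pi in zip(avg, p)]
    return avg


# Hypothetical task: three alternatives, attributes = (brand dummy, price / 10,000).
X = [[1.0, 2.5], [0.0, 2.0], [0.0, 3.0]]
z = [1.0, 3.0]                      # intercept and a brand visit count
Gamma = [[0.2, -1.0], [0.3, 0.0]]   # covariates x part-worths
sd = [1.0, 0.5]

print(holdout_choice_probs(X, z, Gamma, sd))
```

Because the hold-out respondent contributes no choices to estimation, any predictive gain has to come through the covariates in `z`, which is why this measure puts the weight on the upper-level model.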
Here is the outline for my slides:
It should be about 10 slides. I will try to get it to you this week, but it might be Monday.
Here's the latest table. The preliminary result is clear: the geolocation data improves the model's predictions, which is our stated purpose. We'll get a better read on this once we have the actual validation data.
# A tibble: 14 x 6
model lmd dic waic hr hp
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Intercept -5218. 19889. 29507. NA NA
2 Geolocation -5187. 19539. 44644. NA NA
3 Demographics -5252. 19984. 26938. NA NA
4 Intercept 100k -4662. 17487. 23747. NA NA
5 Geolocation 100k -4661. 17628. 19841. NA NA
6 Demographics 100k -4629. 17698. 23054. NA NA
7 Intercept 100k w/HO -4270. 16061. 18296. 0.355 0.289
8 Geolocation 100k w/HO -4185. 15958. 19887. 0.364 0.289
9 Demographics 100k w/HO -4140. 15640. 25540. 0.396 0.299
10 More Geolocation 100k w/HO -4254. 15933. 20128. 0.426 0.294
11 Geo-Demos 100k w/HO -4173. 15900. 18768. 0.4 0.302
12 More Geo-Demos 100k w/HO -4124. 15686. 19527. 0.46 0.301
13 Brands 100k w/HO -4051. 14777. 32896. 0.476 0.316
14 Geolocation, Brands, and Demographics 100k w/HO -3941. 14629. 35764. 0.504 0.322
I'm still working on figuring out the best way to visualize the differences. @adam-n-smith is also providing some general visualizations of the data. Please do get me a draft of your slides by Monday, @paulstat6, so we have some time to iterate on the presentation.
You got it. Good to know it is helping.
Following up with comments from the presentation at ART Forum:
Merged initial model results with PR #12.
The initial model runs are a comparison of standard covariates vs. geolocation covariates. Some concerns from @paulstat6:
And some feedback from @adam-n-smith:
I'll start some initial model runs and document things in preparation for ART Forum.