Validating models using car ownership and recontact survey data

marcdotson commented 3 years ago

Note that the validation data is car ownership data, not car purchase data.

How old can a vehicle be (across all vehicles owned) and still count as a "new" purchase?
Do we want to filter on visits as well as ownership records?
How do we treat missing data? It isn't missing at random.
Should we recontact respondents to get stated vs. observed data about car ownership and/or purchase?

marcdotson commented 3 years ago

A recommended reference from @adam-n-smith.

marcdotson commented 3 years ago

Let's get clear on the sources of data:

Initial survey.
Appended and cleaned geographic information for each dealership.
Ownership data (not purchase data) pulled a year and a half after the survey (columns Vehicle 1 Make through Vehicle 8 Year).
Recontact survey (starting with REC_Q1).

@cwjohnson1 and @z-wix, let's do some exploratory data analysis of the ownership and recontact survey. Use the model-validation branch and 02_exploratory-data-analysis.R script.

cwjohnson1 commented 3 years ago

I just created two pull requests for some of the code I've been working on. Let me know your thoughts and questions, as well as the plan for next steps. Thanks!

cwjohnson1 commented 3 years ago

I know we mentioned wanting to compare how participants said they would purchase vs. how they actually purchased. I can do some analyses on that next. Are there any other thoughts?

marcdotson commented 3 years ago

@cwjohnson1 no need to create a separate branch -- please just work in model-validation. I've merged your changes back into this branch. I'm digging into the changes now and will provide updates in our weekly meeting.

marcdotson commented 3 years ago

Here's a sketch of how to use the ownership and recontact survey data as a validation task.

Combine the recontact survey data and ownership data to produce a validation choice composed of car brand and year.
Complete a validation task by appending an outside option so we can get a hit rate based on predicting the choice or the outside option.
For the subset of total initial respondents for whom we have this validation task, use their betas (or draw their betas, if they are a hold-out respondent) for brand and year to compute predictive fit.
This will result in two sets of validation predictive fit metrics: predictive fit for “in-sample” respondents and predictive fit for hold-out respondents.
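The hit-rate step in the sketch above could look roughly like the following. This is a minimal Python illustration, not the project's R code: the part-worth values, the dictionary layout, and fixing the outside option's utility at 0 are all assumptions made for the example.

```python
# Hypothetical validation tasks: one observed brand/year vehicle per
# respondent, plus an outside option appended as the last alternative.
# All names and numbers here are made up for illustration.

def utility(betas, alternative):
    """Sum the brand and year part-worths for one alternative."""
    brand, year = alternative
    return betas["brand"].get(brand, 0.0) + betas["year"].get(year, 0.0)

def predict_choice(betas, alternatives):
    """Return the index of the highest-utility alternative.

    The outside option is appended last with utility fixed at 0
    (an assumed normalization, not something stated in the thread).
    """
    utilities = [utility(betas, alt) for alt in alternatives] + [0.0]
    return max(range(len(utilities)), key=utilities.__getitem__)

def hit_rate(respondents):
    """Share of respondents whose predicted choice matches the observed one."""
    hits = sum(
        predict_choice(r["betas"], r["alternatives"]) == r["chosen"]
        for r in respondents
    )
    return hits / len(respondents)

respondents = [
    {   # vehicle utility is positive, so the vehicle is predicted (a hit)
        "betas": {"brand": {"Toyota": 0.8}, "year": {2019: 0.4}},
        "alternatives": [("Toyota", 2019)],
        "chosen": 0,
    },
    {   # vehicle utility is negative, so the outside option is predicted (a miss)
        "betas": {"brand": {"Ford": -0.5}, "year": {2015: -0.4}},
        "alternatives": [("Ford", 2015)],
        "chosen": 0,
    },
]

rate = hit_rate(respondents)
print(rate)
```

Running the same function separately on the “in-sample” and hold-out subsets would give the two sets of fit metrics.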

There is a lot to figure out here: matching respondents to their in-sample and hold-out data, recoding open-ends and checking for spelling mistakes, and conditioning on just the brand and year attributes.
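For the open-end recoding, one possible approach is fuzzy matching against a list of known brands. The Python sketch below uses the standard library's difflib; the brand list, the 0.8 cutoff, and the function name are hypothetical, and anything left unmatched would still need manual review.

```python
import difflib

# Hypothetical list of valid brand labels; the real list would come
# from the brand attribute levels used in the survey.
KNOWN_BRANDS = ["Toyota", "Honda", "Ford", "Chevrolet", "Subaru"]

def recode_brand(raw, known=KNOWN_BRANDS, cutoff=0.8):
    """Map a free-text brand response to a known brand, or None.

    Exact matches (ignoring case and surrounding whitespace) are
    returned directly; otherwise the closest known brand above the
    similarity cutoff is used. The cutoff is an assumed value.
    """
    cleaned = raw.strip().lower()
    lookup = {brand.lower(): brand for brand in known}
    if cleaned in lookup:
        return lookup[cleaned]
    matches = difflib.get_close_matches(cleaned, list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else None

print(recode_brand(" toyota "))
print(recode_brand("Chevrolett"))
print(recode_brand("xyz"))
```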

cwjohnson1 commented 3 years ago

I just realized that I committed the plots but never pushed them. Sorry about that. You should be able to find them in Sawtooth-2021.Rmd now.

marcdotson commented 3 years ago

@cwjohnson1 please don't create new branches. You can add this all to model-validation.

marcdotson commented 3 years ago

Notes on computing predictive fit using the validation task:

I decided to use just the first vehicle in the ownership data for the validation task. It’s possible that we could use more than one vehicle, but we don’t know anything about the order in which these are recorded and it would be difficult to consider how to construct an apples-to-apples comparison when most of the ownership data, and all of the recontact survey, only have one vehicle.
There are 18 respondents who appear in both the recontact survey and the ownership data. For 17 of those 18, the recontact survey indicates a newer purchase than the ownership data, so when combining the two sources I use a single record per respondent (either the recontact data or the ownership data), not both.
Using both the recontact survey and ownership data, there are 16 hold-out respondents and 156 “in-sample” respondents.
Note again that the validation task consists of a brand/year vehicle choice, compared to an outside good. This means I use a subset of the beta draws for the “in-sample” respondents and a subset of the upper-level coefficient matrix when drawing betas. Because a subset of the upper-level covariance matrix is no longer symmetric (and it’s typical to only use posterior means when drawing betas for hold-out respondents) I substitute the upper-level covariance matrix with an identity matrix.
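The hold-out draw described in the last note can be sketched as follows. This is an assumed Python illustration with made-up numbers: the upper-level model is reduced to a single posterior-mean vector, and the index list `keep` stands in for the positions of the brand and year coefficients. The sketch also includes a hypothetical `principal_submatrix` helper showing that taking the same indices for rows and columns leaves a symmetric block, which may be worth checking as an alternative to the identity substitution.

```python
import random

def subset_vector(values, keep):
    """Keep only the entries at the brand/year positions."""
    return [values[i] for i in keep]

def principal_submatrix(matrix, keep):
    """Select the same indices for rows and columns."""
    return [[matrix[i][j] for j in keep] for i in keep]

def draw_betas(mean, n_draws, seed=1):
    """Draw betas from N(mean, I): independent unit-variance normals."""
    rng = random.Random(seed)
    return [[rng.gauss(m, 1.0) for m in mean] for _ in range(n_draws)]

# Made-up upper-level posterior means and covariance for five coefficients.
upper_mean = [0.5, -0.2, 1.1, 0.3, -0.7]
upper_cov = [
    [1.0, 0.2, 0.1, 0.0, 0.3],
    [0.2, 1.5, 0.4, 0.1, 0.0],
    [0.1, 0.4, 2.0, 0.5, 0.2],
    [0.0, 0.1, 0.5, 1.2, 0.1],
    [0.3, 0.0, 0.2, 0.1, 0.9],
]
keep = [0, 2, 3]  # hypothetical positions of the brand and year coefficients

mean_sub = subset_vector(upper_mean, keep)
draws = draw_betas(mean_sub, n_draws=1000)  # identity covariance

cov_sub = principal_submatrix(upper_cov, keep)
is_symmetric = all(
    cov_sub[i][j] == cov_sub[j][i]
    for i in range(len(keep))
    for j in range(len(keep))
)
print(mean_sub, len(draws), is_symmetric)
```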

cwjohnson1 commented 3 years ago

I just uploaded two new plots to Sawtooth-2021.Rmd and am working on more. I know Zach was working with the recontact data, but since he's on another project now, I can also create visualizations for those data if you'd like.

marcdotson commented 3 years ago

Please do, @cwjohnson1.

marcdotson commented 3 years ago

Questions about constructing a validation task from recontact/ownership data:

How do you re-construct a validation task when you have purchase/ownership data?
Do we need to do something with the betas since we're using a subset of the attributes?
Should we re-run models with the validation task just for in-sample/hold-out respondents?

marcdotson commented 3 years ago

Short-term options:

Need to filter the ownership data based on the SUV/cross-over category.
Only include none as an option if we include all respondents from the recontact survey, including those who didn't purchase (that's the none option).

Long-term options:

Would need to get information on all of the same attributes to create a real validation task.
Construct a validation set with a "canonical example" for each brand-by-year combination, without a none option.
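One reading of the "canonical example" idea is a full profile per brand-by-year pair, with every other attribute held at a representative value. The Python sketch below assumes that reading; the brands, years, and default attribute values are hypothetical placeholders.

```python
from itertools import product

def canonical_set(brands, years, defaults):
    """Build one profile per brand-by-year combination.

    Attributes other than brand and year are held at the values in
    `defaults`, which are placeholders for illustration only.
    """
    return [
        {"brand": brand, "year": year, **defaults}
        for brand, year in product(brands, years)
    ]

brands = ["Toyota", "Honda", "Ford"]          # hypothetical levels
years = [2019, 2020]                          # hypothetical levels
defaults = {"price": "mid", "mpg": "average"} # hypothetical other attributes

validation_set = canonical_set(brands, years, defaults)
print(len(validation_set))
print(validation_set[0])
```

With no none option, each respondent's observed brand and year would then be compared against the highest-utility profile in this set.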

cwjohnson1 commented 3 years ago

I just added the code for the ownership data visualizations, like we talked about, to the presentation folder on the model-validation branch. The code for the recontact visualizations is in 02_exploratory-data-analysis.R. Would you like me to add that code as well so it's easier to find?

marcdotson commented 3 years ago

No, that's fine. Thanks!

marcdotson / modeling-heterogeneity

Validating models using car ownership and recontact survey data #11