jhelvy / cbcTools

An R package with tools for designing choice-based conjoint (cbc) survey experiments and conducting power analyses
https://jhelvy.github.io/cbcTools/

Power Analysis Simulation comparison between optimal designs with zero priors and true priors #26

Open mmardehali opened 11 months ago

mmardehali commented 11 months ago

EDIT: I made a mistake by not including the true priors in cbc_choices. I will leave this post as-is because I think it is still interesting to see how cbc_choices behaves without them. I added the priors and ran the power analysis simulation again; the new results, along with the updated section of the code, are in the second comment. In summary, it seems that "fixing" the code by including the true priors actually made the simulation results even more conservative.

Background: I designed a DZ-Efficient DCE (Bayesian D-Efficient with zero priors) using {cbcTools}. Using cbc_choices and cbc_power, I performed a power analysis simulation to determine an optimal sample size. Based on the output, it seemed that with about 180 participants I should be able to achieve standard errors of less than 0.05 for the model coefficients. Following recommended best practices in the literature, I conducted a pilot study with 1/10th of the sample size (N = 18) to estimate the true priors. After analyzing the results, I found that almost all estimated coefficients had achieved very high statistical significance: for some the p-value was < 0.001, and the rest were below 0.05, except for one factor that did not achieve significance (although, based on other indicators in the study, this factor truly did not impact my respondents' decisions). The discrepancy between the simulated and actual results is understandable, since zero priors were specified for the optimization. With zero priors, cbc_choices has no way of knowing how the choices are likely to be made, so the estimated impact of sample size on the coefficient standard errors is on the "very conservative" side. To explore this discrepancy further, I compared the results of the power analysis simulation with zero priors against the results with the true priors from the pilot study.

I will be renaming some attributes and levels for simplification and readability; otherwise, everything remains true to the actual procedures and findings.

Study Parameters:

- Attributes and Levels: Cost ($20, $25, $30), Type (A, B), UserRatings (3.2 Stars, 4.0 Stars, 4.8 Stars), Information (Unavailable, Low, Medium, High)
- Note: Type and Information are categorical attributes; Cost and UserRatings are numerical attributes
- Choice Matrix Shape: 16 choice tasks, each with 4 profiles (16 x 4)
- No-Choice Option: False

Pilot Results: These coefficient estimates will be used as true priors.

Model Type:    Multinomial Logit
Model Space:          Preference
Model Run:                1 of 1
Iterations:                   20
Elapsed Time:        0h:0m:0.01s
Algorithm:        NLOPT_LD_LBFGS
Weights Used?:             FALSE
Robust?                    FALSE

Model Coefficients: 
                     Estimate Std. Error z-value  Pr(>|z|)    
Cost              -0.130720   0.048061 -2.7199  0.006530 ** 
TypeA              0.416037   0.260832  1.5950  0.110703    
UserRatings        3.241950   0.357835  9.0599 < 2.2e-16 ***
InformationLow    -1.794934   0.629253 -2.8525  0.004338 ** 
InformationMedium  1.893437   0.403223  4.6958 2.656e-06 ***
InformationHigh    5.237012   0.574275  9.1194 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Log-Likelihood:         -112.5016203
Null Log-Likelihood:    -399.2527760
AIC:                     237.0032407
BIC:                     258.9810000
McFadden R2:               0.7182196
Adj McFadden R2:           0.7031915
Number of Observations:  288.0000000
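
For reference, the pilot estimates above came from a multinomial logit model fit with {logitr}, roughly along these lines (pilot_data and the exact column names are placeholders here, not my actual code):

library(logitr)

# Sketch of the pilot model; pilot_data stands in for the cleaned pilot choice data
pilot_model <- logitr(
  data    = pilot_data,
  outcome = "choice",
  obsID   = "obsID",
  pars    = c("Cost", "Type", "UserRatings", "Information")
)
summary(pilot_model)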

Power Analysis Simulation Comparison (Results): This is very interesting! My expectation was that including the true priors would improve things drastically. Evidently, it only made them worse! Also noteworthy: the DB-error went from 0.34 (zero priors) to 0.54 (true priors)!

[Figure: plot_compare_power() output comparing SE vs. sample size for the DZ (zero priors) and DB (true priors) designs]

Power Analysis Simulation Comparison (Code):

library(cbcTools)
library(logitr)

Nresp   <- 200  # number of respondents
Nchoice <- 16   # choice tasks per respondent
Nalt    <- 4    # alternatives per choice task
Nbreaks <- 20   # sample size breaks for the power analysis
HeadN   <- Nalt * Nchoice  # rows in one respondent's design
set.seed(6132)

# Profiles for the design with zero priors
# UserRatings needs to range and vary similarly to Cost, otherwise there will be balance issues due to the zero priors
DZ_profiles <- cbc_profiles(
  Cost = seq(20, 30, 5),
  Type = c("A", "B"),
  UserRatings = seq(20, 30, 5),
  Information = c("Unavailable", "High", "Medium", "Low")
)

DZ_rstrct_profiles <- cbc_restrict(
  DZ_profiles,
  Cost == 20 & Type == "A" & UserRatings == 30 & Information == "High",
  Cost == 30 & Type == "B" & UserRatings == 20 & Information == "Unavailable"
)

# Profiles for the design with true priors
# UserRatings can now take its true values since the priors are no longer zero
DB_profiles <- cbc_profiles(
  Cost = seq(20, 30, 5),
  Type = c("B", "A"),
  UserRatings = seq(3.2, 4.8, 0.8),
  Information = c("Unavailable", "High", "Medium", "Low")
)

DB_rstrct_profiles <- cbc_restrict(
  DB_profiles,
  Cost == 20 & Type == "A" & UserRatings == 4.8 & Information == "High",
  Cost == 30 & Type == "B" & UserRatings == 3.2 & Information == "Unavailable"
)

# Create DZ-Efficient design
DZ_design <- cbc_design(
  profiles = DZ_rstrct_profiles,
  n_resp = Nresp,
  n_alts = Nalt,
  n_q = Nchoice,
  n_start = 10,
  priors = list(
    Cost = 0,
    Type = 0,
    UserRatings = 0,
    Information = c(0, 0, 0)
  ),
  max_iter = 1000,
  method = "Modfed",
  keep_db_error = TRUE,
  parallel = TRUE
)
DZerr <- as.numeric(DZ_design$db_err)
DZ_design_dataframe <- as.data.frame(DZ_design$design)
first_design_DZ <- head(DZ_design_dataframe, HeadN)

# Create DB-Efficient design
DB_design <- cbc_design(
  profiles = DB_rstrct_profiles,
  n_resp = Nresp,
  n_alts = Nalt,
  n_q = Nchoice,
  n_start = 10,
  priors = list(
    Cost = -0.13,
    Type = 0, # did not achieve significance in pilot
    UserRatings = 3.2,
    Information = c(5.2, 1.9, -1.8)
  ),
  max_iter = 1000,
  method = "Modfed",
  keep_db_error = TRUE,
  parallel = TRUE
)
DBerr <- as.numeric(DB_design$db_err)
DB_design_dataframe <- as.data.frame(DB_design$design)
first_design_DB <- head(DB_design_dataframe, HeadN)

# Examining balance and overlap
cbc_balance(first_design_DZ)
cbc_overlap(first_design_DZ)

cbc_balance(first_design_DB)
cbc_overlap(first_design_DB)

# DZ choice simulation
choice_sim_DZ <- cbc_choices(
  design = DZ_design_dataframe,
  obsID = "obsID"
)
choice_sim_DZ <- as.data.frame(choice_sim_DZ)

# DB choice simulation
choice_sim_DB <- cbc_choices(
  design = DB_design_dataframe,
  obsID = "obsID"
)
choice_sim_DB <- as.data.frame(choice_sim_DB)

# Power analysis simulation and comparison
powerDZ <- cbc_power(
  data = choice_sim_DZ,
  pars = c("Cost", "Type", "UserRatings", "Information"),
  outcome = "choice",
  obsID = "obsID",
  nbreaks = Nbreaks,
  n_q = Nchoice
)

powerDB <- cbc_power(
  data = choice_sim_DB,
  pars = c("Cost", "Type", "UserRatings", "Information"),
  outcome = "choice",
  obsID = "obsID",
  nbreaks = Nbreaks,
  n_q = Nchoice,
)

plot_compare_power(powerDZ, powerDB)
mmardehali commented 11 months ago

Updates to the original post:

Everything in the post above is still accurate up until the "Power Analysis Simulation Comparison (Results)" section. However, even though I included the priors when designing the DB-Efficient choice matrix (with true priors; see DB_design above), I forgot to include the same priors when simulating choices with cbc_choices. To address this, I fixed the code and ran the simulation again. Please see the results below.

UPDATED Power Analysis Simulation Comparison (Results): It seems that including the true priors when simulating choices made the results of the power analysis simulation even more conservative, which I did not expect. If I understand this correctly, it means that after including priors in both the design of my choice matrix and in my choice simulation, I would need even more participants to get below the 0.05 standard error threshold. I think this should not be the case; if anything, I should need fewer participants, because that is the entire purpose of using Bayesian D-Efficient designs.

[Figure: updated plot_compare_power() output, with the true priors now used in cbc_choices for the DB design]

UPDATED Power Analysis Simulation Comparison (Code): Here is the only change I made to the script above, specifically to choice_sim_DB:

# DB choice simulation
choice_sim_DB <- cbc_choices(
  design = DB_design_dataframe,
  obsID = "obsID",
  priors = list(
    Cost = -0.13,
    Type = 0, # did not achieve significance in pilot
    UserRatings = 3.2,
    Information = c(5.2, 1.9, -1.8)
  )
)
choice_sim_DB <- as.data.frame(choice_sim_DB)
jhelvy commented 11 months ago

I haven't looked too closely at everything above, but if you're using priors in the creation of the design with cbc_design(), you should also use those priors in the simulation of choices with cbc_choices(). Otherwise your choices are being randomly generated (effectively all 0s for the utility model coefficients), which means the design with priors will probably appear worse than a zero-prior design.

mmardehali commented 11 months ago

That's precisely the issue I addressed in the update :) In the original post I made a mistake by not including those priors. In the update, I fixed the code and presented the new results.

jhelvy commented 11 months ago

And you're still getting that the DB design has worse performance? That's certainly not what would be expected.

mmardehali commented 11 months ago

I believe so! If you compare the plots in the update vs. the original post, it seems that the SE has increased for any given N for most attributes. The only difference between the two runs was fixing my oversight, so the updated plots now include the priors in cbc_choices.

Edit: And that's just the comparison between runs. Within runs, it also seems that including priors in most cases increases SE vs. N compared to zero priors.

jhelvy commented 11 months ago

My only other thought as to why this may be occurring is that you've used restricted profiles. Does this still occur without restrictions? I'd like to see a simple example (maybe the apple one I use in the documentation) where a DB and a DZ design are compared and the only difference is the priors used to simulate the choices. Btw, I believe you should be using the same priors to simulate choices for both designs.

mmardehali commented 11 months ago

I don't know much about the structure of the apples dataset, but I can definitely look into that! I'm not sure I'm understanding correctly... Are you suggesting that I should use the priors I have for the choice simulation even for the DZ design, where 0 priors were assumed originally? If so, I'm not sure how realistic that would be, because when I designed the DZ study I didn't know what the priors were going to be. Maybe I'm missing something here...

Currently, I'm comparing the power analysis simulation for:

- DZ: design with 0 priors, using 0 priors for choice simulation, vs.
- DB: design with true priors, using true priors for choice simulation.

Are you suggesting that I should compare the following instead?

- Design with 0 priors, using true priors for choice simulation, vs.
- Design with true priors, using true priors for choice simulation.

jhelvy commented 11 months ago

Yes. If you optimize a DZ design, then it will perform best when the true parameters are zero. So your simulation where you use 0 for the priors of the simulated choices is as optimistic as it could get. That design will perform worse if the true parameters are not zero (which they most likely aren't).

Instead, I tend to use a fixed set of parameters when simulating choices, because those are the true parameters I expect to encounter when respondents complete my survey. I definitely don't expect them to all be 0. So I would use the same priors when simulating choices regardless of the design.
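
For example, something along these lines (untested, and reusing the objects from your script above; choice_sim_DZ_true and powerDZ_true are just names I'm making up here) would compare both designs under the same assumed true parameters. Note that your DZ design coded UserRatings as 20-30 rather than 3.2-4.8, so the UserRatings prior would need rescaling to the design's units (3.24 per star times 0.16 stars per design unit, about 0.52):

# DZ choice simulation using the same (true) priors as the DB simulation
# Note: UserRatings is coded 20-30 in the DZ design, so the prior is rescaled
# (approx. 3.24 * 0.16, about 0.52 per design unit); adjust if your coding differs
choice_sim_DZ_true <- cbc_choices(
  design = DZ_design_dataframe,
  obsID = "obsID",
  priors = list(
    Cost = -0.13,
    Type = 0,
    UserRatings = 0.52,
    Information = c(5.2, 1.9, -1.8)
  )
)
choice_sim_DZ_true <- as.data.frame(choice_sim_DZ_true)

# Power analysis for the DZ design under the true priors
powerDZ_true <- cbc_power(
  data = choice_sim_DZ_true,
  pars = c("Cost", "Type", "UserRatings", "Information"),
  outcome = "choice",
  obsID = "obsID",
  nbreaks = Nbreaks,
  n_q = Nchoice
)

# Compare against the DB design simulated with the same priors
plot_compare_power(powerDZ_true, powerDB)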

mmardehali commented 11 months ago

It seems fair to use zero priors in the design but at least provide best guesses for the priors in the choice simulation. When priors are provided, cbc_choices and cbc_power do seem to approximate SE vs. N correctly. But it is unclear to me how the output of cbc_power (the SE vs. N plots) can be used to estimate a proper sample size, and the SE = 0.05 line on the plots initially misled me into thinking that I should target a sample size that gets the SE below that threshold. Would it be more appropriate for cbc_power to output p-value vs. N plots instead? In my case, the DB plots approximate the SE values relatively well (compared to the pilot study results presented above). However, I don't think at any point I would have been able to estimate, solely from the SE vs. N plots, that my coefficients would achieve statistical significance. Is there something I'm missing in my interpretation?

jhelvy commented 11 months ago

You know what, you're totally right! Your comment made me realize that the code I used to make these figures was left over from a long time ago, when I was only checking that the standard errors fell with increasing N (just to visually confirm things were working as expected). I never updated the code to plot the p-values. Looking at standard errors alone is not really that informative; we should be looking at p-value vs. N to assess sample size requirements. This has been this way for so long that I'm embarrassed I missed it!

Alright, I'm going to fix this. I think cbc_power() should return the entire summary table (coefficients, standard errors, and p-values) along with the sample size. I'll make an update and post it soon.
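
In the meantime, a rough workaround is to compute approximate p-values from what cbc_power() already returns and plot those instead. I'm assuming here that the returned data frame has columns named sampleSize, coef, est, and se; rename as needed if yours differ:

library(ggplot2)

# Approximate two-sided p-values from the estimates and standard errors
pvalsDB <- as.data.frame(powerDB)
pvalsDB$p_value <- 2 * pnorm(-abs(pvalsDB$est / pvalsDB$se))

# p-value vs. N, one panel per coefficient, with a p = 0.05 reference line
ggplot(pvalsDB, aes(x = sampleSize, y = p_value)) +
  geom_line() +
  geom_point() +
  geom_hline(yintercept = 0.05, linetype = "dashed") +
  facet_wrap(~coef) +
  theme_minimal()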

mmardehali commented 11 months ago

Frankly I'm glad I wasn't embarrassing myself by raising an irrelevant concern! Thank you for addressing this!

Just a couple of thoughts:

  1. I'm not sure how presenting entire summary tables will be beneficial, because those tables change based on the number of participants (or nbreaks in this particular context). So either the summary tables need to be presented for each "break", or only the last one, with the maximum number of participants, can be shown. That means either too many tables to study, or one table that is not helpful for estimating a viable minimum N. I think showing the p-value vs. N plot will be much more informative, unless there is a benefit to showing summary tables that I'm missing.
  2. Maybe to improve quality of life for users, a message could be shown such as "Approximated minimum viable sample size to achieve statistical significance for all coefficients (p < 0.05): X", where X is the smallest N at which all coefficients have a p-value below 0.05. I think it's perfectly possible to find this N from the p vs. N plot, but if there are many attributes (or categorical attributes), the user would have to find that N for each sub-plot and then pick the largest one. Again, I don't think this is difficult at all, but it adds yet another possibility for error. If this could be automated (see the sketch after this list), I think it would make the function more straightforward to use (with a huge disclaimer that this is just an estimation and users should aim for larger sample sizes to be on the safe side).
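
Here is a rough sketch of what I mean, using a p-value table like the pvalsDB data frame sketched above (the column names sampleSize, coef, and p_value are assumed, and this simple version assumes every coefficient eventually reaches p < 0.05 within the simulated range and stays there):

# Sketch: smallest N at which every coefficient has p < 0.05
min_n_per_coef <- aggregate(
  sampleSize ~ coef,
  data = subset(pvalsDB, p_value < 0.05),
  FUN = min
)
min_viable_n <- max(min_n_per_coef$sampleSize)
message(
  "Approximated minimum viable sample size to achieve statistical ",
  "significance for all coefficients (p < 0.05): ", min_viable_n
)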
jhelvy commented 11 months ago

By summary table I meant return all of the information in the summary tables. Right now cbc_power() returns a data frame that includes the coefficient estimate and standard error for each sample size, but I could very easily add the t stat and the p-values in that data frame. That would provide all of the summary information from each estimated model. Then the plot method included in the function could be more general where you can ask for a specific plot, like p-value vs. N or standard error vs. N, etc.

I am thinking of re-structuring what is returned from cbc_power() to be a more complex list object that includes the estimated models as well as the summary table mentioned above. Then you could have a summary() method that prints out information like what you proposed about the approximate minimum sample size needed to achieve a desired level of significance, etc.

Unfortunately classes are now starting for me, and I have a very heavy teaching load this fall, so this is probably all going to have to wait for the time being. I also use this package in the class I teach, so I'm hesitant to make considerable modifications and risk introducing bugs.

For sure, all of this would be a much-needed improvement to the package. In making some preliminary changes to look at p-values vs. N, I'm getting strange results where the p-values are all very high, even at large sample sizes. There's a good chance I just haven't got the code right yet, though.