sibylhe / mmm_stan

Python/STAN Implementation of Multiplicative Marketing Mix Model, with deep dive into Adstock (carry-over effect), ROAS, and mROAS
MIT License

Theoretical questions and tips for the code #3

Closed Emnlv closed 3 years ago

Emnlv commented 3 years ago

Hi Sibyl,

Hoping you are well. I have some theoretical questions about the model and some suggestions for the code.

About Marketing Mix Model (model 2):

  1. If I have to benchmark different models, which metrics do you suggest using? Every time I check, the MAPE and RMSE are pretty similar across models.

  2. How can we evaluate the impact of each control variable on the baseline?

  3. In your dataset, your media is often ON. What happens if we have interspersed media values with many 0s from campaign to campaign? In this case, if our lag effect is not sufficient to cover the gap, the cumulative effect will be 0. To handle this last problem, I decided to change the STAN code; it works, but I am not sure it is the right approach. Here you can find the changed code:

    // adstock, mean-center, log1p transformation
    row_vector[max_lag] lag_weights;
    for (nn in 1:N) {
      for (media in 1 : num_media) {
        for (lag in 1 : max_lag) {
          lag_weights[max_lag-lag+1] <- pow(decay[media], (lag - 1 - peak[media]) ^ 2);
        }
        cum_effect <- Adstock(sub_col(X_media, nn, media, max_lag), lag_weights);
        if (cum_effect == 0) {
          X_media_adstocked[nn, media] <- 0;
        } else {
          X_media_adstocked[nn, media] <- log1p(cum_effect/mu_mdip[media]);
        }
      }
      X <- append_col(X_media_adstocked, X_ctrl);
    }
    }  // closes the transformed parameters block (its opening brace is not shown here)

Instead of this modification, do you suggest other solutions? For example, would it be effective to create a vector of cumulative effects, transform it to avoid 0s, and then pass it through the formula log1p(cum_effect/mu_mdip[media])?

  4. How can we deal with variables that take both negative and positive values, e.g., seasonality (variation from negative to positive)? Do you think a transformation such as (X - X.min())/(X.max() - X.min()) is a right approach, or do you suggest keeping the mean transformation?

  5. Why do we need to apply a log_mean_center transformation in the second model?


I think the parts below can improve the code and avoid some errors.

Control Model (model 1)
I think this solution can help if we want to use only one variable for one specific beta; otherwise problems can arise. I suggest transforming every control variable X into matrix form to obtain the right shape. For example:

    # here we have more than one variable
    pos_vars = [col for col in base_vars if col not in seas_cols]
    X1 = np.matrix(df_ctrl[pos_vars].values).reshape(
        len(df), len([pos_vars]) if type(pos_vars) == str else len(pos_vars))

    # here we have only one control variable; the same expression keeps the shape
    # coherent with the ctrl_data dictionary later on
    pn_vars = seas_cols[1]
    X2 = np.matrix(df_ctrl[pn_vars].values).reshape(
        len(df), len([pn_vars]) if type(pn_vars) == str else len(pn_vars))

    ctrl_data = {
        'N': len(df_ctrl),
        'K1': X1.shape[1],  # instead of len(pos_vars)
        'K2': X2.shape[1],  # instead of len(pn_vars)
        'X1': X1,
        'X2': X2,
        'y': df_ctrl['base_sales'].values,
        'max_intercept': min(df_ctrl['total_volume'])
    }

In addition, in every sm.sampling() call I would add n_jobs=-1 to run the code faster (if it can be helpful). A minimal sketch is shown below.
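
For example (a sketch only; the data dict, iteration count, and chain count are placeholders, assuming the PyStan 2 sm.sampling() interface used in the repo):

    # Hedged sketch (PyStan 2): n_jobs=-1 runs the chains on all available CPU cores.
    fit = sm.sampling(data=ctrl_data, iter=2000, chains=4, n_jobs=-1)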

As always, Sibyl, thank you very much for your help and for the code you published. You are a big help to everyone who needs it.

Best regards

sibylhe commented 3 years ago

Thanks for the suggestions! I will incorporate them into my code.

  1. MAPE. Because MAPE is calculated on the original data, it's more straightforward (10% means the model prediction fluctuates 10% above or below the real sales). Meanwhile, it's important to check the adstock parameters: are they reasonable and in line with your domain knowledge? MMM aims to explain the causes; accuracy is not all that matters.
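
For reference, a minimal sketch of MAPE on the original-scale data (an illustrative helper, not a function from the repo):

    import numpy as np

    # Hedged sketch: MAPE computed on original-scale data, as described above.
    def mape(y_true, y_pred):
        y_true = np.asarray(y_true, dtype=float)
        y_pred = np.asarray(y_pred, dtype=float)
        return np.mean(np.abs((y_true - y_pred) / y_true)) * 100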

  2. Plug in the coefficients of the control variables, calculate the contribution of each factor, and scale to the real data.
    E.g., impact of gas price = beta_me_gas_dpg * me_gas_dpg. This is similar to mmm_decompose_contrib(), but much simpler, since control effects are additive.
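
A minimal sketch of that calculation (the coefficient value and the toy data are assumptions, not outputs of the repo):

    import pandas as pd

    # Hedged sketch: weekly contribution of one control variable in the additive control model.
    beta_gas = 0.12                                        # assumed fitted coefficient for gas price
    df = pd.DataFrame({'me_gas_dpg': [1.02, 0.98, 1.05]})  # mean-centered gas price (toy values)
    df['contrib_gas'] = beta_gas * df['me_gas_dpg']        # impact of gas price = beta * variable
    print(df)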

  3. If a media channel is all 0's, I would drop it instead of passing it to the model. You know this channel will contribute 0, so why not save some effort :)

In my code, it's okay for some weeks' cum_effect to be 0, but not all of them. In the dataset, the first few weeks of mdip_so are 0's, and the log1p transformation works. Your problem is that you have a media channel with all 0's (I would recommend dropping it); the condition in your if clause should be "if mu_mdip[media] == 0" instead of "if cum_effect == 0", though there is no difference in execution. It's the average media impression mu_mdip[media] being a zero divisor that makes log1p crash, not cum_effect == 0.
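
For example, a hedged sketch of dropping all-zero media columns before modeling (the toy frame and column names are illustrative; 'mdip_' follows the repo's naming for media impression columns):

    import pandas as pd

    # Hedged sketch: keep only media columns that have at least one non-zero impression.
    df = pd.DataFrame({'mdip_dm': [0, 0, 0], 'mdip_so': [0, 5, 7], 'sales': [10, 12, 11]})
    mdip_cols = [c for c in df.columns if c.startswith('mdip_')]
    keep = [c for c in mdip_cols if (df[c] != 0).any()]   # -> ['mdip_so']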

You mentioned you were doing MMM at campaign level; I would leave a flag on that. MMM is usually done at a nationwide, weekly level and requires 2-3 years of consecutive data. In addition, please note: the X variables must add up to the whole Y. In my model, Y is nationwide sales and X is nationwide channel impressions (plus control variables). If you want to do it at campaign level, your Y is still nationwide sales, so your X should be ALL nationwide campaigns' impressions, not one specific campaign.

  4. It's okay to have control variables with both positive and negative values; since the control model is additive, there is no need for a log1p transformation. Media variables are non-negative in nature.

For normalization, I use mean centralization because (1) I want the model to focus on the trend, not the absolute numbers, and (2) it avoids negative values for log1p. I feel minmax is less related to the trend, because it's not proportional to the original data. But you're free to try minmax, maybe I'm wrong. Normalization is optional for regression analysis; you can build the model without it.
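
A minimal sketch contrasting the two normalizations on toy values (assuming mean centralization here means dividing by the column mean, as the me_ variables suggest):

    import numpy as np

    x = np.array([100., 120., 90., 150.])            # toy weekly values of one variable

    x_mean = x / x.mean()                            # mean centralization: stays proportional to the raw data
    x_minmax = (x - x.min()) / (x.max() - x.min())   # minmax: anchored to observed min/max, proportions change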

BTW, seasonality variables don't have negative values; they are 0 or 1. It's their effects that may be either positive or negative.

  5. Please see the model specification in section 1.1. I'm building a multiplicative MMM, assuming media effects are multiplicative. In order to transform the multiplicative formula into a linear regression problem, take the log on both sides.
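
As a generic illustration of that step (a sketch of the idea, not necessarily the exact formula in section 1.1): if the model is multiplicative,

    y = exp(tau) * prod_m x_m^beta_m

then taking the log of both sides gives

    ln(y) = tau + sum_m beta_m * ln(x_m)

which is an ordinary linear regression in the log-transformed predictors.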

Emnlv commented 3 years ago

Hi Sibyl! Thanks a lot for your prompt reply! I will explain two points better to make them clear.

1
E.g., first case: I run the second model (marketing mix) with a base_sales obtained from the first model (control model), built from a certain combination of control variables. At the end of the process I obtain a MAPE score. Second case: I run the second model again with a base_sales obtained from a different set of control variables w.r.t. the first case. The MAPE I obtain is extremely close to that of the first case. How do I choose the best model? Do I also have to consider the adstock parameters? And, regarding MAPE for a marketing mix model, what are good boundaries? For example, should I consider a MAPE of 20% bad?

3 Sorry, I meant that, e.g., my media vector is: [100, 110, 120, 120, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 130, 110, 120,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,80,70,100]. And it is at national level :) In this case, the cumulative effect in some weeks will be 0, but my mu_mdip[media] will be > 0. I don't want to extend the lag to cover this issue; I would prefer to change the STAN code using a good approach. So, do you think the best option is to leave it as before? Or can the modification I made be good?

Again, thanks a lot, Sibyl! I hope these questions will help the community too.

sibylhe commented 3 years ago

1 For model selection, I think it's more about domain knowledge than accuracy metrics. Some criteria:

  1. adstock
  2. MAPE: generally I think below 15% or 20% is acceptable, sometimes 30%; it depends on your case.
  3. Rhat (Rhat = 1 at convergence) and n_eff (effective sample size) of the parameters: these show whether a parameter has converged well (see the sketch after this list).
  4. domain knowledge, existing theories/findings. If you still cannot choose a model, just go ahead and test all your findings. Whatever is found by MMM is only a mathematical solution; the real world does not necessarily act like this, so the findings have to be validated by A/B testing.
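
For the convergence check, a hedged sketch assuming a PyStan 2 fit object returned by sm.sampling():

    import pandas as pd

    # Hedged sketch (PyStan 2): turn the fit summary into a DataFrame and inspect diagnostics.
    summary = fit.summary()
    sdf = pd.DataFrame(summary['summary'],
                       index=summary['summary_rownames'],
                       columns=summary['summary_colnames'])
    print(sdf[['n_eff', 'Rhat']])   # Rhat near 1 and a large n_eff indicate good convergence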

3 If mu_mdip[media] > 0, I think the modification you made makes no difference. I don't quite get what you mean by "don't want to extend the lag to cover this issue": what do you want your transformed data to be like? Whatever is 0 staying 0 after the adstock transformation? The only way to achieve that is to apply no adstock to this channel.

Emnlv commented 3 years ago

Thanks a lot, Sibyl!!! Definitely great! I will also check LOO (even if I am not 100% sure it will be useful in this case) as an additional check!

I don't quite get what you mean by "don't want to extend the lag to cover this issue", what do you want your transformed data to be like? -> In this case, if you want to avoid having 0s due to a long sequence of 0s in your real media data vector, I think you can set the lag to "∞" (or rather, the length of your df): you then get a cumulative effect that is always > 0 and gets smaller and smaller from week to week, but I think it is not a good approach.

sibylhe commented 3 years ago

I agree it's not a good approach. The "Adstock with Varying Length" plot shows that the impact of length is minor; setting it to 8 weeks, 12 weeks, or infinite makes little difference. If a channel's spending/impression is trivial, its model result is not trustworthy and needs to be further tested. Adstock is not for the purpose of avoiding 0's; it is meant to mimic the carry-over effect that occurs in the real world. Why do you want to avoid 0's, or what's your purpose in changing the adstock? I don't fully understand your problem and goal; if you let me know, I might have a more specific answer.