Open wd60622 opened 1 month ago
It seems like part of the problem: https://github.com/pymc-labs/pymc-marketing/blob/936270958f79fd3cce0488ae2301f2d8f3e2a35f/pymc_marketing/mmm/mmm.py#L2074-L2077
Curious about what would be an optimal solution from the experts. In the meantime, I'm sharing my workaround for this issue: I added a few new methods to the `MMM` class, extended others, and used historical control variables plus random noise, following the same approach as in `_create_synth_dataset`.
- Set `start_date_comparison` to one year before `last_date` (https://github.com/pymc-labs/pymc-marketing/blob/d5fa54348816404668182f34d0bee72ca20bbc25/pymc_marketing/mmm/mmm.py#L2079):
```python
from dateutil.relativedelta import relativedelta

start_date_comparison = last_date - relativedelta(years=1)
```
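One thing to keep in mind with the one-year offset: on weekly data, subtracting a calendar year usually does not land on the same weekday. The short sketch below uses an arbitrary example date (an assumption, not taken from the dataset) and compares it with a 52-week offset, which preserves the weekday.

```python
from datetime import datetime

from dateutil.relativedelta import relativedelta

# Arbitrary example: a Monday, as the last date of a weekly dataset.
last_date = datetime(2024, 3, 4)

# Calendar-year offset: same day of month, but a different weekday.
one_year_back = last_date - relativedelta(years=1)

# 52-week offset: exactly 364 days earlier, so the weekday is preserved.
fifty_two_weeks_back = last_date - relativedelta(weeks=52)

print(one_year_back.strftime("%Y-%m-%d %A"))
print(fifty_two_weeks_back.strftime("%Y-%m-%d %A"))
```

Whether this matters depends on how the comparison window is filtered afterwards; a `>=` filter simply picks up the next available date.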
- Get the comparison DataFrame from the training dataset `X`:
```python
from datetime import datetime

import pandas as pd

def get_comparison_df(
    self,
    start_date_comparison: datetime | None = None,
    num_periods: int | None = None,
) -> pd.DataFrame:
    # Select the training rows between the comparison start date and
    # `num_periods` weeks later (both bounds inclusive).
    end_date = start_date_comparison + pd.DateOffset(weeks=num_periods)
    date_filter_str = (
        f"{self.date_column} >= {start_date_comparison.strftime('%Y%m%d')} and "
        f"{self.date_column} <= {end_date.strftime('%Y%m%d')}"
    )
    df_train_comparison = self.X.query(date_filter_str).copy()
    return df_train_comparison
```
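For anyone who wants to try the window selection outside the class, here is a minimal standalone sketch. The function name `select_comparison_window`, the toy DataFrame, and the column names are assumptions made for illustration; it uses a boolean mask with `Series.between` rather than a query string, but keeps the same inclusive bounds.

```python
import pandas as pd


def select_comparison_window(
    df: pd.DataFrame,
    date_column: str,
    start_date: pd.Timestamp,
    num_periods: int,
) -> pd.DataFrame:
    """Return rows dated within [start_date, start_date + num_periods weeks]."""
    end_date = start_date + pd.DateOffset(weeks=num_periods)
    dates = pd.to_datetime(df[date_column])
    # Series.between is inclusive on both ends by default, matching the
    # ">= ... and <= ..." filter used in the method.
    return df.loc[dates.between(start_date, end_date)].copy()


# Toy weekly data standing in for the training set X (hypothetical values).
X = pd.DataFrame(
    {
        "date": pd.date_range("2023-01-02", periods=10, freq="W-MON"),
        "control_1": range(10),
    }
)
window = select_comparison_window(
    X, "date", pd.Timestamp("2023-01-16"), num_periods=4
)
print(len(window))
```

Because both bounds are inclusive, weekly data yields `num_periods + 1` rows in this sketch. That is worth checking if the result has to line up row-for-row with the synthetic dataset.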
- Added `**kwargs` to `allocate_budget_to_maximize_response` and to the `_create_synth_dataset` call in `allocate_budget_to_maximize_response` to support new params without too many modifications to the original methods:
```python
def allocate_budget_to_maximize_response(
    self,
    ...
    **kwargs,
) -> az.InferenceData:
    ...
    synth_dataset = self._create_synth_dataset(
        df=self.X,
        date_column=self.date_column,
        allocation_strategy=self.optimal_allocation_dict,
        channels=self.channel_columns,
        controls=self.control_columns,
        target_col=self.output_var,
        time_granularity=time_granularity,
        time_length=num_periods,
        lag=self.adstock.l_max,
        noise_level=noise_level,
        **kwargs,
    )
    ...
    return ...
```
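The forwarding pattern in isolation looks roughly like this. `allocate` and `create_dataset` are hypothetical stand-ins for the two methods, not the library's API; the sketch only shows how a new optional parameter reaches the inner function without touching the outer signature.

```python
from __future__ import annotations


def create_dataset(time_length: int, start_date: str | None = None) -> dict:
    """Hypothetical inner function that gains a new optional parameter."""
    return {"time_length": time_length, "start_date": start_date}


def allocate(num_periods: int, **kwargs) -> dict:
    """Hypothetical outer function: unchanged apart from forwarding **kwargs."""
    return create_dataset(time_length=num_periods, **kwargs)


# Existing callers keep working; new callers can pass the extra parameter.
default_call = allocate(4)
forwarded_call = allocate(4, start_date="2023-01-16")
print(default_call, forwarded_call)
```

One trade-off of this design: a misspelled keyword is only rejected by the inner function, so the `TypeError` surfaces there rather than at the outer call site.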
- Added `start_date_comparison` and `df_train_comparison` to method `_create_synth_dataset`, and added the workaround logic:
```python
def _create_synth_dataset(
    self,
    ...
    start_date_comparison: datetime | str | None = None,
    df_train_comparison: pd.DataFrame | None = None,
) -> pd.DataFrame:
    """Create a synthetic dataset based on the given allocation strategy (Budget) and time granularity.

    Parameters
    ----------
    ...
    start_date_comparison : datetime | str | None
        Date from which the synthetic dataset will be created.
    df_train_comparison : pd.DataFrame | None
        Rows of the training dataset from a previous year.
    """
    ...
    if start_date_comparison is not None:
        last_date = pd.to_datetime(start_date_comparison).tz_localize(None)
    else:
        last_date = pd.to_datetime(df[date_column]).max()  # ln:2079
    ...
    new_rows = [
        ...
    ]  # ln:2108
    # Add historical control variables plus random noise
    if df_train_comparison is not None:
        synth_dataset = pd.DataFrame(new_rows)
        for control in self.control_columns:
            synth_dataset[control] = [
                # abs() keeps the noise scale non-negative for negative control values
                value + np.random.normal(0, abs(noise_level * value))
                for value in df_train_comparison[control].to_list()
            ]
        return synth_dataset
    else:
        return pd.DataFrame(new_rows)  # ln: 2110
```
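The control-noise step can also be tried on its own. Below is a minimal sketch under a few assumptions: `add_noisy_controls`, the toy frames, and the `temperature` column are made up for illustration, and a seeded `numpy.random.Generator` replaces the global `np.random` so results are reproducible. The explicit length check reflects that the historical rows are assigned positionally to the synthetic rows.

```python
import numpy as np
import pandas as pd


def add_noisy_controls(
    synth: pd.DataFrame,
    history: pd.DataFrame,
    controls: list[str],
    noise_level: float,
    rng: np.random.Generator,
) -> pd.DataFrame:
    """Copy historical control columns into `synth`, perturbed by Gaussian noise."""
    if len(history) != len(synth):
        raise ValueError(
            f"history has {len(history)} rows but synth has {len(synth)}"
        )
    out = synth.copy()
    for control in controls:
        values = history[control].to_numpy(dtype=float)
        # Noise scale is proportional to the magnitude of each value;
        # np.abs keeps it non-negative, since NumPy rejects a negative scale.
        noise = rng.normal(0.0, noise_level * np.abs(values))
        out[control] = values + noise
    return out


rng = np.random.default_rng(0)
synth = pd.DataFrame({"channel_1": [100.0, 100.0, 100.0]})
history = pd.DataFrame({"temperature": [10.0, -5.0, 0.0]})

noisy = add_noisy_controls(synth, history, ["temperature"], 0.1, rng)
exact = add_noisy_controls(synth, history, ["temperature"], 0.0, rng)
print(noisy)
```

With proportional noise, a historical value of exactly zero stays zero, which may or may not be what you want for controls such as binary flags.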
I hope this is self-explanatory and not too convoluted.
Happy to hear your thoughts and any better solution.
Seems like a pretty good workaround. Would you want to make a PR for this @AlfredoJF?
Seems like a simple edit around these lines might do the trick:
Discussed in https://github.com/pymc-labs/pymc-marketing/discussions/1028