budget allocation with no control vars

wd60622 commented 2 months ago

Discussed in https://github.com/pymc-labs/pymc-marketing/discussions/1028

Originally posted by **bella0715** September 12, 2024

Here's how I built the model.

```python
mmm = DelayedSaturatedMMM(
    model_config=mmm_config,
    sampler_config=sampler_config,
    date_column=date_var,
    channel_columns=spend_vars,
    control_columns=None,
    adstock_max_lag=8,
    yearly_seasonality=1,
)
```

When I try to run the below code to do the budget allocation, I get an error `UnboundLocalError: cannot access local variable '_controls' where it is not associated with a value`. How do I run the budget allocation without `control_columns`?

```python
response = mmm.allocate_budget_to_maximize_response(
    budget=total_budget,
    num_days=8,
    time_granularity="weekly",
    budget_bounds=budget_bounds,
)
```

wd60622 commented 2 months ago

It seems like part of the problem: https://github.com/pymc-labs/pymc-marketing/blob/936270958f79fd3cce0488ae2301f2d8f3e2a35f/pymc_marketing/mmm/mmm.py#L2074-L2077
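
For context, this kind of `UnboundLocalError` can be reproduced in isolation. The sketch below is a simplified, hypothetical stand-in: `make_rows` is not the library's code, and whether the linked lines have exactly this shape is an assumption. It only illustrates how a name assigned inside an `if controls is not None` branch is left unbound when `controls` is `None`.

```python
def make_rows(channels, controls=None):
    """Hypothetical stand-in for a synthetic-row builder (not pymc-marketing code)."""
    if controls is not None:
        # _controls is only assigned when controls are supplied.
        _controls = {col: 0 for col in controls}
    # With controls=None, _controls was never bound, so unpacking it raises.
    return [{**{ch: 1.0 for ch in channels}, **_controls}]


try:
    make_rows(["tv", "radio"], controls=None)
except UnboundLocalError as err:
    error_name = type(err).__name__
    print(error_name, err)
```

Calling the same function with a list of controls works, which is why the problem only shows up for models built with `control_columns=None`.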

AlfredoJF commented 2 months ago

Curious what the experts would consider an optimal solution. In the meantime, here is my workaround for this issue: I added a few new methods to the MMM class, extended others, and used historical control variables plus random noise, following the same approach as in `_create_synth_dataset`.

Find last year's `start_date_comparison` based on `last_date` (https://github.com/pymc-labs/pymc-marketing/blob/d5fa54348816404668182f34d0bee72ca20bbc25/pymc_marketing/mmm/mmm.py#L2079):

```python
from dateutil.relativedelta import relativedelta

start_date_comparison = last_date - relativedelta(years=1)
```

Get the comparison df from the train dataset `X`:

```python
from datetime import datetime

import pandas as pd


def get_comparison_df(
    self,
    start_date_comparison: datetime,
    num_periods: int,
) -> pd.DataFrame:
    """Return the training rows in the num_periods-week window after start_date_comparison."""
    start_date = pd.to_datetime(start_date_comparison)
    end_date = start_date + pd.DateOffset(weeks=num_periods)

    # A boolean mask avoids relying on how DataFrame.query parses unquoted dates.
    # Both bounds are inclusive, as in the original filter.
    dates = pd.to_datetime(self.X[self.date_column])
    df_train_comparison = self.X.loc[(dates >= start_date) & (dates <= end_date)].copy()

    return df_train_comparison
```

Added `**kwargs` to `allocate_budget_to_maximize_response` and to the `_create_synth_dataset` call inside it, so the new parameters are supported without too many modifications to the original methods.

```python
def allocate_budget_to_maximize_response(
    self,
    ...
    **kwargs,
) -> az.InferenceData:
    ...
    synth_dataset = self._create_synth_dataset(
        df=self.X,
        date_column=self.date_column,
        allocation_strategy=self.optimal_allocation_dict,
        channels=self.channel_columns,
        controls=self.control_columns,
        target_col=self.output_var,
        time_granularity=time_granularity,
        time_length=num_periods,
        lag=self.adstock.l_max,
        noise_level=noise_level,
        **kwargs,  # forwards start_date_comparison and df_train_comparison
    )
    ...
    return ...
```

Added `start_date_comparison` and `df_train_comparison` to the `_create_synth_dataset` method, along with the workaround logic:
```python
    def _create_synth_dataset(
        self,
        ...
        start_date_comparison: datetime | str | None = None,
        df_train_comparison: pd.DataFrame | None = None,
    ) -> pd.DataFrame:
        """Create a synthetic dataset based on the given allocation strategy (Budget) and time granularity.

        Parameters
        ----------
        ...
        start_date_comparison : datetime | str | None
            The date from which the synthetic dataset will be created.
        df_train_comparison : pd.DataFrame | None
            A slice of the train dataset from a previous year.
        """
        ...

        if start_date_comparison is not None:
            last_date = pd.to_datetime(start_date_comparison).tz_localize(None)
        else:
            last_date = pd.to_datetime(df[date_column]).max()  # ln:2079

        ...

        new_rows = [
            ...
        ]  # ln:2108

        # Add historical control variables plus random noise
        if df_train_comparison is not None:
            synth_dataset = pd.DataFrame(new_rows)

            for control in self.control_columns:
                # abs() keeps the noise scale non-negative; np.random.normal
                # raises ValueError for a negative scale.
                synth_dataset[control] = [
                    value + np.random.normal(0, noise_level * abs(value))
                    for value in df_train_comparison[control].to_list()
                ]

            return synth_dataset

        return pd.DataFrame(new_rows)  # ln: 2110
```
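
The noise step can also be tried on its own. This is a minimal sketch using only the standard library; the helper name `add_proportional_noise` and its `seed` parameter are hypothetical and not part of the MMM class. It assumes the intent is Gaussian noise whose standard deviation is `noise_level` times the magnitude of each historical value.

```python
import random


def add_proportional_noise(values, noise_level, seed=None):
    """Return values with zero-mean Gaussian noise scaled to each value's magnitude."""
    rng = random.Random(seed)  # local generator, so a seed gives reproducible output
    return [v + rng.gauss(0.0, noise_level * abs(v)) for v in values]


historical_control = [10.0, -5.0, 0.0]
noisy_control = add_proportional_noise(historical_control, noise_level=0.1, seed=0)
print(noisy_control)
```

With this scaling a historical value of zero stays exactly zero, and `noise_level=0` returns the historical values unchanged.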

Hope this is self-explanatory and not too convoluted.

Happy to hear your thoughts on an optimal solution.

wd60622 commented 3 weeks ago

Seems like a pretty good workaround. Would you want to make a PR for this @AlfredoJF?

Seems like a simple edit around these lines might do the trick:

https://github.com/pymc-labs/pymc-marketing/blob/af946dfa8687a65018ddd4d708f434a7f32f30ab/pymc_marketing/mmm/mmm.py#L2091-L2094
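
One possible shape for such an edit, as a hedged sketch only: default the control values to an empty mapping so the name is always bound. The function `make_synth_rows`, its parameters, and the `"date"` key are hypothetical and simplified; this is not the code at the linked lines, and the actual fix in the library may differ.

```python
from datetime import date, timedelta


def make_synth_rows(last_date, num_periods, allocation, controls=None):
    """Hypothetical sketch of building future weekly rows (not the library's implementation)."""
    # Fall back to an empty mapping when the model has no control columns.
    control_values = {col: 0 for col in controls} if controls is not None else {}
    return [
        {"date": last_date + timedelta(weeks=i), **allocation, **control_values}
        for i in range(1, num_periods + 1)
    ]


rows = make_synth_rows(date(2024, 9, 9), 2, {"tv": 100.0, "radio": 50.0})
print(rows)
```

Here the rows contain only the date and channel columns when `controls` is `None`, and gain one zero-filled column per control otherwise.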

budget allocation with no control vars #1030
