pymc-labs / CausalPy

A Python package for causal inference in quasi-experimental settings
https://causalpy.readthedocs.io
Apache License 2.0
914 stars 65 forks

Get model results data #418

Open Guilherme-dL opened 1 month ago

Guilherme-dL commented 1 month ago

Is there a way to access the synthetic control fitted values as a pandas series or list? I need to create my own custom plots in a specific style, so I need access to the data for the fitted values and the confidence interval.

drbenvincent commented 1 month ago

Hi. Yes, this is definitely possible. You can access the raw MCMC samples in result.idata; it sounds like you'll want the posterior predictive samples. You can then calculate the mean, use ArviZ to calculate the HDIs, and then export to a CSV, for example.

Is that enough to go on, or are you requesting this as a utility function?
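To make those steps concrete, here is a minimal, self-contained sketch. Everything in it is an assumption standing in for a real fitted experiment: the synthetic idata object, the variable name "mu", and the array shapes. Check your own result.idata for the actual group and variable names.

```python
import arviz as az
import numpy as np

# Hypothetical stand-in for result.idata: posterior predictive samples of
# the fitted values, here named "mu" (inspect your idata for the real name).
rng = np.random.default_rng(0)
idata = az.from_dict(
    posterior_predictive={"mu": rng.normal(size=(4, 500, 10))}  # (chain, draw, obs)
)

# Stack chain and draw into one "sample" dimension, then average to get
# the fitted values as a pandas Series.
samples = az.extract(idata, group="posterior_predictive", var_names="mu")
fitted_mean = samples.mean("sample").to_pandas()

# 94% highest-density interval per observation, for the credible band.
hdi = az.hdi(idata.posterior_predictive, var_names=["mu"], hdi_prob=0.94)

# Export for custom plotting, e.g. fitted_mean.to_csv("fitted_values.csv")
```

From there, fitted_mean and the two HDI bounds are ordinary pandas/xarray objects that can be styled however you like.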

Guilherme-dL commented 1 month ago

Thank you for the answer. I'm sorry for asking basic questions; I'm not very experienced in programming. From what I understand, the fitted model data is in posterior_predictive.y_hat? But then I get all the MCMC samples, so is the one that is plotted by default the last one? And what about the credible interval that is also plotted? How can I access it? Essentially I'm trying to recreate the first plot that appears in the default result.plot method, which is the only one I need, but I need to change the style to a specific one, which is why I'm trying to get the data.

drbenvincent commented 1 month ago

No problem. I'll consider this a feature request and see if I (or someone else) can have a go at implementing it.

Guilherme-dL commented 1 month ago

Thank you very much. That would be a great help.

lpoug commented 1 month ago

Hello there! I have been lurking around here for some time now and I feel this could be my time to shine! I would very much like to contribute and, if I'm not mistaken, this could be an easy enough task to tackle as a first one.

In any case, I'll try and think about the proper way to do so but please let me know should you have any suggestion @drbenvincent. Thank you!

drbenvincent commented 1 month ago

Sounds great @lpoug

I think it would be really simple to do on an experiment-by-experiment basis. But it's worth thinking about whether we can come up with a generic method that would work across all experiments. If that's doable, we should try to do it.

What do you think? Happy to discuss more, but thought I'd reply quickly to keep the ball rolling.

lpoug commented 1 month ago

Hey @drbenvincent!

I have given this some thought; here are my ideas.

I'm wondering whether there is a use for such functionality across all experiments. I can clearly see it for the PrePostFit experiments, as I have encountered the same issue when doing synthetic control (i.e. reproducing the graphs in the Making Inference part of this chapter of Causal Inference for the Brave and True). I'm not so sure about the other ones and, to be honest, my knowledge of the other experiments is not as sharp as it is for SC.

With that being said:

Hope this was clear enough; let me know what you think! In the meantime, I'll try to dive deeper into the other experiments.

drbenvincent commented 1 month ago

Thanks @lpoug. Since we've only had one request for this so far, we can probably just implement it for the PrePostFit class, i.e. for synthetic control and interrupted time series.

And perhaps right now we could just create a utility function (that lives in plot_utils.py) rather than adding it in as a method. So the API would be something like:

    from causalpy import export_plot_data

    export_plot_data(result, target_filepath)

I guess we can add some checks on the inputs being provided. It looks like all the data needed for reproducing those plots could be saved in a CSV file. The only exception would be the intervention date, so it could be worth considering exporting a YAML file, which can handle nested data, lists, etc. But I'm totally happy to leave those kinds of details to you - that's just a thought I had.

drbenvincent commented 1 month ago

Or, rather than saving, the API could be

    intervention_date, plot_data = export_plot_data(result)

Maybe that's a simpler place to start, and it addresses @Guilherme-dL's original request.
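To illustrate the call shape, here is a minimal sketch of that return-based API. Both export_plot_data and FakeResult are hypothetical: the function body is a placeholder, and FakeResult just stands in for a fitted PrePostFit experiment.

```python
import pandas as pd

def export_plot_data(result):
    """Hypothetical sketch: return the intervention date alongside a tidy
    DataFrame of the pre/post data (prediction and impact columns would be
    added here in a real implementation)."""
    plot_data = pd.concat([result.datapre, result.datapost])
    return result.treatment_time, plot_data

# Minimal stand-in for a fitted experiment, just to show the call shape.
class FakeResult:
    treatment_time = 7
    datapre = pd.DataFrame({"y": [1.0, 2.0]}, index=[5, 6])
    datapost = pd.DataFrame({"y": [3.0, 4.0]}, index=[7, 8])

intervention_date, plot_data = export_plot_data(FakeResult())
print(intervention_date, len(plot_data))  # 7 4
```

Returning the objects (rather than writing a file) leaves the user free to save them in whatever format they want.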

lpoug commented 1 month ago

Thanks for the suggestions @drbenvincent!

I have done all the steps in the CONTRIBUTING.md to set up my local environment without any trouble, which is already a success for me 😄

Some thoughts on your suggestions: I do think the better approach would be to assign the result to a plot_data variable rather than directly saving it to a .csv; at least this has been my use case. The function could return a pandas DataFrame that is basically the DataFrame provided in the data argument plus newly assigned prediction and impact columns. I might be biased, so happy to be challenged on this!

With respect to the intervention date, I think that in most cases it already exists as a variable within the notebook or script (and if not, it is quite easily accessible in self.treatment_time), so it might be redundant to return it through the function.

Finally, regarding the location of the function, tbh I'm happy to follow your suggestion at this point. That being said, I'll start working on it to see if my opinion evolves.

drbenvincent commented 1 month ago

Sounds good. All I'd say is that the function name shouldn't start with plot; it should really be a verb that describes what the function does. So something like get_plot_data?

lpoug commented 4 weeks ago

Hi @drbenvincent!

You can find below what I have done so far. I do have a couple more questions before opening a PR, if that's ok with you. I have tried implementing the function both in plot_utils.py and directly within the PrePostFit class. In both cases, the function returns the data initially provided to the experiment (available through self.datapre and self.datapost) with two additional columns, prediction and impact.

  1. In plot_utils.py

    import arviz as az
    import pandas as pd
    from sklearn.base import RegressorMixin


    def get_prepostfit_data(result) -> pd.DataFrame:
        """
        Utility function to recover the data of a PrePostFit experiment along
        with the prediction and causal impact information.

        :param result:
            The result of a PrePostFit experiment
        """
        # Imported locally to avoid a circular import with the experiments module
        from causalpy.experiments.prepostfit import PrePostFit
        from causalpy.pymc_models import PyMCModel

        if isinstance(result, PrePostFit):
            pre_data = result.datapre.copy()
            post_data = result.datapost.copy()

            if isinstance(result.model, PyMCModel):
                # Bayesian model: average posterior predictive samples
                pre_data["prediction"] = (
                    az.extract(
                        result.pre_pred, group="posterior_predictive", var_names="mu"
                    )
                    .mean("sample")
                    .values
                )
                post_data["prediction"] = (
                    az.extract(
                        result.post_pred, group="posterior_predictive", var_names="mu"
                    )
                    .mean("sample")
                    .values
                )
                pre_data["impact"] = result.pre_impact.mean(dim=["chain", "draw"]).values
                post_data["impact"] = result.post_impact.mean(dim=["chain", "draw"]).values
            elif isinstance(result.model, RegressorMixin):
                # OLS / scikit-learn model: point predictions are stored directly
                pre_data["prediction"] = result.pre_pred
                post_data["prediction"] = result.post_pred
                pre_data["impact"] = result.pre_impact
                post_data["impact"] = result.post_impact
            else:
                raise ValueError("Other model types are not supported")

            ppf_data = pd.concat([pre_data, post_data])
        else:
            raise ValueError("Other experiments are not supported")

        return ppf_data
  2. Within the PrePostFit class

    def get_plot_data(self) -> pd.DataFrame:
        """Recover the data of a PrePostFit experiment along with the prediction and causal impact information.
    
        Internally, this function dispatches to either `get_plot_data_bayesian` or `get_plot_data_ols`
        depending on the model type.
        """
        if isinstance(self.model, PyMCModel):
            return self.get_plot_data_bayesian()
        elif isinstance(self.model, RegressorMixin):
            return self.get_plot_data_ols()
        else:
            raise ValueError("Unsupported model type")
    
    def get_plot_data_bayesian(self) -> pd.DataFrame:
        """
        Recover the data of a PrePostFit experiment along with the prediction and causal impact information.
        """
        if isinstance(self.model, PyMCModel):
            pre_data = self.datapre.copy()
            post_data = self.datapost.copy()
            pre_data["prediction"] = (
                az.extract(
                    self.pre_pred, group="posterior_predictive", var_names="mu"
                )
                .mean("sample")
                .values
            )
            post_data["prediction"] = (
                az.extract(
                    self.post_pred, group="posterior_predictive", var_names="mu"
                )
                .mean("sample")
                .values
            )
            pre_data["impact"] = self.pre_impact.mean(dim=["chain", "draw"]).values
            post_data["impact"] = self.post_impact.mean(dim=["chain", "draw"]).values
    
            self.data_plot = pd.concat([pre_data, post_data])
    
            return self.data_plot
        else:
            raise ValueError("Unsupported model type")
    
    def get_plot_data_ols(self) -> pd.DataFrame:
        """
        Recover the data of a PrePostFit experiment along with the prediction and causal impact information.
        """
        pre_data = self.datapre.copy()
        post_data = self.datapost.copy()
        pre_data["prediction"] = self.pre_pred
        post_data["prediction"] = self.post_pred
        pre_data["impact"] = self.pre_impact
        post_data["impact"] = self.post_impact
        self.data_plot = pd.concat([pre_data, post_data])
    
        return self.data_plot

I think there is a point to be made for having the function within the PrePostFit class, in order to avoid any confusion as to its potential use for other experiments' classes. What do you think? If so, I tried to reproduce the logic of a global function that dispatches to either a Bayesian or an OLS version (similar to plot()). Maybe that's too much, and a single get_plot_data() function that handles both cases would be sufficient for this particular purpose?

Let me know what you think when you have some time 😃

drbenvincent commented 2 weeks ago

Hi @lpoug

I think having separate get_plot_data_ols and get_plot_data_bayesian functions makes sense. The underlying data is stored and represented in different ways, so it has to be handled differently.

The get_plot_data_bayesian might also want to get the HDIs, but maybe that was already on your roadmap.

I think I agree that having get_plot_data in the PrePostFit class might be better. It could be good to have a corresponding method on the BaseExperiment class which by default raises a NotImplementedError; that would then be overridden by what you define in PrePostFit.
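A minimal sketch of that base-class pattern (the class bodies here are heavily simplified, hypothetical stubs; the real classes live in causalpy.experiments):

```python
import pandas as pd

class BaseExperiment:
    def get_plot_data(self) -> pd.DataFrame:
        # Default: experiments opt in to plot-data export by overriding this.
        raise NotImplementedError(
            "get_plot_data is not implemented for this experiment type"
        )

class PrePostFit(BaseExperiment):
    def get_plot_data(self) -> pd.DataFrame:
        # Would dispatch to the Bayesian or OLS variant; stubbed here.
        return pd.DataFrame({"prediction": [0.0], "impact": [0.0]})
```

That way, calling get_plot_data on an experiment that doesn't support it fails loudly with a clear message instead of an AttributeError.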