py-why / dowhy

DoWhy is a Python library for causal inference that supports explicit modeling and testing of causal assumptions. DoWhy is based on a unified language for causal inference, combining causal graphical models and potential outcomes frameworks.
https://www.pywhy.org/dowhy
MIT License

Support of time series data? #174

Closed · rmitsch closed this issue 3 years ago

rmitsch commented 3 years ago

I was wondering how to perform causal inference with DoWhy on time series data. I'm aware there are a few specialized packages for this, e.g. https://github.com/tcassou/causal_impact. I understand that DoWhy does not offer dedicated support for time series data, but I am curious how to utilize it for that purpose nonetheless.

Specifically I would like to know:

Thanks for the great work!

amit-sharma commented 3 years ago

@rmitsch DoWhy is a general package for causal inference, so it can also be applied to time series data. The catch is that you'll have to do a lot of the data processing yourself so that the data is in a form that can be input to DoWhy's models.

This notebook presents a good example. The underlying input data is a time series, but the user has pre-processed it to yield columns such as 'signup_month' and an aggregated transaction amount. Once the dataset has been processed to have semantically meaningful variables, it becomes easier to draw a plausible graph connecting those variables and to conduct the analysis.

Let me try to answer your specific questions one by one:

1. The key assumption that DoWhy makes is that all rows of the data are sampled i.i.d. from some distribution. Raw time-series data would violate that assumption since consecutive rows depend on each other. To resolve this, we need to make assumptions about how the data at time t depends on previous values (and optionally, aggregate the data somewhat to make analysis easier).

2. One common assumption is the Markov assumption: given the previous value of a time series, its current value is independent of all earlier values. Your suggestion of including the previous value as a column corresponds to this Markov assumption and can work well. Of course, you can stretch this assumption to include the previous k>1 values. If you do this, you still need to come up with a plausible graph connecting the previous values to the current value. The linked notebook does it in a simple way by aggregating all previous and future activity for any given month, and then assuming that the aggregate previous activity causes the current activity.

3. How to model the previous outcome in the graph? That depends on your problem domain. Typically, in health scenarios, the previous outcome affects the current treatment (e.g., a doctor may decide the current dosage of medicine based on the patient's previous outcome with the same medicine). In other cases, however, the treatment may be independent of the previous outcome but dependent on other factors in the previous time-step (e.g., there may be a standard schedule for increasing the dosage, so the doctor decides the current dosage based on the prior dosage). The exact graph modelling will depend on your problem; two common variants are sketched right after this list. If you can share details about your problem and an example graph you have in mind, I'm happy to share my comments on the graph.
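To make the two variants in point 3 concrete, here is how they could be written as DoWhy-compatible DOT graph strings (a sketch only; the variable names are hypothetical):

# Variant A: the previous outcome drives the current treatment
# (the doctor adjusts the dosage based on the patient's last outcome).
graph_prev_outcome = """digraph {
  outcome_prev -> treatment; outcome_prev -> outcome;
  treatment -> outcome;
}"""

# Variant B: the current treatment depends on the previous treatment,
# not on the previous outcome (a standard dosage schedule).
graph_prev_treatment = """digraph {
  treatment_prev -> treatment; outcome_prev -> outcome;
  treatment -> outcome;
}"""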

For some examples, you can look at slide 47 of this KDD tutorial on causal inference. For a more comprehensive reference, see Part III of Hernán and Robins' book.

That said, there are still some aspects of time-series analysis that will be difficult to model with the methods in DoWhy: for example, periodicity of trends, weekday/weekend differences, or special events (e.g., holiday spikes). This is something we'd like to add to DoWhy, but so far it is not supported.

rmitsch commented 3 years ago

@amit-sharma Thanks for the helpful reply!

My scenario is as follows: I have hourly data on the sales volume of a certain product for several different stores. I'm interested in the causal impact of pricing on the sales volume for this product, i.e. I'm looking to find the price elasticity of demand. I have data on the price of the raw material needed to fabricate the product in question, which could be used as an instrumental variable.

Since my treatment (product price) is continuous, segregating into before/after groups doesn't seem applicable in this case. I also assume that time-related aspects like month, day of the week, time of day, etc. play a role; as of now, they are included as one-hot encoded features. I.e., my table is structured like this:

datehour    |    quantity_sold    |    raw_material_price    |    ...    |    is_holiday    |    is_monday    |    ...

My current approach would hence be to add the k previous values to the table to resolve the independence violation, formulate an appropriate graph, and process the table as-is, i.e. with one row representing one hour. Aggregating the data to reduce the number of rows would be great, but I don't see how to do that without losing information, e.g. the prior k values.
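To make that concrete, here is a rough pandas sketch of the preparation step (column names follow the table above; the price column, the lag depth k, and the file path are assumptions for illustration):

import pandas as pd

# Hypothetical hourly table matching the schema above.
df = pd.read_csv("sales.csv", parse_dates=["datehour"])

# One-hot encode the time-related aspects (month, weekday, hour).
df["month"] = df["datehour"].dt.month
df["weekday"] = df["datehour"].dt.dayofweek
df["hour"] = df["datehour"].dt.hour
df = pd.get_dummies(df, columns=["month", "weekday", "hour"])

# Add the k previous values of price and sales volume as columns,
# so each row carries the history it depends on. This assumes the frame
# is sorted by datehour for a single store (use groupby("store").shift
# otherwise).
k = 3
for lag in range(1, k + 1):
    df[f"price_lag{lag}"] = df["price"].shift(lag)
    df[f"quantity_sold_lag{lag}"] = df["quantity_sold"].shift(lag)

# The first k rows lack a complete history; drop them.
df = df.dropna().reset_index(drop=True)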

My follow-up questions:

Thank you for your support!

amit-sharma commented 3 years ago

Ah, price elasticity is a classic economics problem. I should clarify that I'm not an economist--so I'll just address your question from the point of view of causal inference.

Since your goal is to estimate the effect of price on sales, I don't think you necessarily need time-series-specific methods. Such methods are useful when you want to model the relationship of a quantity with time (e.g., predicting how a variable will increase over time, or comparing the effect of two treatments over time). In your case, however, the relationship you are interested in is between price and sales; time is simply a confounder or an effect modifier.

So your formulation makes sense to me. Here are the two key issues to think about: whether this tabular data captures the essence of the time-series pattern, and whether each row can be assumed independent of the others (i.e., have we removed the time-based correlation?).

Capturing time-series patterns

It seems reasonable to turn the time-series-specific variables into columns of your data (e.g., is_monday). I suggest thinking about whether you can model most of the time-varying effects through such derived variables.

Whether each data point can be considered independent

Your idea of including the t-1, t-2, ..., t-k values of the variables makes sense here. Essentially, you are assuming that prices and sales are not affected by any event that happened more than k time-steps earlier. Note that there are two assumptions at play here: that k lags are enough (nothing before t-k matters), and that conditioning on those lags makes the remaining rows effectively independent. Do verify that both are plausible in your scenario.

How would the graph look?

Once these two issues are resolved, creating the graph is relatively easy. To do that, we need to know which variables from a previous time-step can affect both price and sales. You can simply add all of them as confounders. If you are interested in heterogeneous treatment effects, you can also add a subset of them (e.g., location) as effect modifiers. So the graph would look like:

price(t)->sales(t)
price(t-1) -> sales(t); price(t-1) -> price(t)
sales(t-1) -> price(t); sales(t-1) -> sales(t)
confounder(t-1) -> price(t); confounder(t-1) -> sales(t)

[Confounders can also affect one another over time, but that should not matter for the target causal question.]
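A minimal sketch of how these edges could be passed to DoWhy, with k=1 and hypothetical column names (price_t, sales_t, and a single placeholder confounder; the data frame is assumed to be prepared as above):

from dowhy import CausalModel

# The edges listed above, written as a DOT graph string.
causal_graph = """digraph {
  price_t -> sales_t;
  price_tm1 -> sales_t; price_tm1 -> price_t;
  sales_tm1 -> price_t; sales_tm1 -> sales_t;
  confounder_tm1 -> price_t; confounder_tm1 -> sales_t;
}"""

model = CausalModel(
    data=df,              # one row per hour, lag columns included
    treatment="price_t",
    outcome="sales_t",
    graph=causal_graph,
)
identified_estimand = model.identify_effect()
print(identified_estimand)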

For an example of how to estimate the effect, you can look at the EconML library's case study on price elasticity in this notebook [section 4], which uses the double machine learning (DML) method. I would also recommend DML, since it seems that you'll have a large number of confounders. You can call EconML's DML method from DoWhy like this [more examples in this notebook]:

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LassoCV
from sklearn.ensemble import GradientBoostingRegressor

dml_estimate = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.econml.dml.DMLCateEstimator",
    control_value=0,
    treatment_value=1,
    confidence_intervals=False,
    method_params={
        "init_params": {
            "model_y": GradientBoostingRegressor(),  # outcome model
            "model_t": GradientBoostingRegressor(),  # treatment model
            "model_final": LassoCV(),                # final-stage effect model
            "featurizer": PolynomialFeatures(degree=1, include_bias=True),
        },
        "fit_params": {},
    },
)

You also mentioned that you are considering the raw material price as an IV. I would be a little careful there: if the raw material price and the sales are affected by the same confounders, it may not be a valid IV. If you are confident about the IV, perhaps you can try both backdoor and IV methods and check whether you obtain similar estimates (as a robustness check).
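A sketch of that robustness check using DoWhy's built-in IV estimator (this assumes the instrument, e.g. raw_material_price, is declared in the graph so that identify_effect() picks it up):

# Backdoor estimate (e.g., the DML call above) vs. an IV estimate.
iv_estimate = model.estimate_effect(
    identified_estimand,
    method_name="iv.instrumental_variable",
)
# Similar values from both methods support the validity of the IV.
print(dml_estimate.value, iv_estimate.value)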

rmitsch commented 3 years ago

I am reasonably sure about the IV, but will compare it against the DML methods. Regarding the suggested causal graph:

It is my understanding that the estimated effect is E(t = 1) - E(t = 0). In the case of a continuous treatment like price that is not restricted to the interval [0, 1], I expect the effect to represent E(t = treatment_value) - E(t = control_value), where treatment_value and control_value correspond to the values specified in estimate_effect(). So for price elasticity I would expect a positive value for estimate.value if treatment_value < control_value, assuming that lower prices increase sales. Is that correct?

amit-sharma commented 3 years ago

That's correct---for continuous treatments you can specify any treatment_value and control_value based on your requirement. If the treatment_value is lower than control_value, you should expect a positive effect. But a more conventional way to do it is to keep the treatment_value as the higher price, and then report a negative effect as evidence of price elasticity.
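As a concrete instance of that convention (the prices are made up, and DoWhy's plain linear-regression estimator is used just for brevity):

estimate = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.linear_regression",
    control_value=10.0,    # baseline price
    treatment_value=12.0,  # raised price
)
# A negative value indicates that raising the price reduces sales.
print(estimate.value)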

rmitsch commented 3 years ago

Right, that sounds reasonable. No further questions here, I'll close this issue. Thank you so much for your time :-)