py-why / dowhy

DoWhy is a Python library for causal inference that supports explicit modeling and testing of causal assumptions. DoWhy is based on a unified language for causal inference, combining causal graphical models and potential outcomes frameworks.
https://www.pywhy.org/dowhy
MIT License

Support of time series data? #174

Closed · rmitsch closed this issue 3 years ago

rmitsch commented 3 years ago

I was wondering how to perform causal inference with DoWhy on time series data. I'm aware there are a few specialized packages for this, e.g. https://github.com/tcassou/causal_impact. I understand that DoWhy does not offer dedicated support for time series data, but I am curious how to utilize it for that purpose nonetheless.

Specifically I would like to know:

Thanks for the great work!

amit-sharma commented 3 years ago

@rmitsch DoWhy is a general package for causal inference, so it can also be applied to time series data. The catch is that you'll have to do a lot of the data processing yourself so that the data is in a form that can be input to DoWhy's models.

This notebook presents a good example. The underlying input data is a time series, but the user has pre-processed it to yield columns such as 'signup_month' and an aggregated transaction amount. Once the dataset has been processed to have semantically meaningful variables, it becomes easier to draw a plausible graph connecting those variables and to conduct the analysis.

Let me try to answer your specific questions one by one:

1. The key assumption that DoWhy makes is that all rows of the data are sampled i.i.d. from some distribution. Raw time-series data would violate that assumption since consecutive rows depend on each other. To resolve this, we need to make assumptions about how the data at time t depends on previous values (and optionally, aggregate the data somewhat to make analysis easier).

2. One common assumption is the Markov assumption: given the previous value of a time series, its current value is independent of all earlier values. Your suggestion of including the previous value as a column corresponds to this Markov assumption and can work well. Of course, you can stretch this assumption to include the previous k>1 values. If you do this, you still need to come up with a plausible graph connecting the previous values to the current value. The linked notebook does it in a simple way by aggregating all previous and future activity for any given month, and then assuming that the aggregate previous activity causes the current activity.

3. How to model the previous outcome in the graph? That depends on your problem domain. Typically, in health scenarios, the previous outcome affects the current treatment (e.g., a doctor may decide the current dosage of medicine based on the patient's previous outcome with the same medicine). In other cases, however, the treatment may be independent of the previous outcome but dependent on other factors in the previous time-step (e.g., there may be a standard schedule for increasing the dosage, so the doctor decides the current dosage based on the prior dosage). The exact graph modelling will depend on your problem; two common variants are sketched right after this list. If you can share details about your problem and an example graph you have in mind, I'm happy to share my comments on the graph.
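To make the two variants in point 3 concrete, here is how they could be written as DoWhy-compatible DOT graph strings (a sketch only; the variable names are hypothetical):

# Variant A: the previous outcome drives the current treatment
# (the doctor adjusts the dosage based on the patient's last outcome).
graph_prev_outcome = """digraph {
  outcome_prev -> treatment; outcome_prev -> outcome;
  treatment -> outcome;
}"""

# Variant B: the current treatment depends on the previous treatment,
# not on the previous outcome (a standard dosage schedule).
graph_prev_treatment = """digraph {
  treatment_prev -> treatment; outcome_prev -> outcome;
  treatment -> outcome;
}"""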

For some examples, you can look at slide 47 of this KDD tutorial on causal inference. For a more comprehensive reference, see Part III of Hernán and Robins' book.

That said, there are still some aspects of time-series analysis that will be difficult to model with the methods in DoWhy: for example, periodicity of trends, weekday/weekend differences, or special events (e.g., holiday spikes). This is something we'd like to add to DoWhy, but so far it is not supported.

rmitsch commented 3 years ago

@amit-sharma Thanks for the helpful reply!

My scenario is as follows: I have hourly data on the sales volume of a certain product for several different stores. I'm interested in the causal impact of pricing on the sales volume for this product, i.e. I'm looking to find the price elasticity of demand. I have data on the price of the raw material needed to fabricate the product in question, which could be used as an instrumental variable.

Since my treatment (product price) is continuous, segregating into before/after groups doesn't seem applicable in this case. I also assume that time-related aspects like month, day of the week, time of day, etc. play a role; as of now, they are included as one-hot encoded features. I.e., my table is structured like this:

datehour    |    quantity_sold    |    raw_material_price    |    ...    |    is_holiday    |    is_monday    |    ...

My current approach would hence be to add the k previous values to the table to resolve the independence violation, formulate an appropriate graph, and process the table as-is, i.e. with one row representing one hour. Aggregating the data to reduce the number of rows would be great, but I don't see how to do that without losing information, e.g. the prior k values.
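To make that concrete, here is a rough pandas sketch of the preparation step (column names follow the table above; the price column, the lag depth k, and the file path are assumptions for illustration):

import pandas as pd

# Hypothetical hourly table matching the schema above.
df = pd.read_csv("sales.csv", parse_dates=["datehour"])

# One-hot encode the time-related aspects (month, weekday, hour).
df["month"] = df["datehour"].dt.month
df["weekday"] = df["datehour"].dt.dayofweek
df["hour"] = df["datehour"].dt.hour
df = pd.get_dummies(df, columns=["month", "weekday", "hour"])

# Add the k previous values of price and sales volume as columns,
# so each row carries the history it depends on. This assumes the frame
# is sorted by datehour for a single store (use groupby("store").shift
# otherwise).
k = 3
for lag in range(1, k + 1):
    df[f"price_lag{lag}"] = df["price"].shift(lag)
    df[f"quantity_sold_lag{lag}"] = df["quantity_sold"].shift(lag)

# The first k rows lack a complete history; drop them.
df = df.dropna().reset_index(drop=True)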

My follow-up questions:

Thank you for your support!

amit-sharma commented 3 years ago

Ah, price elasticity is a classic economics problem. I should clarify that I'm not an economist--so I'll just address your question from the point of view of causal inference.

Since your goal is to estimate the effect of price on sales, I don't think you necessarily need time-series-specific methods. Such methods are useful when you want to model the relationship of a quantity with time (e.g., predicting how a variable will increase over time, or comparing the effect of two treatments over time). In your case, however, the relationship you are interested in is between price and sales; time is simply a confounder or an effect modifier.

So your formulation makes sense to me. Here are the two key issues to think about: whether this tabular data captures the essence of the time-series pattern, and whether each row can be assumed independent of the others (i.e., have we removed the time-based correlation?).

Capturing time-series patterns

It seems reasonable to turn the time-series-specific variables into columns of your data (e.g., is_monday). I suggest thinking about whether you can model most of the time-varying effects through such derived variables.

Whether each data point can be considered independent

Your idea of including the t-1, t-2, ..., t-k values of the variables makes sense here. Essentially, you are assuming that prices and sales are not affected by any event that happened more than k time-steps earlier. Note that there are two assumptions at play here: that k lags are enough (nothing before t-k matters), and that conditioning on those lags makes the remaining rows effectively independent. Do verify that both are plausible in your scenario.

How would the graph look?

Once these two issues are resolved, creating the graph is relatively easy. To do that, we need to know which variables from a previous time-step can affect both price and sales. You can simply add all of them as confounders. If you are interested in heterogeneous treatment effects, you can also add a subset of them (e.g., location) as effect modifiers. So the graph would look like:

price(t)->sales(t)
price(t-1) -> sales(t); price(t-1) -> price(t)
sales(t-1) -> price(t); sales(t-1) -> sales(t)
confounder(t-1) -> price(t); confounder(t-1) -> sales(t)

[Confounders can also affect one another over time, but that should not matter for the target causal question.]
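A minimal sketch of how these edges could be passed to DoWhy, with k=1 and hypothetical column names (price_t, sales_t, and a single placeholder confounder; the data frame is assumed to be prepared as above):

from dowhy import CausalModel

# The edges listed above, written as a DOT graph string.
causal_graph = """digraph {
  price_t -> sales_t;
  price_tm1 -> sales_t; price_tm1 -> price_t;
  sales_tm1 -> price_t; sales_tm1 -> sales_t;
  confounder_tm1 -> price_t; confounder_tm1 -> sales_t;
}"""

model = CausalModel(
    data=df,              # one row per hour, lag columns included
    treatment="price_t",
    outcome="sales_t",
    graph=causal_graph,
)
identified_estimand = model.identify_effect()
print(identified_estimand)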

For an example of how to estimate the effect, you can look at the EconML library's case study on price elasticity in this notebook [section 4], which uses the double machine learning (DML) method. I would also recommend DML, since it seems that you'll have a large number of confounders. You can call EconML's DML method from DoWhy like this [more examples in this notebook]:

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LassoCV
from sklearn.ensemble import GradientBoostingRegressor

dml_estimate = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.econml.dml.DMLCateEstimator",
    control_value=0,
    treatment_value=1,
    confidence_intervals=False,
    method_params={
        "init_params": {
            "model_y": GradientBoostingRegressor(),  # outcome model
            "model_t": GradientBoostingRegressor(),  # treatment model
            "model_final": LassoCV(),                # final-stage effect model
            "featurizer": PolynomialFeatures(degree=1, include_bias=True),
        },
        "fit_params": {},
    },
)

You also mentioned that you are considering the raw material price as an IV. I would be a little careful there: if the raw material price and the sales are affected by the same confounders, it may not be a valid IV. If you are confident about the IV, perhaps you can try both backdoor and IV methods and check whether you obtain similar estimates (as a robustness check).
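A sketch of that robustness check using DoWhy's built-in IV estimator (this assumes the instrument, e.g. raw_material_price, is declared in the graph so that identify_effect() picks it up):

# Backdoor estimate (e.g., the DML call above) vs. an IV estimate.
iv_estimate = model.estimate_effect(
    identified_estimand,
    method_name="iv.instrumental_variable",
)
# Similar values from both methods support the validity of the IV.
print(dml_estimate.value, iv_estimate.value)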

rmitsch commented 3 years ago

I am reasonably sure about the IV, but will compare it against the DML methods. Regarding the suggested causal graph:

It is my understanding that the estimated effect is E(t = 1) - E(t = 0). In the case of a continuous treatment like price that is not restricted to the interval [0, 1], I expect the effect to represent E(t = treatment_value) - E(t = control_value), where treatment_value and control_value correspond to the values specified in estimate_effect(). So for price elasticity I would expect a positive value for estimate.value if treatment_value < control_value, assuming that lower prices increase sales. Is that correct?

amit-sharma commented 3 years ago

That's correct---for continuous treatments you can specify any treatment_value and control_value based on your requirement. If the treatment_value is lower than control_value, you should expect a positive effect. But a more conventional way to do it is to keep the treatment_value as the higher price, and then report a negative effect as evidence of price elasticity.
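As a concrete instance of that convention (the prices are made up, and DoWhy's plain linear-regression estimator is used just for brevity):

estimate = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.linear_regression",
    control_value=10.0,    # baseline price
    treatment_value=12.0,  # raised price
)
# A negative value indicates that raising the price reduces sales.
print(estimate.value)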

rmitsch commented 3 years ago

Right, that sounds reasonable. No further questions here, I'll close this issue. Thank you so much for your time :-)