py-why / EconML

ALICE (Automated Learning and Intelligence for Causation and Economics) is a Microsoft Research project aimed at applying Artificial Intelligence concepts to economic decision making. One of its goals is to build a toolkit that combines state-of-the-art machine learning techniques with econometrics in order to bring automation to complex causal inference problems. To date, the ALICE Python SDK (econml) implements orthogonal machine learning algorithms such as the double machine learning work of Chernozhukov et al. This toolkit is designed to measure the causal effect of some treatment variable(s) t on an outcome variable y, controlling for a set of features x.
https://www.microsoft.com/en-us/research/project/alice/

Pricing use case : Dataset requirements #547

Closed sami-ka closed 2 years ago

sami-ka commented 2 years ago

I am currently using econml in a pricing use case.

In my dataset, transactions are aggregated by month, showing which product each customer buys and in what quantity. If customer 1 buys 10 units of product A in January and 5 in March, I will have something like this:

| Period  | Customer | Product | Quantity |
|---------|----------|---------|----------|
| January | 1        | A       | 10       |
| March   | 1        | A       | 5        |

Is it ok to leave it as is?

Should I complete the dataset and have an additional row for each period of time without transactions? I have illustrated my point in the table below.

| Period   | Customer | Product | Quantity |
|----------|----------|---------|----------|
| January  | 1        | A       | 10       |
| February | 1        | A       | 0        |
| March    | 1        | A       | 5        |
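
For reference, one way the completed table could be built is sketched below with pandas. The column names follow the table above; the helper `fill_missing_periods` and the January to March window are assumptions made for illustration, not part of econml.

```python
import pandas as pd

# Observed transactions only: months without a purchase are absent.
sales = pd.DataFrame(
    {
        "Period": ["January", "March"],
        "Customer": [1, 1],
        "Product": ["A", "A"],
        "Quantity": [10, 5],
    }
)

# Assumption: the study window is January to March; adjust to the real data.
periods = ["January", "February", "March"]


def fill_missing_periods(df, periods):
    """Hypothetical helper: add a Quantity=0 row for every observed
    (Customer, Product) pair in each period that has no transaction."""
    pairs = df[["Customer", "Product"]].drop_duplicates()
    # Cross join gives every period for every observed pair.
    grid = pd.DataFrame({"Period": periods}).merge(pairs, how="cross")
    # Left join keeps the full grid; unmatched rows get NaN quantities.
    full = grid.merge(df, on=["Period", "Customer", "Product"], how="left")
    full["Quantity"] = full["Quantity"].fillna(0).astype(int)
    return full


completed = fill_missing_periods(sales, periods)
print(completed)
```

Any other column that is only recorded on transaction rows (price, for example) would come back as NaN for the added rows and would have to be filled from another source.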

kbattocchi commented 2 years ago

Yes, I think it would be best to add the missing months. If you're using something like double machine learning (DML), the first stage fits a price model and a quantity model, and you'd want the quantity model's training to see that 0 units were sold in those months. You'd also want those months represented in the second stage, where the residuals from those predictions are used to train the final model.
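
To make the two stages concrete, here is a minimal NumPy sketch of the residual-on-residual idea. It is not EconML's implementation: the data are synthetic, the first-stage models are plain linear regressions, and the cross-fitting that double machine learning normally uses is left out. All variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Synthetic data (an assumption for illustration, not a real pricing dataset).
x = rng.normal(size=n)                    # a feature, e.g. a seasonality index
price = 10 + 2 * x + rng.normal(size=n)   # treatment depends on the feature
true_effect = -1.5                        # change in quantity per unit of price
quantity = 50 + 3 * x + true_effect * price + rng.normal(size=n)


def fit_predict(features, target):
    """Ordinary least squares with an intercept; returns in-sample predictions."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return design @ coef


# First stage: predict price and quantity from the features, keep the residuals.
price_resid = price - fit_predict(x, price)
quantity_resid = quantity - fit_predict(x, quantity)

# Final stage: regress the quantity residuals on the price residuals.
effect_hat = (price_resid @ quantity_resid) / (price_resid @ price_resid)
print(f"estimated effect: {effect_hat:.3f} (true value {true_effect})")
```

Every row contributes to both the first-stage fits and the final regression, which is why rows for months with zero sales need to be present in the data rather than implied by their absence.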

sami-ka commented 2 years ago

Thanks for your answer!