Open JunhaoWang opened 4 years ago
I have a contextual bandit problem (S - state, A - action, R - reward) where S is a high-dimensional vector, A is a continuous action, and R is a continuous reward. How do I learn the optimal mapping from state to action that maximizes reward? It seems the current package can only estimate a continuous treatment effect relative to a control treatment, but I don't have a control treatment. Furthermore, the current estimators don't scale to large, high-dimensional data (millions of samples in the train/test sets, thousands of dimensions). Is there a way to make them more scalable?

Sorry for the late response. On your question about scalability: this is feedback we've received from a few other people as well, so it's something we're starting to think about, but we don't have a short-term solution. Even in the longer term we don't have ambitions to scale to distributed settings, but we would like to make it possible to efficiently process as much data as you can load onto a single machine.

For the first question, maybe you could elaborate on your goals, but I can't immediately see a good way to use our package in a traditional contextual bandit setting. We have estimators like NonParamDMLCateEstimator that support continuous treatments and can learn nonparametric conditional treatment effects. It's true that the estimated effect is expressed as the difference in R when moving from A=0 to some other value a, but you can work around this by adding a prediction of the reward at A=0 (which you could get from the fitted first-stage y models, for instance). However, I'm not sure that the requirements on the data-generating process for our DML methods to work correctly will be satisfied in a contextual bandit setting: unless your policy is completely random, your choices of which treatments to apply won't depend solely on the state S but also on previously seen rewards R.
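For concreteness, here is a minimal sketch of the workaround described above. It assumes an econml version that exposes `NonParamDMLCateEstimator` in `econml.dml` (renamed `NonParamDML` in later releases); exact constructor and `fit` signatures may differ slightly between versions, and the separate regression of R on (S, A) used below is just one possible stand-in for "a prediction of the reward at A=0" (the fitted first-stage y models are another option, as noted above).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from econml.dml import NonParamDMLCateEstimator  # called NonParamDML in newer econml releases

# Synthetic stand-in data: S = state, A = continuous action, R = reward.
rng = np.random.default_rng(0)
n, d = 2000, 10
S = rng.normal(size=(n, d))
A = rng.uniform(0, 1, size=n)
R = S[:, 0] * A - 0.5 * A ** 2 + 0.1 * rng.normal(size=n)

# Nonparametric DML estimator for the conditional effect of A on R given S.
est = NonParamDMLCateEstimator(
    model_y=GradientBoostingRegressor(),      # first stage: E[R | S]
    model_t=GradientBoostingRegressor(),      # first stage: E[A | S]
    model_final=GradientBoostingRegressor(),  # final stage: nonparametric CATE
)
est.fit(R, A, X=S)

# est.effect(S, T0=0, T1=a) is the estimated change in R when moving from A=0 to A=a.
candidate_actions = np.linspace(0, 1, 25)
effects = np.column_stack([est.effect(S, T0=0, T1=a) for a in candidate_actions])

# Rough baseline "reward at A=0": a regression of R on (S, A), evaluated at A=0.
baseline = GradientBoostingRegressor().fit(np.column_stack([S, A]), R)
r_at_zero = baseline.predict(np.column_stack([S, np.zeros(n)]))

# For each state, pick the candidate action with the highest predicted reward.
predicted_reward = r_at_zero[:, None] + effects
best_action = candidate_actions[np.argmax(predicted_reward, axis=1)]
```

Note that this grid search over candidate actions only gives a greedy state-to-action mapping from logged data; it does not address the exploration or feedback-dependence concerns raised above.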