Offline Contextual Bandits

alex-seto commented 2 years ago

Hello all!

I have been experimenting with LinUCB to build a recipe recommendation system from historical data. I have read the original LinUCB paper as well as http://www.gatsby.ucl.ac.uk/~chuwei/paper/wsdm11.pdf. As I understand it, the LinUCB implementations in tf-agents are designed as online learning algorithms for settings where the model of the environment is more or less complete. Has anyone implemented a linear bandit trained on historical data (logged by a different policy), specifically with the tf-agents library? I have seen that Vowpal Wabbit may have offline bandit algorithms, but I would like to stick with the tf-agents framework as I move from my baseline LinUCB bandit to a more complex one that uses deep learning to approximate rewards.

Thanks!

bartokg commented 2 years ago

Hi Alex, you can call the train function on historical data with any TF-Agents agent, including LinUCB. You need to instantiate an agent, load the data, and make sure the data is in the form of "trajectories". Then you can just call agent.train() on it. After training, you can call agent.policy.action() to choose an action on fresh observations. Hope this helps. Gabor
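
To make the train-then-act flow above concrete, here is a minimal, library-free sketch of what fitting LinUCB on logged data amounts to. It deliberately uses NumPy rather than tf-agents, so it does not show the trajectory format or the real agent.train() and agent.policy.action() signatures. The class name OfflineLinUCB, its parameters alpha and ridge, and the synthetic log are all assumptions made up for illustration.

```python
import numpy as np


class OfflineLinUCB:
    """Hypothetical disjoint LinUCB: one ridge regression per arm."""

    def __init__(self, n_arms, dim, alpha=1.0, ridge=1.0):
        self.alpha = alpha
        # Per-arm design matrix A_a = ridge * I + sum(x x^T).
        self.A = np.stack([ridge * np.eye(dim) for _ in range(n_arms)])
        # Per-arm reward-weighted context sum b_a = sum(r * x).
        self.b = np.zeros((n_arms, dim))

    def train(self, contexts, actions, rewards):
        """Update only the arm that the logging policy actually played."""
        for x, a, r in zip(contexts, actions, rewards):
            self.A[a] += np.outer(x, x)
            self.b[a] += r * x

    def action(self, x):
        """Pick the arm with the highest upper confidence bound."""
        scores = []
        for A_a, b_a in zip(self.A, self.b):
            A_inv = np.linalg.inv(A_a)
            theta = A_inv @ b_a
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))


# Synthetic historical log (assumption): a uniform-random logging policy,
# where arm k is rewarded in proportion to feature k of the context.
rng = np.random.default_rng(0)
n_rounds, n_arms, dim = 2000, 3, 3
true_theta = np.eye(n_arms, dim)
contexts = rng.normal(size=(n_rounds, dim))
actions = rng.integers(0, n_arms, size=n_rounds)
rewards = np.sum(true_theta[actions] * contexts, axis=1)
rewards += 0.1 * rng.normal(size=n_rounds)

bandit = OfflineLinUCB(n_arms=n_arms, dim=dim)
bandit.train(contexts, actions, rewards)  # one pass over the logged data

# Act on a fresh observation after training.
chosen = bandit.action(np.array([1.0, 0.0, 0.0]))
```

In this toy log every arm is played roughly equally often, so each per-arm model sees enough data. A real log collected by a non-uniform policy may leave some arms with very few samples, and those arms will carry a larger exploration bonus.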

sj31867 commented 1 year ago

Hi @bartokg, I have tried training my agent as you suggested, but after training my model keeps selecting the same arm for every observation. I can't work out why this is happening. I have not normalized my context features; will this hurt the predictions, or does the agent handle normalization internally?
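
The thread does not say whether the agent normalizes contexts internally, so the following is only a generic precaution, not a confirmed fix for the single-arm behavior. It sketches standardizing context features with statistics computed from the logged data, plus a quick check of how often each arm appears in the log. The helper names fit_standardizer and standardize and the synthetic data are hypothetical.

```python
import numpy as np


def fit_standardizer(contexts, eps=1e-8):
    """Compute per-feature mean and std from the logged contexts."""
    mean = contexts.mean(axis=0)
    std = np.maximum(contexts.std(axis=0), eps)  # guard constant features
    return mean, std


def standardize(contexts, mean, std):
    """Apply the same statistics at training time and at serving time."""
    return (contexts - mean) / std


# Synthetic contexts (assumption) with wildly different feature scales.
rng = np.random.default_rng(0)
scales = np.array([1.0, 100.0, 0.01])
offsets = np.array([0.0, 50.0, 5.0])
raw_contexts = rng.normal(size=(1000, 3)) * scales + offsets

mean, std = fit_standardizer(raw_contexts)
scaled_contexts = standardize(raw_contexts, mean, std)

# Sanity check on the log itself: an arm that the logging policy rarely
# played gives the learner very little data for that arm.
logged_actions = rng.integers(0, 3, size=1000)
arm_counts = np.bincount(logged_actions, minlength=3)
print("samples per arm:", arm_counts)
```

The same mean and std should be reused when scoring fresh observations, otherwise the policy sees inputs on a different scale from the ones it was trained on.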

subhambiswas-angelone commented 1 year ago

Hi @bartokg, could you please paste a simple example of these steps?
