Type of change
Description
Implemented the whole package.
RTB Synthetic Environment
In the env/ directory, we provide an RTB (real-time bidding) simulation with the OpenAI Gym interface. The base environment, RTBEnv, is in env/rtb.py. RTBEnv is highly configurable: it calls RTBSyntheticSimulator (in env/simulation/rtb_synthetic.py) and Bidder (in env/bidder.py) to simulate the auction environment and the bidding function, respectively.
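Since RTBEnv follows the Gym interface, interacting with it looks like any other Gym environment. A minimal sketch, assuming the default configuration is usable as-is:

```python
from env.rtb import RTBEnv

env = RTBEnv()  # assuming the defaults give a usable environment

obs = env.reset()
done = False
episode_reward = 0.0
while not done:
    # Sample a random adjust rate, just to illustrate the classic Gym loop.
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
    episode_reward += reward
print(f"episode reward: {episode_reward}")
```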
Users can customize RTBSyntheticSimulator by setting the configurations of RTBEnv, which include WinningPriceDistribution, ClickThroughRate, and ConversionRate. If you want to customize these classes, refer to env/simulator/function.py and override the base class.
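For example, a custom click-through rate could look like the sketch below. The base-class name, import path, and method signature here are assumptions; check env/simulator/function.py for the actual interface:

```python
import numpy as np

from env.simulator.function import BaseClickThroughRate  # name/path assumed


class LogisticClickThroughRate(BaseClickThroughRate):
    """Toy CTR model: a logistic function of a random linear score."""

    def __init__(self, n_dim: int, random_state: int = 12345):
        self.coef_ = np.random.RandomState(random_state).normal(size=n_dim)

    def calc_prob(self, contexts: np.ndarray) -> np.ndarray:  # signature assumed
        # contexts: (n_impressions, n_dim) features; returns CTRs in (0, 1).
        return 1.0 / (1.0 + np.exp(-contexts @ self.coef_))
```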
CustomizedRTBEnv (in env/wrapper_rtb.py) defines the action space and thus determines the range of the adjust rate used by Bidder.
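For instance, the wrapper might bound the adjust rate to a continuous interval; the constructor argument names below are hypothetical:

```python
from env.rtb import RTBEnv
from env.wrapper_rtb import CustomizedRTBEnv

# Hypothetical arguments: lower/upper bounds on the multiplicative
# adjust rate that Bidder applies to the base bid price.
env = CustomizedRTBEnv(
    original_env=RTBEnv(),
    action_min=0.1,
    action_max=10.0,
)
```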
Synthetic Dataset Collection
In the dataset/ directory, we provide a synthetic dataset generation module that is useful for both offline RL and OPE. SyntheticDataset (in dataset/synthetic.py) enables data collection on any RL environment that follows the OpenAI Gym interface.

The data collection policy is defined by combining a "head" (in policy/head.py) with the agents defined in d3rlpy; the head transforms a deterministic agent into a stochastic policy. In particular, the implemented heads include DiscreteEpsilonGreedyHead and DiscreteSoftmaxHead for discrete policies, and ContinuousGaussianHead and ContinuousTruncatedGaussianHead for continuous policies.
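Putting these together, data collection might look like the following sketch; the constructor arguments of the head and SyntheticDataset, as well as the obtain_trajectories method name, are assumptions:

```python
from d3rlpy.algos import DoubleDQN

from env.rtb import RTBEnv
from dataset.synthetic import SyntheticDataset
from policy.head import DiscreteEpsilonGreedyHead

env = RTBEnv()

# Turn a deterministic d3rlpy agent into a stochastic behavior policy.
behavior_policy = DiscreteEpsilonGreedyHead(
    base_policy=DoubleDQN(),  # argument names assumed
    n_actions=10,
    epsilon=0.3,
)

dataset = SyntheticDataset(env=env, behavior_policy=behavior_policy)
logged_feedback = dataset.obtain_trajectories(n_episodes=1000)  # method name assumed
```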
Offline Reinforcement Learning
Offline RL on the collected dataset can be performed with the algorithms implemented in d3rlpy. We also provide OnlineHead, which lets agents interact with the environment in an online manner.
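For example, one could fit conservative Q-learning (CQL) from d3rlpy on the logged data; how the logged feedback maps to d3rlpy's MDPDataset below is an assumption of this sketch:

```python
from d3rlpy.algos import DiscreteCQL
from d3rlpy.dataset import MDPDataset

# Assuming the logged feedback exposes flat transition arrays.
mdp_dataset = MDPDataset(
    observations=logged_feedback["state"],
    actions=logged_feedback["action"],
    rewards=logged_feedback["reward"],
    terminals=logged_feedback["done"],
)

cql = DiscreteCQL()
cql.fit(mdp_dataset, n_epochs=10)
```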
Off-Policy Evaluation
Beyond performing OPE for a given dataset and set of evaluation policies, we also evaluate and visualize the performance of the OPE estimators themselves (e.g., their accuracy) using the synthetic environment (currently, any environment with a fixed episode length is supported).

For now, we implement Direct Method (with Fitted Q Evaluation provided in d3rlpy), TrajectoryWiseImportanceSampling, StepWiseImportanceSampling, and DoublyRobust for both discrete and continuous policies, along with their self-normalized versions. (Note that we assume continuous policies to be deterministic.)
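As a reference for what these estimators compute, here is a minimal numpy sketch of trajectory-wise importance sampling and its self-normalized variant for undiscounted, fixed-length episodes (this follows the standard definition, not this package's internal API):

```python
import numpy as np

def trajectory_wise_is(
    rewards: np.ndarray,            # (n_episodes, T) per-step rewards
    behavior_pscore: np.ndarray,    # (n_episodes, T) behavior action probabilities
    evaluation_pscore: np.ndarray,  # (n_episodes, T) evaluation action probabilities
    self_normalize: bool = False,
) -> float:
    # One importance weight per trajectory: the product of per-step ratios.
    weights = np.prod(evaluation_pscore / behavior_pscore, axis=1)
    returns = rewards.sum(axis=1)
    if self_normalize:
        # Self-normalized IS divides by the sum of weights instead of n_episodes.
        return float((weights * returns).sum() / weights.sum())
    return float((weights * returns).mean())
```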
Quickstart Example

We provide the whole procedure (online RL -> data collection -> offline RL -> OPE) in the quickstart notebook.
References
Yuta Saito, Shunsuke Aihara, Megumi Matsutani, and Yusuke Narita. "Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation.", 2021.
Takuma Seno and Michita Imai. "d3rlpy: An Offline Deep Reinforcement Learning Library.", 2021.
Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. "Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems.", 2020.
Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. "Conservative Q-Learning for Offline Reinforcement Learning.", 2020.
Nathan Kallus and Angela Zhou. "Policy Evaluation and Optimization with Continuous Treatments.", 2019.
Nathan Kallus and Masatoshi Uehara. "Intrinsically Efficient, Stable, and Bounded Off-Policy Evaluation for Reinforcement Learning.", 2019.
Hoang Le, Cameron Voloshin, and Yisong Yue. "Batch Policy Learning under Constraints.", 2019.
Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor.", 2018.
Di Wu, Xiujun Chen, Xun Yang, Hao Wang, Qing Tan, Xiaoxun Zhang, Jian Xu, and Kun Gai. "Budget Constrained Bidding by Model-free Reinforcement Learning in Display Advertising.", 2018.
Jun Zhao, Guang Qiu, Ziyu Guan, Wei Zhao, and Xiaofei He. "Deep Reinforcement Learning for Sponsored Search Real-time Bidding.", 2018.
Wen-Yuan Zhu, Wen-Yueh Shih, Ying-Hsuan Lee, Wen-Chih Peng, and Jiun-Long Huang. "A Gamma-based Regression for Winning Price Estimation in Real-Time Bidding Advertising.", 2017.
Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. "OpenAI Gym.", 2016.
Nan Jiang and Lihong Li. "Doubly Robust Off-policy Value Evaluation for Reinforcement Learning.", 2016.
Philip S. Thomas and Emma Brunskill. "Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning.", 2016.
Hado van Hasselt, Arthur Guez, and David Silver. "Deep Reinforcement Learning with Double Q-learning.", 2015.
Adith Swaminathan and Thorsten Joachims. "The Self-Normalized Estimator for Counterfactual Learning.", 2015.
Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. "Doubly Robust Policy Evaluation and Optimization.", 2014.
Alex Strehl, John Langford, Sham Kakade, and Lihong Li. "Learning from Logged Implicit Exploration Data.", 2010.
Alina Beygelzimer and John Langford. "The Offset Tree for Learning with Partial Labels.", 2009.
Doina Precup, Richard S. Sutton, and Satinder P. Singh. "Eligibility Traces for Off-Policy Policy Evaluation.", 2000.
Checklist
Comments