hakuhodo-technologies / scope-rl

SCOPE-RL: A python library for offline reinforcement learning, off-policy evaluation, and selection
https://scope-rl.readthedocs.io/en/latest/
Apache License 2.0

[Initial] Implementing Pipeline including RTB environment, data collection, offline RL, and OPE modules #3

Closed aiueola closed 2 years ago

aiueola commented 2 years ago

Type of change

Description

Implemented the whole package.

  1. RTB Synthetic Environment

    • In the env/ directory, we provide an RTB simulation with the OpenAI Gym interface.
    • The base environment, RTBEnv, is in env/rtb.py. RTBEnv is a highly configurable environment. It calls RTBSyntheticSimulator (in env/simulation/rtb_synthetic.py) and Bidder (in env/bidder.py) to simulate the auction environment and the bidding function, respectively.
    • Users can customize RTBSyntheticSimulator through the configurations of RTBEnv, which include WinningPriceDistribution, ClickThroughRate, and ConversionRate. To customize these classes, refer to env/simulator/function.py and override the corresponding base class.
    • CustomizedRTBEnv (in env/wrapper_rtb.py) defines the action space and thus determines the range of the adjust rate used in Bidder (see the environment sketch after this list).
  2. Synthetic Dataset Collection

    • In the dataset/ directory, we provide a synthetic dataset generation module that is useful for both offline RL and OPE.
    • SyntheticDataset (in dataset/synthetic.py) enables data collection on any RL environment that follows the OpenAI Gym interface.
    • The data collection policy is defined by combining a "head" (in policy/head.py) with an agent from d3rlpy. The head transforms a deterministic agent into a stochastic policy.
    • Specifically, the implemented heads include DiscreteEpsilonGreedyHead and DiscreteSoftmaxHead for discrete policies, and ContinuousGaussianHead and ContinuousTruncatedGaussianHead for continuous policies (see the data collection sketch after this list).
  3. Offline Reinforcement Learning

    • We use the algorithm implementations provided in d3rlpy.
    • Since d3rlpy offers only limited support for OPE, we aim to bridge the pipeline from data collection to offline RL, and finally to OPE.
    • In addition, we provide OnlineHead as an interface for online (interactive) rollouts (see the offline RL sketch after this list).
  4. Off-Policy Evaluation

    • In the ope/ directory, we provide OPE modules with an OpenBanditPipeline-style interface.
    • Beyond performing OPE on a given dataset and evaluation policies, we also evaluate and visualize the performance (e.g., accuracy) of the OPE estimators themselves using the synthetic environment (currently, any environment with a fixed episode length is supported).
    • For now, we implement the Direct Method (with Fitted Q Evaluation provided by d3rlpy), TrajectoryWiseImportanceSampling, StepWiseImportanceSampling, and DoublyRobust for both discrete and continuous policies, along with their self-normalized variants. (Note that continuous policies are assumed to be deterministic.) See the OPE sketch after this list.
  5. Quickstart Example
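
As quickstart-style sketches for items 1-4 above, the snippets below walk through the pipeline end to end. First, a minimal interaction sketch for the environment: import paths mirror the module layout described in item 1, and the `original_env` constructor argument is an illustrative assumption rather than the exact signature.

```python
# Minimal rollout sketch for RTBEnv / CustomizedRTBEnv.
# Import paths follow the module layout above; `original_env` is an
# illustrative argument name, not the confirmed signature.
from env.rtb import RTBEnv
from env.wrapper_rtb import CustomizedRTBEnv

env = CustomizedRTBEnv(original_env=RTBEnv())  # default RTB configuration

obs = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # random adjust rate for the bidder
    obs, reward, done, info = env.step(action)
```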
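Next, a sketch of synthetic data collection (item 2). The head and dataset class names are taken from this PR, the d3rlpy agent uses the v1.x-style constructor, and the constructor arguments and the data collection method name are illustrative placeholders.

```python
# Data collection sketch: wrap a deterministic d3rlpy agent with an
# epsilon-greedy head to obtain a stochastic behavior (logging) policy.
from d3rlpy.algos import DoubleDQN

from dataset.synthetic import SyntheticDataset
from policy.head import DiscreteEpsilonGreedyHead

behavior_policy = DiscreteEpsilonGreedyHead(
    base_policy=DoubleDQN(),        # deterministic agent from d3rlpy (argument name assumed)
    n_actions=env.action_space.n,   # discrete action space of CustomizedRTBEnv
    epsilon=0.3,                    # exploration rate of the logging policy
)

dataset = SyntheticDataset(env=env, behavior_policy=behavior_policy)
logged_dataset = dataset.obtain_trajectories(n_episodes=1000)  # method name assumed
```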
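For offline RL (item 3), the logged data can be converted into d3rlpy's MDPDataset and used to fit an offline agent. The MDPDataset and fit() calls follow d3rlpy's v1.x API; the key names of the logged dataset are assumptions.

```python
# Offline RL sketch using d3rlpy (v1.x API). The logged_dataset keys
# ("state", "action", "reward", "done") are illustrative assumptions.
import numpy as np
from d3rlpy.dataset import MDPDataset
from d3rlpy.algos import DiscreteCQL

mdp_dataset = MDPDataset(
    observations=np.asarray(logged_dataset["state"]),
    actions=np.asarray(logged_dataset["action"]),
    rewards=np.asarray(logged_dataset["reward"]),
    terminals=np.asarray(logged_dataset["done"]),
)

cql = DiscreteCQL()
cql.fit(mdp_dataset, n_steps=10000)
```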
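Finally, a sketch of OPE (item 4) in the OpenBanditPipeline-style interface mentioned above. The estimator class names come from this PR, while the module paths, constructor arguments, and the estimate_policy_values() call are illustrative assumptions.

```python
# OPE sketch: evaluate the offline-learned policy on the logged dataset
# with importance-sampling-based estimators. Signatures are assumptions.
from ope.ope import OffPolicyEvaluation    # module path assumed
from ope.estimator import (                # module path assumed
    TrajectoryWiseImportanceSampling,
    StepWiseImportanceSampling,
    DoublyRobust,
)

# Evaluation policy: the learned agent wrapped as a greedy (epsilon=0) head.
evaluation_policy = DiscreteEpsilonGreedyHead(
    base_policy=cql,
    n_actions=env.action_space.n,
    epsilon=0.0,
)

ope = OffPolicyEvaluation(
    logged_dataset=logged_dataset,
    ope_estimators=[
        TrajectoryWiseImportanceSampling(),
        StepWiseImportanceSampling(),
        DoublyRobust(),
    ],
)
estimated_policy_values = ope.estimate_policy_values(evaluation_policy)
```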

References

Checklist

Comments

aiueola commented 2 years ago

@m-takeuchi-negocia Thank you for the thorough comments! I will refactor the code based on your review.

aiueola commented 2 years ago

@k-kawakami213 Thank you for the feedback! I will fix the issues you've pointed out.