hakuhodo-technologies / scope-rl

SCOPE-RL: A python library for offline reinforcement learning, off-policy evaluation, and selection
https://scope-rl.readthedocs.io/en/latest/
Apache License 2.0

[Initial] Implementing Pipeline including RTB environment, data collection, offline RL, and OPE modules #3

Closed aiueola closed 2 years ago

aiueola commented 2 years ago

Type of change

Description

Implemented the whole package.

  1. RTB Synthetic Environment

    • In the env/ directory, we provide an RTB simulation with the OpenAI Gym interface.
    • The base environment, RTBEnv, is in env/rtb.py. RTBEnv is a highly configurable environment. It calls RTBSyntheticSimulator (in env/simulation/rtb_synthetic.py) and Bidder (in env/bidder.py) to simulate the auction environment and the bidding function, respectively.
    • Users can customize RTBSyntheticSimulator through the configurations of RTBEnv, which include WinningPriceDistribution, ClickThroughRate, and ConversionRate. To customize these classes, refer to env/simulator/function.py and override the corresponding base class.
    • CustomizedRTBEnv (in env/wrapper_rtb.py) defines the action space and thus determines the range of the adjust rate used in Bidder (see the environment sketch after this list).
  2. Synthetic Dataset Collection

    • In the dataset/ directory, we provide a synthetic dataset generation module that is useful for both offline RL and OPE.
    • SyntheticDataset (in dataset/synthetic.py) enables data collection on any RL environment that follows the OpenAI Gym interface.
    • The data collection policy is defined by combining a "head" (in policy/head.py) with an agent from d3rlpy. The head transforms a deterministic agent into a stochastic policy.
    • Specifically, the implemented heads include DiscreteEpsilonGreedyHead and DiscreteSoftmaxHead for discrete policies, and ContinuousGaussianHead and ContinuousTruncatedGaussianHead for continuous policies (see the data collection sketch after this list).
  3. Offline Reinforcement Learning

    • We use the algorithm implementations provided in d3rlpy.
    • Since d3rlpy offers only limited support for OPE, we aim to bridge the pipeline from data collection to offline RL, and finally to OPE.
    • In addition, we provide OnlineHead as an interface for online (interactive) rollouts (see the offline RL sketch after this list).
  4. Off-Policy Evaluation

    • In the ope/ directory, we provide OPE modules with an OpenBanditPipeline-style interface.
    • Beyond performing OPE on a given dataset and evaluation policies, we also evaluate and visualize the performance (e.g., accuracy) of the OPE estimators themselves using the synthetic environment (currently, any environment with a fixed episode length is supported).
    • For now, we implement the Direct Method (with Fitted Q Evaluation provided by d3rlpy), TrajectoryWiseImportanceSampling, StepWiseImportanceSampling, and DoublyRobust for both discrete and continuous policies, along with their self-normalized variants. (Note that continuous policies are assumed to be deterministic.) See the OPE sketch after this list.
  5. Quickstart Example
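
As quickstart-style sketches for items 1-4 above, the snippets below walk through the pipeline end to end. First, a minimal interaction sketch for the environment: import paths mirror the module layout described in item 1, and the `original_env` constructor argument is an illustrative assumption rather than the exact signature.

```python
# Minimal rollout sketch for RTBEnv / CustomizedRTBEnv.
# Import paths follow the module layout above; `original_env` is an
# illustrative argument name, not the confirmed signature.
from env.rtb import RTBEnv
from env.wrapper_rtb import CustomizedRTBEnv

env = CustomizedRTBEnv(original_env=RTBEnv())  # default RTB configuration

obs = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # random adjust rate for the bidder
    obs, reward, done, info = env.step(action)
```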
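Next, a sketch of synthetic data collection (item 2). The head and dataset class names are taken from this PR, the d3rlpy agent uses the v1.x-style constructor, and the constructor arguments and the data collection method name are illustrative placeholders.

```python
# Data collection sketch: wrap a deterministic d3rlpy agent with an
# epsilon-greedy head to obtain a stochastic behavior (logging) policy.
from d3rlpy.algos import DoubleDQN

from dataset.synthetic import SyntheticDataset
from policy.head import DiscreteEpsilonGreedyHead

behavior_policy = DiscreteEpsilonGreedyHead(
    base_policy=DoubleDQN(),        # deterministic agent from d3rlpy (argument name assumed)
    n_actions=env.action_space.n,   # discrete action space of CustomizedRTBEnv
    epsilon=0.3,                    # exploration rate of the logging policy
)

dataset = SyntheticDataset(env=env, behavior_policy=behavior_policy)
logged_dataset = dataset.obtain_trajectories(n_episodes=1000)  # method name assumed
```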
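For offline RL (item 3), the logged data can be converted into d3rlpy's MDPDataset and used to fit an offline agent. The MDPDataset and fit() calls follow d3rlpy's v1.x API; the key names of the logged dataset are assumptions.

```python
# Offline RL sketch using d3rlpy (v1.x API). The logged_dataset keys
# ("state", "action", "reward", "done") are illustrative assumptions.
import numpy as np
from d3rlpy.dataset import MDPDataset
from d3rlpy.algos import DiscreteCQL

mdp_dataset = MDPDataset(
    observations=np.asarray(logged_dataset["state"]),
    actions=np.asarray(logged_dataset["action"]),
    rewards=np.asarray(logged_dataset["reward"]),
    terminals=np.asarray(logged_dataset["done"]),
)

cql = DiscreteCQL()
cql.fit(mdp_dataset, n_steps=10000)
```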
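Finally, a sketch of OPE (item 4) in the OpenBanditPipeline-style interface mentioned above. The estimator class names come from this PR, while the module paths, constructor arguments, and the estimate_policy_values() call are illustrative assumptions.

```python
# OPE sketch: evaluate the offline-learned policy on the logged dataset
# with importance-sampling-based estimators. Signatures are assumptions.
from ope.ope import OffPolicyEvaluation    # module path assumed
from ope.estimator import (                # module path assumed
    TrajectoryWiseImportanceSampling,
    StepWiseImportanceSampling,
    DoublyRobust,
)

# Evaluation policy: the learned agent wrapped as a greedy (epsilon=0) head.
evaluation_policy = DiscreteEpsilonGreedyHead(
    base_policy=cql,
    n_actions=env.action_space.n,
    epsilon=0.0,
)

ope = OffPolicyEvaluation(
    logged_dataset=logged_dataset,
    ope_estimators=[
        TrajectoryWiseImportanceSampling(),
        StepWiseImportanceSampling(),
        DoublyRobust(),
    ],
)
estimated_policy_values = ope.estimate_policy_values(evaluation_policy)
```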

References

Checklist

Comments

aiueola commented 2 years ago

@m-takeuchi-negocia Thank you for the thorough comments! I will refactor the code based on your review.

aiueola commented 2 years ago

@k-kawakami213 Thank you for the feedback! I will fix the issues you've pointed out.