tensorflow / agents

TF-Agents: A reliable, scalable and easy to use TensorFlow library for Contextual Bandits and Reinforcement Learning.
Apache License 2.0

Unable to call action on a loaded Bandit Policy #537

Closed: pm3310 closed this issue 3 years ago

pm3310 commented 3 years ago

Hi team,

Here's the code that trains and saves a Bandit policy:

import numpy as np

import tensorflow as tf
from tf_agents.bandits.agents import lin_ucb_agent
from tf_agents.bandits.environments import stationary_stochastic_py_environment as sspe
from tf_agents.bandits.metrics import tf_metrics
from tf_agents.drivers import dynamic_step_driver
from tf_agents.environments import tf_py_environment
from tf_agents.policies.policy_saver import PolicySaver
from tf_agents.replay_buffers import tf_uniform_replay_buffer
from tf_agents.trajectories import time_step as ts

import matplotlib.pyplot as plt
from tf_agents.specs import tensor_spec

batch_size = 2  # @param
arm0_param = [-3, 0, 1, -2]  # @param
arm1_param = [1, -2, 3, 0]  # @param
arm2_param = [0, 0, 1, 1]  # @param

def context_sampling_fn(batch_size):
    """Contexts from [-10, 10]^4."""
    def _context_sampling_fn():
        return np.random.randint(-10, 10, [batch_size, 4]).astype(np.float32)

    return _context_sampling_fn

class LinearNormalReward(object):
    """A class that acts as linear reward function when called."""
    def __init__(self, theta, sigma):
        self.theta = theta
        self.sigma = sigma

    def __call__(self, x):
        mu = np.dot(x, self.theta)
        return np.random.normal(mu, self.sigma)

arm0_reward_fn = LinearNormalReward(arm0_param, 1)
arm1_reward_fn = LinearNormalReward(arm1_param, 1)
arm2_reward_fn = LinearNormalReward(arm2_param, 1)

environment = tf_py_environment.TFPyEnvironment(
    sspe.StationaryStochasticPyEnvironment(
        context_sampling_fn(batch_size),
        [arm0_reward_fn, arm1_reward_fn, arm2_reward_fn],
        batch_size=batch_size
    )
)

observation_spec = tensor_spec.TensorSpec([4], tf.float32)
time_step_spec = ts.time_step_spec(observation_spec)
action_spec = tensor_spec.BoundedTensorSpec(dtype=tf.int32, shape=(), minimum=0, maximum=2)

agent = lin_ucb_agent.LinearUCBAgent(time_step_spec=time_step_spec, action_spec=action_spec)

def compute_optimal_reward(observation):
    expected_reward_for_arms = [
      tf.linalg.matvec(observation, tf.cast(arm0_param, dtype=tf.float32)),
      tf.linalg.matvec(observation, tf.cast(arm1_param, dtype=tf.float32)),
      tf.linalg.matvec(observation, tf.cast(arm2_param, dtype=tf.float32))
    ]

    optimal_action_reward = tf.reduce_max(expected_reward_for_arms, axis=0)
    return optimal_action_reward

regret_metric = tf_metrics.RegretMetric(compute_optimal_reward)

num_iterations = 90  # @param
steps_per_loop = 1  # @param

replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=agent.policy.trajectory_spec,
    batch_size=batch_size,
    max_length=steps_per_loop
)

observers = [replay_buffer.add_batch, regret_metric]

driver = dynamic_step_driver.DynamicStepDriver(
    env=environment,
    policy=agent.collect_policy,
    num_steps=steps_per_loop * batch_size,
    observers=observers
)

regret_values = []

for _ in range(num_iterations):
    driver.run()
    loss_info = agent.train(replay_buffer.gather_all())
    replay_buffer.clear()
    regret_values.append(regret_metric.result())

policy_saver = PolicySaver(policy=agent.collect_policy)
policy_saver.save(export_dir='my_awesome_policy')

plt.plot(regret_values)
plt.ylabel('Average Regret')
plt.xlabel('Number of Iterations')

plt.show()

And here is the code that loads the previously trained policy:

import tensorflow as tf
from tf_agents.policies import policy_loader
from tf_agents.trajectories import time_step

policy = policy_loader.load('my_awesome_policy')

time_step_obj = time_step.TimeStep(
    discount=tf.convert_to_tensor([1., 1.]),
    observation=tf.convert_to_tensor([[3., -10, 6., -8.], [4., 3., 1., 0.]]),
    reward=tf.convert_to_tensor([0., 0.]),
    step_type=tf.constant(time_step.StepType.MID, dtype=tf.int32, shape=[2], name='step_type')
)

policy.action(time_step=time_step_obj, policy_state=policy.get_initial_state(2))

However, the line policy.action(time_step=time_step_obj, policy_state=policy.get_initial_state(2)) generates the following error:

Traceback (most recent call last):
  File "/Users/pntompos/Documents/repos/tf-agents-poc/src/3_load_contextual_bandit.py", line 14, in <module>
    policy.action(time_step=time_step_obj, policy_state=policy.get_initial_state(2))
  File "/Users/pntompos/.virtualenvs/tf-agents-poc/lib/python3.7/site-packages/tf_agents/policies/py_policy.py", line 156, in action
    return self._action(time_step, policy_state)
  File "/Users/pntompos/.virtualenvs/tf-agents-poc/lib/python3.7/site-packages/tf_agents/policies/py_tf_eager_policy.py", line 73, in _action
    policy_step = self._policy_action_fn(time_step, policy_state)
  File "/Users/pntompos/.virtualenvs/tf-agents-poc/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 780, in __call__
    result = self._call(*args, **kwds)
  File "/Users/pntompos/.virtualenvs/tf-agents-poc/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 823, in _call
    self._initialize(args, kwds, add_initializers_to=initializers)
  File "/Users/pntompos/.virtualenvs/tf-agents-poc/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 697, in _initialize
    *args, **kwds))
  File "/Users/pntompos/.virtualenvs/tf-agents-poc/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2855, in _get_concrete_function_internal_garbage_collected
    graph_function, _, _ = self._maybe_define_function(args, kwargs)
  File "/Users/pntompos/.virtualenvs/tf-agents-poc/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 3213, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/Users/pntompos/.virtualenvs/tf-agents-poc/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 3075, in _create_graph_function
    capture_by_value=self._capture_by_value),
  File "/Users/pntompos/.virtualenvs/tf-agents-poc/lib/python3.7/site-packages/tensorflow/python/framework/func_graph.py", line 986, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/Users/pntompos/.virtualenvs/tf-agents-poc/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 600, in wrapped_fn
    return weak_wrapped_fn().__wrapped__(*args, **kwds)
  File "/Users/pntompos/.virtualenvs/tf-agents-poc/lib/python3.7/site-packages/tensorflow/python/saved_model/function_deserialization.py", line 257, in restored_function_body
    "\n\n".join(signature_descriptions)))
ValueError: Could not find matching function to call loaded from the SavedModel. Got:
  Positional arguments (2 total):
    * TimeStep(step_type=<tf.Tensor 'time_step:0' shape=(1, 2) dtype=int32>, reward=<tf.Tensor 'time_step_1:0' shape=(1, 2) dtype=float32>, discount=<tf.Tensor 'time_step_2:0' shape=(1, 2) dtype=float32>, observation=<tf.Tensor 'time_step_3:0' shape=(1, 2, 4) dtype=float32>)
    * ()
  Keyword arguments: {}

Expected these arguments to match one of the following 2 option(s):

Option 1:
  Positional arguments (2 total):
    * TimeStep(step_type=TensorSpec(shape=(None,), dtype=tf.int32, name='time_step/step_type'), reward=TensorSpec(shape=(None,), dtype=tf.float32, name='time_step/reward'), discount=TensorSpec(shape=(None,), dtype=tf.float32, name='time_step/discount'), observation=TensorSpec(shape=(None, 4), dtype=tf.float32, name='time_step/observation'))
    * ()
  Keyword arguments: {}

Option 2:
  Positional arguments (2 total):
    * TimeStep(step_type=TensorSpec(shape=(None,), dtype=tf.int32, name='step_type'), reward=TensorSpec(shape=(None,), dtype=tf.float32, name='reward'), discount=TensorSpec(shape=(None,), dtype=tf.float32, name='discount'), observation=TensorSpec(shape=(None, 4), dtype=tf.float32, name='observation'))
    * ()
  Keyword arguments: {}

My goal is to have a Bandit in a RESTful endpoint to sample from and train in an online fashion. Do you have any best practices on how to deploy Bandits as a RESTful service?

pm3310 commented 3 years ago

I used policy = tf.saved_model.load('my_awesome_policy') instead, and it worked.
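For reference, this is roughly what the working path looks like (a minimal sketch, assuming the export directory and specs from the training script above: observations are [batch, 4] floats and batch_size is 2). Judging by the (1, 2, ...) shapes in the traceback, it looks like the policy_loader wrapper adds its own outer batch dimension on top of the already-batched tensors, whereas the SavedModel loaded with tf.saved_model.load expects tensors with just the batch dimension, matching the specs listed in the error message:

import tensorflow as tf
from tf_agents.trajectories import time_step as ts

# Load the SavedModel exported by PolicySaver.
policy = tf.saved_model.load('my_awesome_policy')

batch_size = 2
time_step_obj = ts.TimeStep(
    step_type=tf.constant(ts.StepType.MID, dtype=tf.int32, shape=[batch_size]),
    reward=tf.zeros([batch_size], dtype=tf.float32),
    discount=tf.ones([batch_size], dtype=tf.float32),
    observation=tf.constant([[3., -10., 6., -8.], [4., 3., 1., 0.]], dtype=tf.float32),
)

# Positional arguments match the saved action() signature.
action_step = policy.action(time_step_obj, policy.get_initial_state(batch_size))
print(action_step.action)  # chosen arm for each batch element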

tfboyd commented 3 years ago

Assigned to bandits team. They can close unless they have a comment.

pm3310 commented 3 years ago

Hey @bartokg, I have 2 questions:

bartokg commented 3 years ago

Hi Pavlos, happy to see you found a working loader function! I have never used policy_loader, so it's hard to say why it fails. The documentation of policy_saver.PolicySaver (https://github.com/tensorflow/agents/blob/master/tf_agents/policies/policy_saver.py) recommends using

saved_policy = tf.compat.v2.saved_model.load('policy_0')
policy_state = saved_policy.get_initial_state(batch_size=3)

as you also suggest. I assume the compat.v2 can be omitted now.

pm3310 commented 3 years ago

Thank you @bartokg. Any suggestions for putting a bandit agent in production for continuous learning?

bartokg commented 3 years ago

In general, what you want is a trainer that consumes data (via the agent's train() function) and saves the model periodically; another binary can then periodically load the latest model (policy) and call action(). This is all doable by hand. If you want a fully productionized solution, you can try TensorFlow Extended (TFX), which integrates well with TF-Agents Bandits.
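To make that concrete, here is a rough sketch of that loop built on the objects from the training script above (the export path 'policy_exports/latest' and the save cadence are arbitrary choices for illustration, not TF-Agents requirements):

# --- trainer process: collect, train, re-export periodically ---
from tf_agents.policies.policy_saver import PolicySaver

saver = PolicySaver(agent.policy)                  # agent, driver, replay_buffer as defined above
for i in range(num_iterations):
    driver.run()                                   # gather fresh experience with the collect policy
    agent.train(replay_buffer.gather_all())
    replay_buffer.clear()
    if i % 10 == 0:
        saver.save('policy_exports/latest')        # overwrite (or version) the export

# --- serving process: reload the latest export and answer action requests ---
import tensorflow as tf
policy = tf.saved_model.load('policy_exports/latest')
action_step = policy.action(time_step_obj, policy.get_initial_state(batch_size))
# time_step_obj is built from the incoming request's contexts, as in the loading
# example above; re-run tf.saved_model.load periodically to pick up new exports.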