It kind of depends on where you need them, but in general you can add them to the extra fetches when computing actions. That way the values will be put into the sample batches.
Example for the logits from IMPALA: https://github.com/ray-project/ray/blob/master/python/ray/rllib/agents/impala/vtrace_policy_graph.py#L272
You can also add a method that computes this given the session, for example compute_td_error(), which you can just call on the policy object through the policy map: https://github.com/ray-project/ray/search?q=compute_td_error&unscoped_q=compute_td_error
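A minimal sketch of both suggestions (the attribute names self.q_values / self.observations, the policy ID, and the agent / obs_batch variables are placeholders; replace them with whatever your policy graph actually defines, and check the linked files for your Ray version):

    # Methods added to a TFPolicyGraph subclass (sketch only, shown out of context).
    def extra_compute_action_fetches(self):
        # Extra tensors returned here are fetched together with the actions and
        # end up in the sample batches, like behaviour_logits in the IMPALA example.
        return dict(TFPolicyGraph.extra_compute_action_fetches(self),
                    **{"q_values": self.q_values})

    def compute_q_values(self, obs):
        # A custom method in the style of compute_td_error(); callable directly
        # on the policy object once you have a handle to it.
        return self.sess.run(self.q_values,
                             feed_dict={self.observations: obs})

    # After training, you can reach the policy through the policy map
    # (assuming `agent` is your trained rllib Agent; the ID may be "default"
    # or "default_policy" depending on the Ray version):
    policy = agent.local_evaluator.policy_map["default_policy"]
    q = policy.compute_q_values(obs_batch)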
Thanks for the quick answer. What I would like to do is plot the Q function once training has finished on my simple, self-made environment, which has a one-dimensional state and a one-dimensional action.
import gym
import numpy as np
from gym import spaces


class ScalarEnv(gym.Env):
    def __init__(self):
        self.sdim = 1
        self.state_threshold = np.array([2.5])
        self.action_threshold = 10
        self.done = False
        self.maxsteps = 200
        self.action_space = spaces.Box(low=-self.action_threshold,
                                       high=self.action_threshold,
                                       shape=(1,),
                                       dtype=np.float32)
        self.observation_space = spaces.Box(low=-self.state_threshold,
                                            high=self.state_threshold,
                                            dtype=np.float32)
        self.reset()

    def reward(self, action):
        # Quadratic cost on the current state and the action
        return -(self.state[0] ** 2) - 0.1 * (action[0] ** 2)

    def reset(self):
        self.state = np.array([5.0 * np.random.random() - 2.5])
        return self.state

    def step(self, action):
        state = self.state + action
        done = self.checkdone(state)
        reward = self.reward(action)  # reward is based on the state before the transition
        self.state = state
        return self.state, reward, done, {}

    def checkdone(self, state):
        # Episode terminates once the state leaves [-2.5, 2.5]
        return bool(state[0] < -self.state_threshold[0]
                    or state[0] > self.state_threshold[0])
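For reference, a quick random-action rollout to sanity-check the environment:

    env = ScalarEnv()
    obs = env.reset()
    for _ in range(5):
        action = env.action_space.sample()
        obs, reward, done, info = env.step(action)
        print(obs, reward, done)
        if done:
            obs = env.reset()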
So I think the best approach would be to add my own method to the policy graph, which I can then call through the policy map once training has finished.
To access the Q values, I started to write the following function, but I am not sure what to pass in as the network output (___). I think I have to call the _build_q_network method, or am I totally wrong with this guess?
def compute_q_values(self, obs_t, observation_space, act_t):
    # with tf.variable_scope(Q_SCOPE) as scope:
    #     q, _ = self._build_q_network(obs_t, observation_space, act_t)
    #     self.q_func_vars = _scope_vars(scope.name)
    q_values = self.sess.run(
        _______,  # <- the network output tensor I don't know how to reference
        feed_dict={self.obs_t: [np.array(ob) for ob in obs_t],
                   self.act_t: act_t})
    return q_values
My QNetwork (one hidden layer with 10 neurons):
class QNetwork(object):
    def __init__(self,
                 model,
                 action_inputs,
                 hiddens=[10],
                 activation='tanh'):
        # Concatenate the model's last layer with the action inputs and pass
        # the result through the hidden layers down to a scalar Q value.
        q_out = tf.concat([model.last_layer, action_inputs], axis=1)
        activation = tf.nn.__dict__[activation]
        for hidden in hiddens:
            q_out = layers.fully_connected(
                q_out, num_outputs=hidden, activation_fn=activation)
        self.value = layers.fully_connected(
            q_out, num_outputs=1, activation_fn=None)
        self.model = model
Hey @t0biasm, I think what you're looking for is q_t here: https://github.com/ray-project/ray/blob/master/python/ray/rllib/agents/ddpg/ddpg_policy_graph.py#L270
Right now it's a local variable, but you could just assign self.q_t = q_t and then pass it to self.sess.run().
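For reference, a sketch of how that could look once self.q_t is assigned; the placeholder names self.obs_t / self.act_t mirror the draft above, and `agent` plus the "default_policy" ID are assumptions, so check ddpg_policy_graph.py in your Ray version:

    # Added to DDPGPolicyGraph, after assigning self.q_t = q_t in __init__:
    def compute_q_values(self, obs_t, act_t):
        # Evaluate the critic Q(s, a) for a batch of MDP states and actions.
        return self.sess.run(
            self.q_t,
            feed_dict={self.obs_t: [np.array(ob) for ob in obs_t],
                       self.act_t: act_t})

    # Hypothetical usage for the 1-D plot, evaluating Q on a state/action grid:
    policy = agent.local_evaluator.policy_map["default_policy"]
    S, A = np.meshgrid(np.linspace(-2.5, 2.5, 50), np.linspace(-10.0, 10.0, 50))
    Q = policy.compute_q_values(S.reshape(-1, 1), A.reshape(-1, 1)).reshape(S.shape)

Q can then be plotted over the grid with e.g. matplotlib's contourf or plot_surface.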
Great, thanks for your help @ericl
Describe the problem
What is the best way to access the Q values of a state (state meaning a state in the MDP, not the state of the model) of a DDPG agent?
I suppose that I have to modify the policy graph class in the source code to get the Q values, but I actually have no idea how to do this. Or is there an easy way to access the Q values via the policy_map?
I appreciate any help.