ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[rllib] Best way to access Q values of a DDPG Agent #3505

Closed: t0biasm closed this issue 5 years ago

t0biasm commented 5 years ago

Describe the problem

What is the best way to access the Q values of a state (meaning a state in the MDP, not the state of the model) of a DDPG agent?

I suppose I have to modify the policy graph class in the source code to get the Q values, but I actually have no idea how to do this. Or is there an easy way to access the Q values through the policy_map?

I appreciate any help.

ericl commented 5 years ago

It kind of depends on where you need them, but in general you can add to the extra fetches when computing actions. That way the values will be put in the sample batches.

Example for logits from impala: https://github.com/ray-project/ray/blob/master/python/ray/rllib/agents/impala/vtrace_policy_graph.py#L272
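
For DDPG, a minimal sketch of that idea could look like the following (the subclass name is made up, and it assumes the Q output tensor has been kept on the policy graph as self.q_t; see further down in this thread):

from ray.rllib.agents.ddpg.ddpg_policy_graph import DDPGPolicyGraph

class QValueDDPGPolicyGraph(DDPGPolicyGraph):
    def extra_compute_action_fetches(self):
        # Extra tensors that get evaluated together with each sampled action;
        # their values end up in the collected sample batches under these keys.
        fetches = super(QValueDDPGPolicyGraph, self).extra_compute_action_fetches()
        fetches["q_values"] = self.q_t  # assumes self.q_t was set in __init__
        return fetches

You would then point the agent at this subclass instead of the stock DDPGPolicyGraph (or just add the override directly in ddpg_policy_graph.py).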

You can also add a method that computes this given the session, for example compute_td_error(), which you can just call on the policy object through the policy map. https://github.com/ray-project/ray/search?q=compute_td_error&unscoped_q=compute_td_error
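
In code, getting at the policy object that way looks roughly like this (single-agent case; the attribute names local_evaluator and policy_map are from the current RLlib and may differ in other versions, and Pendulum-v0 is just a stand-in continuous-control env):

import ray
from ray.rllib.agents.ddpg import DDPGAgent

ray.init()
agent = DDPGAgent(env="Pendulum-v0")
agent.train()

# The local evaluator keeps one policy graph per policy id; in the
# single-agent case the map has exactly one entry.
policy = list(agent.local_evaluator.policy_map.values())[0]
# Any method defined on the policy graph, e.g. compute_td_error() or a
# custom compute_q_values(), can now be called directly on this object.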

t0biasm commented 5 years ago

Thanks for the quick answer. What I would like to do is plot the Q function once training has finished on my simple self-made environment, which has a one-dimensional state and a one-dimensional action.

import gym
import numpy as np
from gym import spaces


class ScalarEnv(gym.Env):
    def __init__(self):
        self.sdim = 1
        self.state_threshold = np.array([2.5])
        self.action_threshold = 10
        self.done = False
        self.maxsteps = 200
        self.action_space = spaces.Box(low=-self.action_threshold,
                                       high=self.action_threshold,
                                       shape=(1,),
                                       dtype=np.float32)
        self.observation_space = spaces.Box(low=-self.state_threshold,
                                            high=self.state_threshold,
                                            dtype=np.float32)
        self.reset()

    def reward(self, action):
        r = -(self.state[0] ** 2) - 0.1 * (action[0] ** 2)
        return r

    def reset(self):
        self.state = np.array([5.0 * np.random.random() - 2.5])
        return self.state

    def step(self, action):
        state = self.state
        state = state + action
        done = self.checkdone(state)
        reward = self.reward(action)
        self.state = state
        return self.state, reward, done, {}

    def checkdone(self, state):
        # Episode ends once the state leaves the [-threshold, threshold] box.
        finished = bool(np.any(state < -self.state_threshold) or
                        np.any(state > self.state_threshold))
        return finished

So I think adding my own method to the policy graph, which can be called through the policy map once training has finished, would be best. To access the Q values, I started to create the following function (where I was not sure what to put in for the network output, ___). I think I have to call the _build_q_network method, or am I totally wrong with this guess?

def compute_q_values(self, obs_t, observation_space, act_t):
    # with tf.variable_scope(Q_SCOPE) as scope:
    #    q, _ = self._build_q_network(obs_t, observation_space, act_t) 
    #    self.q_func_vars = _scope_vars(scope.name)
    q_values = self.sess.run(
        _______,  # the Q output tensor of the network goes here
        feed_dict={self.obs_t: [np.array(ob) for ob in obs_t],
                   self.act_t: act_t})
    return q_values

My QNetwork (one hidden layer with 10 neurons):

import tensorflow as tf
import tensorflow.contrib.layers as layers


class QNetwork(object):
    def __init__(self,
                 model,
                 action_inputs,
                 hiddens=[10],
                 activation='tanh'):
        # Q(s, a): concatenate the last hidden layer of the observation model
        # with the action inputs and feed them through the hidden layers.
        q_out = tf.concat([model.last_layer, action_inputs], axis=1)
        activation = tf.nn.__dict__[activation]
        for hidden in hiddens:
            q_out = layers.fully_connected(
                q_out, num_outputs=hidden, activation_fn=activation)
        # Scalar Q value output.
        self.value = layers.fully_connected(
            q_out, num_outputs=1, activation_fn=None)
        self.model = model

ericl commented 5 years ago

Hey @t0biasm, I think what you're looking for is q_t here? https://github.com/ray-project/ray/blob/master/python/ray/rllib/agents/ddpg/ddpg_policy_graph.py#L270

Right now it's a local variable but you could just assign self.q_t = q_t and then you can pass it to self.sess.run().
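
Putting that together, a rough sketch (untested; the placeholder names self.obs_t and self.act_t follow your snippet above and the existing compute_td_error, so double-check them against the actual policy graph):

import numpy as np

# In DDPGPolicyGraph.__init__, right after q_t is built:
#     self.q_t = q_t

# New method on DDPGPolicyGraph:
def compute_q_values(self, obs_t, act_t):
    # Evaluate the Q tensor for a batch of observations and actions.
    return self.sess.run(
        self.q_t,
        feed_dict={
            self.obs_t: [np.array(ob) for ob in obs_t],
            self.act_t: act_t,
        })

After training you can then grab the policy from the policy map as above and call compute_q_values on a grid of (state, action) pairs; since both are one-dimensional in your environment, you can plot the returned values directly.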

t0biasm commented 5 years ago

Great, thanks for your help @ericl