ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[rllib] Modifying DQN to learn over multiple horizons #15683

Closed rfali closed 2 years ago

rfali commented 3 years ago

Hi,

I posted my question 8 days ago on the Ray discourse [here] but didn't hear back, so I am opening an issue here, as I am getting errors on my custom model and I am not sure what is wrong (can I get some help, @sven1977?).

I want to customize RLlib's DQN so that it outputs n (say, 10) Q-values, where each Q-value uses a different discount factor gamma that is also passed as an input argument. I am trying to implement an architecture from this paper, shown on page 14, Figure 9. I have 2 questions:

  1. Can I define a CustomModel class using this RLlib example code to implement this architecture? Is this doable in a way that does not interfere with the rest of RLlib (which I am still learning and am no expert in)? I want to use a TF model. My custom model and its summary are appended at the end.

  2. What will happen to the config [gamma], given that I don't want a fixed gamma value (which can be passed to RLlib algorithms) but rather want to pass a list of gammas when the neural network is created? I am not sure how the config [gamma] will behave in this case.

Here is the model I made, based on the Custom DQN Model example:

import tensorflow as tf  # RLlib examples usually get tf via try_import_tf()

from ray.rllib.agents.dqn.distributional_q_tf_model import DistributionalQTFModel
from ray.rllib.models.tf.misc import normc_initializer

class HyperbolicDQNModel(DistributionalQTFModel):

    def __init__(self, obs_space, action_space, num_outputs, model_config,
                name,
                gamma_max=0.99,
                hyp_exponent=0.1,
                number_of_gammas=8,
                acting_policy = 'hyperbolic', # 'largest_gamma' 'hyperbolic'
                integral_estimate='lower',
                **kw):
        super(HyperbolicDQNModel, self).__init__(
            obs_space, action_space, num_outputs, model_config, name, **kw)

        # Define the core model layers which will be used by the other
        # output heads of DistributionalQModel
        self.inputs = tf.keras.layers.Input(
            shape=(84,84,4), name="observations") #changed

        self.inputs2 = tf.keras.layers.Input(
            shape=(2,), name='agent_indicator') #added

        layer_1 = tf.keras.layers.Conv2D(
                filters=32,
                kernel_size=[8, 8],
                strides=(4, 4),
                activation="relu",
                data_format='channels_last',
                name='layer1')(self.inputs)

        layer_2 = tf.keras.layers.Conv2D(
                filters=64,
                kernel_size=[4, 4],
                strides=(2, 2),
                activation="relu",
                data_format='channels_last',
                name='layer2')(layer_1)

        layer_3 = tf.keras.layers.Conv2D(
                filters=64,
                kernel_size=[3, 3],
                strides=(1, 1),
                activation="relu",
                data_format='channels_last',
                name='layer3')(layer_2)

        layer_4 = tf.keras.layers.Flatten(
                name='layer4')(layer_3)

        concat_layer = tf.keras.layers.Concatenate()([layer_4, self.inputs2]) #added

        layer_5 = tf.keras.layers.Dense(
                512,
                name="layer5",
                activation=tf.nn.relu, #renamed
                kernel_initializer=normc_initializer(1.0))(concat_layer) #changed

        q_values = []

        for i in range(number_of_gammas):
            gamma_q_value = tf.keras.layers.Dense(
                num_outputs,
                activation="linear",
                name = f'gamma_q_layer{i}',
                kernel_initializer=normc_initializer(1.0))(layer_5)
            q_values.append(gamma_q_value)

        # NOTE: eval_gammas and gammas come from the two util functions
        # mentioned in the P.S. at the end of this post.
        hyp_q_value = integrate_q_values(q_values, integral_estimate,
                                         eval_gammas, number_of_gammas,
                                         gammas)

        if acting_policy == 'largest_gamma':
            layer_out = q_values[-1]
        elif acting_policy == 'hyperbolic':
            layer_out = hyp_q_value

        #self.base_model = tf.keras.Model(self.inputs, layer_out)
        #self.base_model = tf.keras.Model([self.inputs, self.inputs2], layer_out) # this throws errors about Tensors
        self.base_model = tf.keras.Model([self.inputs, self.inputs2], q_values) #this shows a model closer to what I want, but I need to select one of largest_gamma or hyperbolic

    # Implement the core forward method.
    def forward(self, input_dict, state, seq_lens):
        #model_out = self.base_model(input_dict["obs"])
        model_out = self.base_model([input_dict["obs"][:,:,:,0:4], input_dict["obs"][:,0,0,4:6]])
        return model_out, state

    def metrics(self):
        return {"foo": tf.constant(42.0)}

Here is how I am trying to confirm the model output

from tensorflow.keras.utils import plot_model

hyper_model = HyperbolicDQNModel(obs_space, action_space, num_outputs, config, name='Hyperbolic')
print(hyper_model.base_model.summary())
plot_model(hyper_model.base_model, 'model.png', show_shapes=True)

Here is the model image

The tensor error, if I use layer_out, is as follows:

ValueError: Output is not a tensor: [[<tf.Tensor 'policy_0/model_2/gamma_q_layer0/BiasAdd:0' shape=(?, 256) dtype=float32>, <tf.Tensor 'policy_0/model_2/gamma_q_layer1/BiasAdd:0' shape=(?, 256) dtype=float32>, <tf.Tensor 'policy_0/model_2/gamma_q_layer2/BiasAdd:0' shape=(?, 256) dtype=float32>], <tf.Tensor 'policy_0/model_2/tf_op_layer_policy_0/add_2/policy_0/add_2:0' shape=(?, 256) dtype=float32>]

If it helps, I am trying to recreate this code from here in RLlib.

Can anyone please advise? Thanks.

P.S. The 2 util functions are from here.
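For readers, the gist of integrate_q_values (as I understand it) is a weighted, Riemann-sum style combination of the per-gamma Q-values; this is only a rough sketch with made-up names, not the actual util code:

import tensorflow as tf

def integrate_q_values_sketch(q_values, weights):
    # q_values: list of (batch, num_actions) tensors, one per gamma.
    # weights: one scalar per gamma, coming from the lower/upper Riemann
    #          estimate of the integral over gammas (the other util function).
    return tf.add_n([w * q for w, q in zip(weights, q_values)])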

mvindiola1 commented 3 years ago

Hi @rfali,

If by gamma you mean the reward discount factor, I do not think you can change that just by using a model. You will also need to update the loss function. This is where the DQN loss function computes the n_step discount factor: https://github.com/ray-project/ray/blob/1d834bcbe33c7714913fa06c7a7392c29eb7d71d/rllib/agents/dqn/dqn_tf_policy.py#L98.

The specific error you are getting is because you are returning a list of tensors, not a single tensor, for q_values. You need a q_values = tf.keras.layers.Concatenate()[q_values].

rfali commented 3 years ago

Thanks @mvindiola1 for the reply. Yes, by gamma I mean the reward discount factor. The model I am trying to make has not 1 but n gammas, and it would learn over each gamma to output a Q-value for that gamma. If there are 10 gammas, the network should (I think) output Q-values for all actions for each gamma. Pasting the model architecture from the paper quoted above (figures: a shared network trunk with one Q-value head per gamma).

If you look at the CustomModel summary/plot above, it looks like what it should be, or is it wrong? If num_gammas=8, the final layer has 8 heads, just as in the normal case of a single gamma there is 1 head (i.e. a Q-value for each action).

Thank you for pointing out the loss function calculation; I will get to it once I have the model set up correctly (tensors etc.).

So, as you suggested, I added this (after creating the n gamma_q_layers): q_values = tf.keras.layers.Concatenate()[q_values], however I get TypeError: 'Concatenate' object is not subscriptable.

By the way, when you wrote "because you are returning a list of tensors and not a tensor for q_values", don't you think that, as per the figure above, I should not be returning a single tensor for q_values but rather a list of tensors (one tensor per q_value)?

If you look at this network, do you think I am doing the right thing with RLlib's DistributionalQTFModel?

P.S. Here are some parallels (I will add more as I find them): argmax: source, RLlib.

mvindiola1 commented 3 years ago

@rfali, I will try to read the rest of your message soon. The syntax is just wrong: q_values = tf.keras.layers.Concatenate()(q_values). You will probably also have to make your action_space MultiDiscrete.

rfali commented 3 years ago

@mvindiola1 Here is a Colab I set up for running the model.

Let's say that instead of 1 Q-value, the model needs to output 2 Q-values (based on different gammas), and the action space is, say, 6 discrete actions. So instead of the last layer having 6 outputs, would we want to concatenate the last layers? Is that what you were suggesting?

If I look at the original paper (as well as the provided code), they used it on ALE, which has a Discrete action space.

rfali commented 3 years ago

Here is what I could figure out about how RLlib's DQN works under the hood.

When I build a custom model of the DistributionalQTFModel class, here is the data flow:

obs -> forward() -> model_out     
model_out -> get_q_value_distributions() -> Q(s, a) atoms    
model_out -> get_state_value() -> V(s)   

The get_q_value_distributions() takes in model_out: TensorType and outputs (action_scores, logits, dist) as List[TensorType].

We compute the q_values in compute_q_values(). If I am not wrong, the value is still a tensor here. For the vanilla DQN (which I am using, with num_atoms=1), this is just the action_scores.

The q_values are used here for the target network in build_q_losses() and are named q_tp1.

So now we have a q_value tensor of dimension num_actions, and here we take the argmax to get the action corresponding to the best q_value, which is named q_tp1_best.

The QLoss is calculated as follows: q_tp1_best is used to calculate q_tp1_best_masked, and eventually we reach the line you pointed to here, where gamma is used to calculate the target network's q_value; then the td_error and loss are calculated.

Please correct me if my understanding is wrong anywhere (it will help me avoid future bugs that I may introduce). I hope I can use this trail to change the gamma value used in the loss calculation, but I need to get the correct model first.
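For my own reference, the chain above boils down to roughly the following (a paraphrased sketch of the num_atoms=1 case in dqn_tf_policy.py, using the same variable names but not the exact code):

import tensorflow as tf
from ray.rllib.utils.tf_ops import huber_loss

def q_loss_sketch(q_t_selected, q_tp1, rewards, done_mask,
                  importance_weights, gamma=0.99, n_step=1):
    # Greedy action-value of the target network for s'.
    num_actions = q_tp1.shape.as_list()[1]
    q_tp1_best_one_hot = tf.one_hot(tf.argmax(q_tp1, 1), num_actions)
    q_tp1_best = tf.reduce_sum(q_tp1 * q_tp1_best_one_hot, 1)
    q_tp1_best_masked = (1.0 - done_mask) * q_tp1_best
    # The line where gamma enters (the one linked above):
    q_t_selected_target = rewards + gamma ** n_step * q_tp1_best_masked
    td_error = q_t_selected - tf.stop_gradient(q_t_selected_target)
    return tf.reduce_mean(importance_weights * huber_loss(td_error))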

rfali commented 3 years ago

I have completed the Colab to test the model with Atari Breakout. As expected, with no underlying changes to the RLlib functions and running with num_gammas=5, the following error is reported in the error.txt:

ray::DQN.__init__() (pid=1244, ip=172.28.0.2)
  File "python/ray/_raylet.pyx", line 488, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 495, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 505, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 449, in ray._raylet.execute_task.function_executor
  File "/usr/local/lib/python3.7/dist-packages/ray/_private/function_manager.py", line 556, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/ray/rllib/agents/trainer_template.py", line 121, in __init__
    Trainer.__init__(self, config, env, logger_creator)
  File "/usr/local/lib/python3.7/dist-packages/ray/rllib/agents/trainer.py", line 516, in __init__
    super().__init__(config, logger_creator)
  File "/usr/local/lib/python3.7/dist-packages/ray/tune/trainable.py", line 98, in __init__
    self.setup(copy.deepcopy(self.config))
  File "/usr/local/lib/python3.7/dist-packages/ray/rllib/agents/trainer.py", line 707, in setup
    self._init(self.config, self.env_creator)
  File "/usr/local/lib/python3.7/dist-packages/ray/rllib/agents/trainer_template.py", line 153, in _init
    num_workers=self.config["num_workers"])
  File "/usr/local/lib/python3.7/dist-packages/ray/rllib/agents/trainer.py", line 789, in _make_workers
    logdir=self.logdir)
  File "/usr/local/lib/python3.7/dist-packages/ray/rllib/evaluation/worker_set.py", line 98, in __init__
    spaces=spaces,
  File "/usr/local/lib/python3.7/dist-packages/ray/rllib/evaluation/worker_set.py", line 357, in _make_worker
    spaces=spaces,
  File "/usr/local/lib/python3.7/dist-packages/ray/rllib/evaluation/rollout_worker.py", line 514, in __init__
    self._build_policy_map(policy_dict, policy_config)
  File "/usr/local/lib/python3.7/dist-packages/ray/rllib/evaluation/rollout_worker.py", line 1155, in _build_policy_map
    policy_map[name] = cls(obs_space, act_space, merged_conf)
  File "/usr/local/lib/python3.7/dist-packages/ray/rllib/policy/tf_policy_template.py", line 237, in __init__
    get_batch_divisibility_req=get_batch_divisibility_req,
  File "/usr/local/lib/python3.7/dist-packages/ray/rllib/policy/dynamic_tf_policy.py", line 286, in __init__
    is_training=in_dict["is_training"])
  File "/usr/local/lib/python3.7/dist-packages/ray/rllib/agents/dqn/dqn_tf_policy.py", line 219, in get_distribution_inputs_and_class
    policy, model, {"obs": obs_batch}, state_batches=None, explore=explore)
  File "/usr/local/lib/python3.7/dist-packages/ray/rllib/agents/dqn/dqn_tf_policy.py", line 353, in compute_q_values
    dist) = model.get_q_value_distributions(model_out)
  File "/usr/local/lib/python3.7/dist-packages/ray/rllib/agents/dqn/distributional_q_tf_model.py", line 184, in get_q_value_distributions
    return self.q_value_head(model_out)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/base_layer_v1.py", line 761, in __call__
    self.name)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/input_spec.py", line 207, in assert_input_compatibility
    ' input tensors. Inputs received: ' + str(inputs))
ValueError: Layer model expects 1 input(s), but it received 5 input tensors.
Inputs received: [<tf.Tensor 'default_policy/model_2/gamma_q_layer0/BiasAdd:0' shape=(?, 256) dtype=float32>, <tf.Tensor 'default_policy/model_2/gamma_q_layer1/BiasAdd:0' shape=(?, 256) dtype=float32>, <tf.Tensor 'default_policy/model_2/gamma_q_layer2/BiasAdd:0' shape=(?, 256) dtype=float32>, <tf.Tensor 'default_policy/model_2/gamma_q_layer3/BiasAdd:0' shape=(?, 256) dtype=float32>, <tf.Tensor 'default_policy/model_2/gamma_q_layer4/BiasAdd:0' shape=(?, 256) dtype=float32>]

So the question is: how can I connect one of these last 5 gamma heads to a single output layer? That would cater for the largest_gamma case. For the hyperbolic case, I just need to calculate an integral over the q_values of the last layer to get a new q_value and connect that q_value to the output layer.

Any suggestions will be highly appreciated @sven1977 @mvindiola1 @ericl. I think this particular implementation of multiple horizons got missed by the community and could perhaps become part of RLlib's suite of algorithms in the future. Thank you.

rfali commented 3 years ago

Hi RLlib Team,

I have made some progress in terms of what needs to be implemented in order to have a multi-headed DQN. These are what I think would be the major steps (for example, a 3-headed DQN, meaning num_gammas=3), with a few questions below:

  1. Have a custom model sub-classed from DistributionalQTFModel (like the model I made above), which essentially means modifying the network to output 3 sets of Q-values, one for each gamma. I have done this, and since the error says model expects 1 input(s), but it received 3 input tensors, I think this part works.
  2. Modify the distributional_q_tf_model.py file to accept multiple tensors (should it be done here?).
  3. Calculate the n-step return for each gamma.
  4. Calculate the TD targets for each gamma.
  5. Calculate gamma_loss (separately) for each gamma (Q-head).
  6. Aggregate the gamma_losses and scale (divide by num_gammas); a rough sketch of steps 3-6 follows this list.
  7. Compute the gradient of total_loss.
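A rough, hypothetical sketch of steps 3-6, reusing the q_loss_sketch() from my earlier comment (one TD loss per gamma head, then averaged over the heads; all names here are mine, not RLlib's):

import tensorflow as tf

def multi_gamma_loss_sketch(q_t_selected_heads, q_tp1_heads, rewards,
                            done_mask, importance_weights, gammas, n_step=1):
    # Steps 3-5: one TD loss per (Q-head, gamma) pair.
    losses = [
        q_loss_sketch(q_t_sel, q_tp1, rewards, done_mask,
                      importance_weights, gamma=g, n_step=n_step)
        for q_t_sel, q_tp1, g in zip(q_t_selected_heads, q_tp1_heads, gammas)
    ]
    # Step 6: aggregate and scale by num_gammas.
    return tf.add_n(losses) / len(losses)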

I have the following questions:

  1. For my custom model, should I subclass DistributionalQTFModel as above, or TFModelV2? I am implementing this version of DQN: double, dueling, PER and n-step TRUE; C51 and noisy FALSE.
  2. In distributional_q_tf_model.py, I probably need to change here to accept a list of TensorType objects instead of a single TensorType? How can I do that?
  3. Since I have to make changes to the loss calculation, which is done in dqn_tf_policy.py, how can I have a custom policy while running an APEX agent? The QLoss is calculated here and assigned to the policy here. AFAIK I can define a custom distributional_q_tf_model.py, but I am not sure how I can have a custom dqn_tf_policy-like file.

Thank you.

rfali commented 3 years ago

@mvindiola1 @sven1977 what should I change here if I want to receive multiple q-heads instead of 1?

rfali commented 3 years ago

A reply from @sven1977 for No. 3 below was received here on Ray Discourse.

  1. Since I have to make changes to the loss calculation, which is done in dqn_tf_policy.py, how can I have a custom policy while running an APEX agent? The QLoss is calculated here and assigned to the policy here. AFAIK I can define a custom distributional_q_tf_model.py, but I am not sure how I can have a custom dqn_tf_policy-like file.

A reply is still awaited for the items below (No. 2 is the same as this thread on Ray Discourse and is critical for progress):

  1. For my custom model, should I subclass DistributionalQTFModel as above, or TFModelV2? I am implementing this version of DQN: double, dueling, PER and n-step TRUE; C51 and noisy FALSE.
  2. In distributional_q_tf_model.py, I probably need to change here to accept a list of TensorType objects instead of a single TensorType? How can I do that?

mvindiola1 commented 3 years ago

Hi @rfali,

What I would try next is to move the Q-value layers out of the forward call. The call to forward would return layer 5.

If you do that, and I am following the code logic correctly, then layer 5 will be the input to get_q_value_distributions. You would then override the base get_q_value_distributions method with your own that computes the different Q_gamma_i. This should be straightforward in your case, since each of the Qs branches from a common input layer.

You might then have to override build_action_value to compute the final action.
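Something like this untested sketch (assuming num_atoms=1, so the method returns (action_scores, logits, dist) like the parent class does, and assuming you keep your per-gamma Dense heads and integration helpers on the model yourself):

# Untested sketch, not actual RLlib code. Assumes forward() returns the
# 512-d layer_5 embedding and that self.gamma_q_heads is a list of Dense
# layers (one per gamma) created in __init__.
def get_q_value_distributions(self, model_out):
    q_heads = [head(model_out) for head in self.gamma_q_heads]
    # Act on the hyperbolic estimate (or q_heads[-1] for largest_gamma).
    action_scores = integrate_q_values(q_heads, self.integral_estimate,
                                       self.eval_gammas,
                                       self.number_of_gammas, self.gammas)
    # With num_atoms=1 the parent returns dummy logits/dist of ones.
    logits = tf.expand_dims(tf.ones_like(action_scores), -1)
    dist = tf.expand_dims(tf.ones_like(action_scores), -1)
    return action_scores, logits, dist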

rfali commented 3 years ago

Hi @mvindiola1. Thanks for your reply. I appreciate the community's help, as I sense that the RLlib support team is stretched very thin.

I am not sure what you meant by "move the Q-value layers out of the forward call", as I think I need the different Q-value layers as my output. Also, "the call to forward would return layer 5" is not what I want, or perhaps I am not understanding what you meant by this.

You are right about the input to get_q_value_distributions and about overriding methods, as I explained here. I have explained here on Ray Discourse what I am currently attempting and where I am stuck.

To summarize, I need multiple Q-heads, and if I feed that into RLlib's get_q_value_distributions, it complains. As far as I understand the RLlib code logic, when I sub-class DistributionalQTFModel, the model is instantiated and it receives the model output at L#73. I think it is trying to set up an Input layer with the custom model's output (due to its shape being (num_outputs,)).

I tried to concatenate the multiple heads at this point, but the Keras layer complained. It seemingly works fine when I use this code to instantiate multiple input layers in a separate notebook cell, as in:

from tensorflow.keras.layers import Input, Concatenate
from tensorflow.keras.models import Model

def build_model(num_input_layers, input_shape):
    inputs = []
    for i in range(num_input_layers):
        input = Input(shape=(input_shape,), name='input{0}'.format(i))
        inputs.append(input)
    concat = Concatenate(name='concat')(inputs)
    model = Model(inputs,concat)
    model.compile('sgd','categorical_crossentropy')
    return model
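Calling it in isolation builds and summarizes fine, for example (hypothetical values, just to illustrate):

m = build_model(num_input_layers=3, input_shape=256)
m.summary()  # shows input0..input2 feeding the concat layer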

But when I do this at L#73 with this,

inputs = []
for i in range(num_gammas):
    input = tf.keras.layers.Input(shape=(num_outputs,), name='input{0}'.format(i))
    inputs.append(input)
out_layer = tf.keras.layers.Concatenate(name='concat')(inputs)
self.model_out = out_layer

q_out = build_action_value(name + "/action_value/", self.model_out)
self.q_value_head = tf.keras.Model(self.model_out, q_out)

it complains,

(pid=38071)   File "/home/farrukh/workspace/hdqn/hdqn_model.py", line 27, in __init__
(pid=38071)     super(AtariModel, self).__init__(
(pid=38071)   File "/home/farrukh/miniconda3/envs/env_hdqn/lib/python3.8/site-packages/ray/rllib/agents/dqn/distributional_q_tf_model.py", line 196, in __init__
(pid=38071)     self.q_value_head = tf.keras.Model(self.model_out, q_out)
(pid=38071)   File "/home/farrukh/miniconda3/envs/env_hdqn/lib/python3.8/site-packages/tensorflow/python/training/tracking/base.py", line 517, in _method_wrapper
(pid=38071)     result = method(self, *args, **kwargs)
(pid=38071)   File "/home/farrukh/miniconda3/envs/env_hdqn/lib/python3.8/site-packages/tensorflow/python/keras/engine/functional.py", line 120, in __init__
(pid=38071)     self._init_graph_network(inputs, outputs)
(pid=38071)   File "/home/farrukh/miniconda3/envs/env_hdqn/lib/python3.8/site-packages/tensorflow/python/training/tracking/base.py", line 517, in _method_wrapper
(pid=38071)     result = method(self, *args, **kwargs)
(pid=38071)   File "/home/farrukh/miniconda3/envs/env_hdqn/lib/python3.8/site-packages/tensorflow/python/keras/engine/functional.py", line 203, in _init_graph_network
(pid=38071)     nodes, nodes_by_depth, layers, _ = _map_graph_network(
(pid=38071)   File "/home/farrukh/miniconda3/envs/env_hdqn/lib/python3.8/site-packages/tensorflow/python/keras/engine/functional.py", line 985, in _map_graph_network
(pid=38071)     raise ValueError('Graph disconnected: '
(pid=38071) ValueError: Graph disconnected: cannot obtain value for tensor Tensor("policy_0/input0:0", shape=(?, 256), dtype=float32) at layer "concat". The following previous layers were accessed without issue: []

rfali commented 3 years ago

So instead of outputting multiple heads, receiving them in DistributionalQTFModel, and trying to work with the separate Q-heads, I have concatenated the multiple heads into one layer as my custom model output. Then I modify L#73 to receive a single tensor, changing the shape from shape=(num_outputs,) to shape=(num_outputs * number_of_gammas,).

I will then attempt to split this tensor in dqn_tf_policy, either in:

  1. get_distribution_inputs_and_class() L#218 or
  2. build_q_losses() L#241 and L#248

and calculate the loss for each Q-head using its respective gamma value (which will be indexed). I am not sure what happens to the q-values in get_distribution_inputs_and_class(), but I can see that, in order to change the loss function, I need to make changes in build_q_losses(); a small split sketch follows.
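A minimal sketch of the split I have in mind (hypothetical helper, to be used inside build_q_losses(); q_concat would be the custom model's output of shape (batch, num_outputs * number_of_gammas)):

import tensorflow as tf

def split_q_heads(q_concat, number_of_gammas):
    # Recover one (batch, num_actions) tensor per gamma from the
    # concatenated custom-model output.
    return tf.split(q_concat, num_or_size_splits=number_of_gammas, axis=1)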

rfali commented 3 years ago

This is the current model

# (imports as in the first model above: tf, DistributionalQTFModel, normc_initializer)
class HyperbolicDQNModel5(DistributionalQTFModel):
    """Custom model for DQN."""

    def __init__(self, obs_space, action_space, num_outputs, model_config,
                name, number_of_gammas=3, **kw):
        super(HyperbolicDQNModel5, self).__init__(
            obs_space, action_space, num_outputs, model_config, name, **kw)

        # Define the core model layers which will be used by the other
        # output heads of DistributionalQModel
        self.inputs = tf.keras.layers.Input(
            shape=(84,84,4), name="observations")      
        self.inputs2 = tf.keras.layers.Input(
            shape=(2,), name='agent_indicator') 

        layer_1 = tf.keras.layers.Conv2D(filters=32,kernel_size=[8, 8],
                      strides=(4, 4),activation="relu",data_format='channels_last',
                      name='layer1')(self.inputs)
        layer_2 = tf.keras.layers.Conv2D(filters=64,kernel_size=[4, 4],
                      strides=(2, 2),activation="relu",data_format='channels_last',
                      name='layer2')(layer_1)
        layer_3 = tf.keras.layers.Conv2D(filters=64,kernel_size=[3, 3],
                      strides=(1, 1),activation="relu",data_format='channels_last',
                      name='layer3')(layer_2)
        layer_4 = tf.keras.layers.Flatten(name='layer4')(layer_3)
        concat_layer = tf.keras.layers.Concatenate()([layer_4, self.inputs2])
        layer_5 = tf.keras.layers.Dense(512,name="layer5",
                      activation=tf.nn.relu, 
                      kernel_initializer=normc_initializer(1.0))(concat_layer)

        q_values = []
        for i in range(number_of_gammas):
            gamma_q_value = tf.keras.layers.Dense(num_outputs,activation="linear",
                                              name = f'gamma_q_layer{i}',
                                              kernel_initializer=normc_initializer(1.0))(layer_5)
            q_values.append(gamma_q_value)

        q_values_concat = tf.keras.layers.Concatenate()(q_values)

        self.base_model = tf.keras.Model([self.inputs, self.inputs2], q_values_concat)  

    # Implement the core forward method.
    def forward(self, input_dict, state, seq_lens):
        model_out = self.base_model([input_dict["obs"][:,:,:,0:4], input_dict["obs"][:,0,0,4:6]])
        return model_out, state
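For completeness, this is roughly how such a model gets plugged in (standard custom-model registration; the number_of_gammas entry under custom_model_config is my own assumption about how the kwarg reaches __init__):

from ray.rllib.models import ModelCatalog

ModelCatalog.register_custom_model("hyperbolic_dqn", HyperbolicDQNModel5)

model_config = {
    "custom_model": "hyperbolic_dqn",
    # Extra kwargs forwarded to the custom model's constructor.
    "custom_model_config": {"number_of_gammas": 3},
}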

stale[bot] commented 3 years ago

Hi, I'm a bot from the Ray team :)

To help human contributors to focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public slack channel.

stale[bot] commented 2 years ago

Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!