dxu23nc opened this issue 4 years ago (status: Open)
@ericl @richardliaw It seems that the above issue has gone away with recent ray versions, not sure what resolved it.
ray version 1.3.0, tensorflow 2.34.1
If I run the following code
import ray
from ray.rllib.agents import dqn
ray.init()
config = dqn.DEFAULT_CONFIG.copy()
config["num_workers"] = 1
config["hiddens"] = [512, 128]
trainer = dqn.DQNTrainer(config=config, env="AirRaid-ram-v0")
model = trainer.get_policy().model
model.base_model.summary()
model.q_value_head.summary()
I get the following output
Model: "model"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
observations (InputLayer) [(None, 128)] 0
__________________________________________________________________________________________________
fc_1 (Dense) (None, 256) 33024 observations[0][0]
__________________________________________________________________________________________________
fc_out (Dense) (None, 256) 65792 fc_1[0][0]
__________________________________________________________________________________________________
value_out (Dense) (None, 1) 257 fc_1[0][0]
==================================================================================================
Total params: 99,073
Trainable params: 99,073
Non-trainable params: 0
__________________________________________________________________________________________________
Model: "model_1"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
model_out (InputLayer) [(None, 256)] 0
__________________________________________________________________________________________________
hidden_0 (Dense) (None, 512) 131584 model_out[0][0]
__________________________________________________________________________________________________
hidden_1 (Dense) (None, 128) 65664 hidden_0[0][0]
__________________________________________________________________________________________________
dense (Dense) (None, 6) 774 hidden_1[0][0]
__________________________________________________________________________________________________
tf_op_layer_default_policy/ones [(2,)] 0 dense[0][0]
__________________________________________________________________________________________________
tf_op_layer_default_policy/ones [(2,)] 0 dense[0][0]
__________________________________________________________________________________________________
tf_op_layer_default_policy/ones [(None, 6)] 0 tf_op_layer_default_policy/ones_l
__________________________________________________________________________________________________
tf_op_layer_default_policy/ones [(None, 6)] 0 tf_op_layer_default_policy/ones_l
__________________________________________________________________________________________________
tf_op_layer_default_policy/Expa [(None, 6, 1)] 0 tf_op_layer_default_policy/ones_l
__________________________________________________________________________________________________
tf_op_layer_default_policy/Expa [(None, 6, 1)] 0 tf_op_layer_default_policy/ones_l
==================================================================================================
Total params: 198,022
Trainable params: 198,022
Non-trainable params: 0
which seems accurate to me. The above output corresponds to
[base_model] obs_space_size -> 2 dense layers of 256 each (last being called fc_out)
[q_head] fc_out -> 2 q_hidden layers [512, 128] -> Q_values of action_space_size
i.e.
[128 observations] -> [256] -> [256] -> [512] -> [128] -> [6 actions]
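The parameter counts in the summaries agree with this layout. As a quick sanity check, here is a plain-Python calculation (no Ray required), assuming standard Dense layers with a bias term:

```python
def dense_params(n_in, n_out):
    """Parameters of a fully connected layer: weights plus biases."""
    return n_in * n_out + n_out

# base_model: 128 observations -> 256 (fc_1) -> 256 (fc_out),
# plus a 1-unit value_out branching off fc_1
assert dense_params(128, 256) == 33024   # fc_1
assert dense_params(256, 256) == 65792   # fc_out
assert dense_params(256, 1) == 257       # value_out
assert 33024 + 65792 + 257 == 99073      # total params of base_model

# q_value_head: fc_out (256) -> 512 -> 128 -> 6 actions
assert dense_params(256, 512) == 131584  # hidden_0
assert dense_params(512, 128) == 65664   # hidden_1
assert dense_params(128, 6) == 774       # dense (Q-values)
assert 131584 + 65664 + 774 == 198022    # total params of q_value_head
print("all parameter counts match")
```

Every figure in the two summaries above is reproduced exactly, which supports the layer-by-layer reading of the architecture.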
This is because the default value of fcnet_hiddens is [256, 256]. The last few (parameter-free) layers in the q_head correspond to the layers used when distributional Q-learning (C51) is enabled.
The q_hiddens argument, i.e. config["hiddens"] (this naming inconsistency should be addressed), specifies the sizes of the dense layers after the Advantage/Value split, as documented in DistributionalQTFModel.
If you set dueling to True
import ray
from ray.rllib.agents import dqn
ray.init()
config = dqn.DEFAULT_CONFIG.copy()
config["num_workers"] = 1
config["hiddens"] = [512, 128]
config["dueling"] = True
trainer = dqn.DQNTrainer(config=config, env="AirRaid-ram-v0")
model = trainer.get_policy().model
model.base_model.summary()
model.q_value_head.summary()
model.state_value_head.summary()
we get
Model: "model"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
observations (InputLayer) [(None, 128)] 0
__________________________________________________________________________________________________
fc_1 (Dense) (None, 256) 33024 observations[0][0]
__________________________________________________________________________________________________
fc_out (Dense) (None, 256) 65792 fc_1[0][0]
__________________________________________________________________________________________________
value_out (Dense) (None, 1) 257 fc_1[0][0]
==================================================================================================
Total params: 99,073
Trainable params: 99,073
Non-trainable params: 0
__________________________________________________________________________________________________
Model: "model_1"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
model_out (InputLayer) [(None, 256)] 0
__________________________________________________________________________________________________
hidden_0 (Dense) (None, 512) 131584 model_out[0][0]
__________________________________________________________________________________________________
hidden_1 (Dense) (None, 128) 65664 hidden_0[0][0]
__________________________________________________________________________________________________
dense (Dense) (None, 6) 774 hidden_1[0][0]
__________________________________________________________________________________________________
tf_op_layer_default_policy/ones [(2,)] 0 dense[0][0]
__________________________________________________________________________________________________
tf_op_layer_default_policy/ones [(2,)] 0 dense[0][0]
__________________________________________________________________________________________________
tf_op_layer_default_policy/ones [(None, 6)] 0 tf_op_layer_default_policy/ones_l
__________________________________________________________________________________________________
tf_op_layer_default_policy/ones [(None, 6)] 0 tf_op_layer_default_policy/ones_l
__________________________________________________________________________________________________
tf_op_layer_default_policy/Expa [(None, 6, 1)] 0 tf_op_layer_default_policy/ones_l
__________________________________________________________________________________________________
tf_op_layer_default_policy/Expa [(None, 6, 1)] 0 tf_op_layer_default_policy/ones_l
==================================================================================================
Total params: 198,022
Trainable params: 198,022
Non-trainable params: 0
__________________________________________________________________________________________________
Model: "model_2"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
model_out (InputLayer) [(None, 256)] 0
_________________________________________________________________
dense_1 (Dense) (None, 512) 131584
_________________________________________________________________
dense_2 (Dense) (None, 128) 65664
_________________________________________________________________
dense_3 (Dense) (None, 1) 129
=================================================================
Total params: 197,377
Trainable params: 197,377
Non-trainable params: 0
which corresponds to
shared layer
[128 observations] -> [256] -> [256] (this splits into following two layers)
-> [512] -> [128] -> [6 Q values for the 6 possible actions]
-> [512] -> [128] -> [1 value]
The last [256] is the common layer; it splits into two branches, one each for the Advantage and Value streams, as described in the Dueling DQN paper.
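For reference, the two streams are then recombined as in the Dueling DQN paper (Wang et al.), with the mean advantage subtracted as a baseline. A minimal plain-Python sketch of that aggregation (illustrative only, not RLlib's actual code):

```python
def dueling_q_values(state_value, advantages):
    """Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + a - mean_adv for a in advantages]

# Example: V(s) = 1.0 and six advantage logits (one per action).
# The mean advantage here is 0.0, so Q = V + A for each action.
q = dueling_q_values(1.0, [0.5, -0.5, 0.0, 1.0, -1.0, 0.0])
assert q == [1.5, 0.5, 1.0, 2.0, 0.0, 1.0]
```

Subtracting the mean advantage makes the V/A decomposition identifiable; without it, adding a constant to V and subtracting it from all advantages would leave Q unchanged.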
However, I have stumbled upon a corner case, which could be a lack of understanding on my part. If you set hiddens to an empty list and set dueling to True, then V(s) is taken from the state_value_head. This makes for an interesting case, since that value head is then connected not to the last shared dense layer but to the output of the Q-head, which would be wrong. Here is reproduction code.
# CartPole with q-hiddens ON and OFF
import ray
from ray.rllib.agents import dqn
ray.init()
config = dqn.DEFAULT_CONFIG.copy()
config["num_workers"] = 1
config["model"]["fcnet_hiddens"] = [64, 64]
config["hiddens"] = []  # [128, 128]
config["dueling"] = True
trainer = dqn.DQNTrainer(config=config, env="CartPole-v0")
model = trainer.get_policy().model
model.base_model.summary()
model.q_value_head.summary()
model.state_value_head.summary()
With config["hiddens"] = [], we have the following output:
Model: "model"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
observations (InputLayer) [(None, 4)] 0
__________________________________________________________________________________________________
fc_1 (Dense) (None, 64) 320 observations[0][0]
__________________________________________________________________________________________________
fc_2 (Dense) (None, 64) 4160 fc_1[0][0]
__________________________________________________________________________________________________
fc_out (Dense) (None, 2) 130 fc_2[0][0]
__________________________________________________________________________________________________
value_out (Dense) (None, 1) 65 fc_2[0][0]
==================================================================================================
Total params: 4,675
Trainable params: 4,675
Non-trainable params: 0
__________________________________________________________________________________________________
Model: "model_1"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
model_out (InputLayer) [(None, 2)] 0
__________________________________________________________________________________________________
tf_op_layer_default_policy/ones [(2,)] 0 model_out[0][0]
__________________________________________________________________________________________________
tf_op_layer_default_policy/ones [(2,)] 0 model_out[0][0]
__________________________________________________________________________________________________
tf_op_layer_default_policy/ones [(None, 2)] 0 tf_op_layer_default_policy/ones_l
__________________________________________________________________________________________________
tf_op_layer_default_policy/ones [(None, 2)] 0 tf_op_layer_default_policy/ones_l
__________________________________________________________________________________________________
tf_op_layer_default_policy/Expa [(None, 2, 1)] 0 tf_op_layer_default_policy/ones_l
__________________________________________________________________________________________________
tf_op_layer_default_policy/Expa [(None, 2, 1)] 0 tf_op_layer_default_policy/ones_l
==================================================================================================
Total params: 0
Trainable params: 0
Non-trainable params: 0
__________________________________________________________________________________________________
Model: "model_2"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
model_out (InputLayer) [(None, 2)] 0
_________________________________________________________________
dense (Dense) (None, 1) 3
=================================================================
Total params: 3
Trainable params: 3
Non-trainable params: 0
_________________________________________________________________
What bothers me is the part in model_2 where the state_value_head (the last layer, named dense) is connected to the fc_out layer of shape (None, 2) (i.e. the model_out of the base model) instead of to fc_2 (Dense) of shape (None, 64).
Note that the value_out layer in the base model (the first model in the above output), i.e. value_out (Dense) of shape (None, 1) with 65 params, would not be used here since dueling is set to True; the state_value_head model output would be used instead (as can be seen in the dqn_tf_policy code).
Just to present the alternative here as well, we get the following output with hiddens set to [128, 128]:
Model: "model"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
observations (InputLayer) [(None, 4)] 0
__________________________________________________________________________________________________
fc_1 (Dense) (None, 64) 320 observations[0][0]
__________________________________________________________________________________________________
fc_out (Dense) (None, 64) 4160 fc_1[0][0]
__________________________________________________________________________________________________
value_out (Dense) (None, 1) 65 fc_1[0][0]
==================================================================================================
Total params: 4,545
Trainable params: 4,545
Non-trainable params: 0
__________________________________________________________________________________________________
Model: "model_1"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
model_out (InputLayer) [(None, 64)] 0
__________________________________________________________________________________________________
hidden_0 (Dense) (None, 128) 8320 model_out[0][0]
__________________________________________________________________________________________________
hidden_1 (Dense) (None, 128) 16512 hidden_0[0][0]
__________________________________________________________________________________________________
dense (Dense) (None, 2) 258 hidden_1[0][0]
__________________________________________________________________________________________________
tf_op_layer_default_policy/ones [(2,)] 0 dense[0][0]
__________________________________________________________________________________________________
tf_op_layer_default_policy/ones [(2,)] 0 dense[0][0]
__________________________________________________________________________________________________
tf_op_layer_default_policy/ones [(None, 2)] 0 tf_op_layer_default_policy/ones_l
__________________________________________________________________________________________________
tf_op_layer_default_policy/ones [(None, 2)] 0 tf_op_layer_default_policy/ones_l
__________________________________________________________________________________________________
tf_op_layer_default_policy/Expa [(None, 2, 1)] 0 tf_op_layer_default_policy/ones_l
__________________________________________________________________________________________________
tf_op_layer_default_policy/Expa [(None, 2, 1)] 0 tf_op_layer_default_policy/ones_l
==================================================================================================
Total params: 25,090
Trainable params: 25,090
Non-trainable params: 0
__________________________________________________________________________________________________
Model: "model_2"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
model_out (InputLayer) [(None, 64)] 0
_________________________________________________________________
dense_1 (Dense) (None, 128) 8320
_________________________________________________________________
dense_2 (Dense) (None, 128) 16512
_________________________________________________________________
dense_3 (Dense) (None, 1) 129
=================================================================
Total params: 24,961
Trainable params: 24,961
Non-trainable params: 0
Of course this won't happen with the default value of hiddens, which is already [256]; that enforces a dense layer after the last fcnet_hiddens layer, a detail which is obscured for a beginner.
The problem is that with the default DQN settings (dueling set to True, hiddens set to [256], while fcnet_hiddens is also [256, 256] in the model catalog), the model becomes
[obs_space_size] -> [256] -> [256] -> [256] -> Q-values
Or for that matter, when a user specifies fcnet_hiddens = [64], as in the CartPole DQN tuned example, what they expect is a
[obs_space_size] -> [64] -> Q-values
but they get
[obs_space_size] -> [64] -> [256] -> Q-values
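To make the claim concrete, here is a hypothetical helper (not part of RLlib) that composes the effective fully connected stack from the two configuration lists, per the behavior described above, where hiddens is always appended after the fcnet layers:

```python
def effective_stack(fcnet_hiddens, hiddens, num_actions):
    """Hidden-layer widths from observations to Q-values, assuming
    hiddens is appended after fcnet_hiddens (as observed above)."""
    return fcnet_hiddens + hiddens + [num_actions]

# CartPole (2 actions) with fcnet_hiddens=[64] and the default hiddens=[256]:
assert effective_stack([64], [256], 2) == [64, 256, 2]  # what you actually get
# What the user likely expected from fcnet_hiddens=[64] alone:
assert effective_stack([64], [], 2) == [64, 2]
```

The extra 256-wide layer sneaks in via the default hiddens value, which is easy to miss when only config["model"]["fcnet_hiddens"] is being tuned.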
If my understanding of the RLlib architecture is correct, this is not what the user actually asked for (or at least expected). Even if you set dueling to False, you will get the same model architecture as above. The following is reproduction code.
import ray
from ray.rllib.agents import dqn
ray.init()
config = dqn.DEFAULT_CONFIG.copy()
config["num_workers"] = 1
config["model"]["fcnet_hiddens"] = [64]
config["dueling"] = False
trainer = dqn.DQNTrainer(config=config, env="CartPole-v0")
model = trainer.get_policy().model
model.base_model.summary()
model.q_value_head.summary()
Output
Model: "model"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
observations (InputLayer) [(None, 4)] 0
__________________________________________________________________________________________________
fc_out (Dense) (None, 64) 320 observations[0][0]
__________________________________________________________________________________________________
value_out (Dense) (None, 1) 5 observations[0][0]
==================================================================================================
Total params: 325
Trainable params: 325
Non-trainable params: 0
__________________________________________________________________________________________________
Model: "model_1"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
model_out (InputLayer) [(None, 64)] 0
__________________________________________________________________________________________________
hidden_0 (Dense) (None, 256) 16640 model_out[0][0]
__________________________________________________________________________________________________
dense (Dense) (None, 2) 514 hidden_0[0][0]
__________________________________________________________________________________________________
tf_op_layer_default_policy/ones [(2,)] 0 dense[0][0]
__________________________________________________________________________________________________
tf_op_layer_default_policy/ones [(2,)] 0 dense[0][0]
__________________________________________________________________________________________________
tf_op_layer_default_policy/ones [(None, 2)] 0 tf_op_layer_default_policy/ones_l
__________________________________________________________________________________________________
tf_op_layer_default_policy/ones [(None, 2)] 0 tf_op_layer_default_policy/ones_l
__________________________________________________________________________________________________
tf_op_layer_default_policy/Expa [(None, 2, 1)] 0 tf_op_layer_default_policy/ones_l
__________________________________________________________________________________________________
tf_op_layer_default_policy/Expa [(None, 2, 1)] 0 tf_op_layer_default_policy/ones_l
==================================================================================================
Total params: 17,154
Trainable params: 17,154
Non-trainable params: 0
One solution could be: if hiddens is only meant for the dueling architecture, it should be used only when dueling is also set to True. AFAIK, this is not the case, as can be seen from the above output.
Since this is a design level question, I am going to tag @sven1977 and @ericl to see if they can clarify. Thank you very much for reading through this. I appreciate your help in furthering my understanding.
Can this be related (although the above question is not about VisionNets)? PR#18306: for built-in VisionNets, the value branch (in case we have a shared one) is attached to the action-logits output instead of the feature output (one layer before the action-logits output).
There definitely appears to be some design inconsistency in how the models are ultimately created when config["hiddens"]=[] versus when config["hiddens"] is non-empty.
If:
observations (InputLayer) =[(None, 16)]
config["model"]["fcnet_hiddens"] = [20, 17, 10]
num_actions=3
config["hiddens"]=[]
results in
base_model: [Nx16]->[Nx20]->[Nx17]->[Nx10]-> (which splits into [Nx1] and [Nx3])
q_value_head: [Nx3] -> (splits into two of [Nx3x1] )
state_value_head: [Nx3] -> [Nx1]
whereas config["hiddens"]=[256]
results in
base_model: [Nx16]->[Nx20]->[Nx17] -> (splits into [Nx1] and into [Nx10])
q_value_head: [Nx10]->[Nx256]->(splits into two of [Nx3x1] )
state_value_head: [Nx10]->[Nx256]->[Nx1]
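The interface-size rule implied by the two cases above can be sketched as a small hypothetical helper (mirroring the described behavior, not RLlib code): when hiddens is empty, fc_out is built with num_actions units and both heads attach to it; when hiddens is non-empty, fc_out keeps the last fcnet_hiddens width.

```python
def head_input_size(fcnet_hiddens, hiddens, num_actions):
    """Width of model_out, i.e. the input to q_value_head and
    state_value_head, per the behavior described above."""
    if not hiddens:
        # fc_out is sized to the action space; the heads see a Q-sized output
        return num_actions
    # fc_out keeps the last fcnet hidden width; the heads see the features
    return fcnet_hiddens[-1]

assert head_input_size([20, 17, 10], [], 3) == 3      # hiddens=[] case
assert head_input_size([20, 17, 10], [256], 3) == 10  # hiddens=[256] case
```

The branch on `hiddens` being empty is exactly the inconsistency in question: the meaning of model_out (features vs. action-sized logits) changes depending on a seemingly unrelated setting.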
With everything, especially config["model"]["fcnet_hiddens"], identical, why should the state_value_head input size be [Nx3] in one case but [Nx10] in the other?
From the consistency point of view, for state_value_head, EITHER
A. [Nx3]->[Nx1] (with config["hiddens"]=[]) implies [Nx3]->[Nx256]->[Nx1] (with config["hiddens"]=[256])
OR
B. [Nx10]->[Nx256]->[Nx1] (with config["hiddens"]=[256]) implies [Nx10]->[Nx1] (with config["hiddens"]=[])
Similar arguments could be made for q_value_head and base_model, in terms of the size of the input they use and the output they produce, respectively.
Most critically, the value_head of base_model (with config["hiddens"]=[256]) uses the 10-sized fcnet_hiddens[-1] layer, while the other base_model bypasses it completely to compute value_head.
Here is how the two models (hiddens=[] vs. hiddens=[256]) look pictorially, with all other hyperparameters otherwise identical.
Please correct me if I am wrong, but if one assumes the 3 constituent models obtained with hiddens=[256] to be the right implementation, then with hiddens=[] one would expect that only the 256-node dense layer gets deleted, with no change in the interface between the 3 constituent models. Looking at the interfaces though (i.e., output of base_model = input of state_value_head = input of q_value_head), the interface size changes from 10 to 3 between hiddens=[256] and hiddens=[] respectively. Wouldn't it be easier if the fcnet_hiddens layers dedicatedly computed value_out and advantages/q consistently? On the other hand, if we say that fc_out computes features rather than advantages or Q-values, wouldn't it be easier, for consistency's sake, if the last fcnet_hiddens layer (of size 10) dedicatedly computed the features in both cases (hiddens=[256] vs. hiddens=[])?
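The consistent variant argued for above could be sketched as follows (a hypothetical rule, not RLlib code): the last fcnet_hiddens layer always provides the features, and an empty hiddens merely drops the extra dense layers without changing the interface between the constituent models.

```python
def proposed_heads(fcnet_hiddens, hiddens, num_actions):
    """Proposed consistent split: both heads always attach to the last
    fcnet_hiddens layer; hiddens=[] only removes the extra dense layers,
    leaving the interface width unchanged."""
    feat = fcnet_hiddens[-1]
    q_head = [feat] + hiddens + [num_actions]
    v_head = [feat] + hiddens + [1]
    return q_head, v_head

# Interface stays at 10 in both cases, unlike the current behavior:
assert proposed_heads([20, 17, 10], [256], 3) == ([10, 256, 3], [10, 256, 1])
assert proposed_heads([20, 17, 10], [], 3) == ([10, 3], [10, 1])
```

Under this rule the hiddens=[] configuration would be a strict subset of the hiddens=[256] one, satisfying option B from the EITHER/OR above.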
Hi, any comments on what vedhasam described?
I noticed that there are two model configuration parameters: fcnet_hiddens in model_config and hiddens in the configuration of dqn.py. When I print the whole model (model.base_model.summary()), the "hiddens" layers do not show up. Why is that? I know "hiddens" is used in the build_q_model function in dqn.py. The code is as follows:
import ray
from ray import tune
import functools
import pickle
import threading
from ray.rllib.agents import dqn
ray.init()
config = dqn.DEFAULT_CONFIG.copy()
config["num_workers"] = 1
config["hiddens"] = [512, 128]
trainer = dqn.DQNTrainer(config=config, env="AirRaid-ram-v0")
model = trainer.get_policy().model
model.base_model.summary()
model.q_value_head.summary()
And the log is:
Model: "model"
Layer (type) Output Shape Param # Connected to
observations (InputLayer) [(None, 128)] 0
fc_1 (Dense) (None, 256) 33024 observations[0][0]
fc_out (Dense) (None, 256) 65792 fc_1[0][0]
value_out (Dense) (None, 1) 257 fc_1[0][0]
Total params: 99,073
Trainable params: 99,073
Non-trainable params: 0
Model: "model_1"
Layer (type) Output Shape Param #
model_out (InputLayer) [(None, 256)] 0
lambda (Lambda) [(None, 6), (None, 6, 1), 198022
Total params: 198,022
Trainable params: 198,022
Non-trainable params: 0