Hi, thank you for sharing your great work! I have a question about the implementation of policy and value networks.
In the paper, both the policy and value networks are described as having a shared backbone across all policies, conditioned on the policy ID by appending the ID to the input. However, when I look into rl_games/rl_games/algos_torch/models.py, the policy networks seem to be implemented as independent networks, and the value network appears to be fully shared without any conditioning on the policy ID. This seems to differ from the approach described in the paper.
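For reference, this is a minimal sketch of what I understood the paper to describe: a shared backbone with a one-hot policy ID appended to the observation. The class and argument names here are my own illustration, not taken from the repo:

```python
import torch
import torch.nn as nn

class SharedIDConditionedNet(nn.Module):
    """Illustrative only: shared backbone conditioned on a one-hot policy ID.
    Names are hypothetical, not from the SAPG / rl_games code."""

    def __init__(self, obs_dim, num_policies, hidden_dim, out_dim):
        super().__init__()
        self.num_policies = num_policies
        # Backbone shared by all policies; the policy ID is part of the input.
        self.backbone = nn.Sequential(
            nn.Linear(obs_dim + num_policies, hidden_dim),
            nn.ELU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ELU(),
        )
        self.head = nn.Linear(hidden_dim, out_dim)

    def forward(self, obs, policy_id):
        # policy_id: (batch,) integer tensor identifying which policy acts.
        one_hot = nn.functional.one_hot(policy_id, self.num_policies).float()
        x = torch.cat([obs, one_hot], dim=-1)
        return self.head(self.backbone(x))
```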
I would appreciate clarification on whether this change was intentional or if there is another location in the code where the ID-based conditioning is implemented.
Additionally, it seems that only ModelA2CContinuousLogStd is used for both PPO and SAPG, rather than ModelMultiA2CContinuousLogStd. Could you tell me how the SAPG algorithm is implemented?
Naoki