[RLlib] Issues with RLModules/Learner + evaluation workers and not using KL loss

edcxan commented 1 year ago

What happened + What you expected to happen

Currently, several bugs and incomplete code paths in the RLModule/Learner stack prevent it from working properly with multi-agent, multi-policy PPO (and possibly other algorithms). I have managed to patch some of them:

In algorithm.py's add_policy, the learner API block needs to be moved below the evaluation workers block; otherwise add_policy always fails if evaluation workers exist. The learner API block runs policy = self.get_policy(policy_id), which then conflicts with policy_cls when the policy is added to the evaluation workers (only one of policy and policy_cls may be given).
In algorithm.py's remove_policy, we need to add a block to remove modules from the learner group as well:

    if self.config._enable_learner_api:
        self.learner_group.remove_module(
            module_id=policy_id,
        )

In ppo_learner.py's remove_module, the two coeff .pop calls can fail if the keys do not exist, for example when no entropy coeff scheduler is used. They should pass a default of None so that missing keys are ignored:

    self.curr_kl_coeffs_per_module.pop(module_id, None)
    self.entropy_coeff_schedulers_per_module.pop(module_id, None)

In ppo_torch_learner.py's additional_update_for_module, assert sampled_kl_values, "Sampled KL values are empty." needs to be moved under an if hps.use_kl_loss: block, since the sampled KL values do not exist when KL loss is not used.
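For illustration, here is a minimal sketch of the last two fixes. It does not use RLlib: ToyPPOLearner, HParams, add_module and the "mean_kl" result key are hypothetical stand-ins; only the two dictionary names, the assertion message and the use_kl_loss flag are taken from the description above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class HParams:
    """Hypothetical stand-in for the per-module PPO hyperparameters."""
    use_kl_loss: bool = True


class ToyPPOLearner:
    """Hypothetical, heavily simplified learner; not the RLlib class."""

    def __init__(self) -> None:
        self.curr_kl_coeffs_per_module: Dict[str, float] = {}
        self.entropy_coeff_schedulers_per_module: Dict[str, object] = {}

    def add_module(self, module_id: str, hps: HParams) -> None:
        # A KL coefficient is only tracked when KL loss is used, and no
        # entropy scheduler is registered in this toy setup.
        if hps.use_kl_loss:
            self.curr_kl_coeffs_per_module[module_id] = 0.2

    def remove_module(self, module_id: str) -> None:
        # pop(key, None) returns None for a missing key instead of raising
        # KeyError, so removal works whether or not the entries exist.
        self.curr_kl_coeffs_per_module.pop(module_id, None)
        self.entropy_coeff_schedulers_per_module.pop(module_id, None)

    def additional_update_for_module(
        self, module_id: str, hps: HParams, sampled_kl_values: List[float]
    ) -> Dict[str, float]:
        results: Dict[str, float] = {}
        if hps.use_kl_loss:
            # Only require KL samples when KL loss is actually enabled.
            assert sampled_kl_values, "Sampled KL values are empty."
            results["mean_kl"] = sum(sampled_kl_values) / len(sampled_kl_values)
        return results


if __name__ == "__main__":
    learner = ToyPPOLearner()
    no_kl = HParams(use_kl_loss=False)
    learner.add_module("policy_0", no_kl)
    learner.remove_module("policy_0")  # no KeyError without the entries
    print(learner.additional_update_for_module("policy_0", no_kl, []))
```

With use_kl_loss=False the update returns an empty result and never touches the (empty) KL samples, while with use_kl_loss=True the assertion still catches missing samples.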

Versions / Dependencies

ray==2.6.3

Reproduction script

Use any of the functions described above with the RLModule/Learner API enabled.

Issue Severity

Low: It annoys or frustrates me.

ArturNiederfahrenhorst commented 1 year ago

Hi @edcxan, these are all valid points. Getting the RLModules / Learner stack working is a high priority for the RLlib team. Can you put up a PR with these items? Ideally it would include a short script that raises errors without your changes and runs cleanly with them; that would be great and would speed up the process. Do you think that would be possible?

Thanks for raising this issue in any case!

simonsays1980 commented 1 year ago

@edcxan, thanks for opening this issue. This is a good one :) The broader take here should be, imo:

Move PPO from RolloutWorker/Policy API to the new EnvRunner (replaces RolloutWorker) + MARLModule (replaces PolicyMap) APIs. See DreamerV3's (albeit single-agent only) EnvRunner under algorithms.dreamerv3.utils.env_runner.py
Make the new EnvRunner:
  - Use gymnasium.vector as the environment API (get rid of RLlib's own quirky Env APIs; again, use DreamerV3 as an example).
  - Use Connectors directly in the EnvRunner (<- this could be phase II).
  - Use DreamerV3's Episode class to store data temporarily (this makes data easily accessible to the EnvRunner for compute-action calls: forward_exploration/inference()).
  - Pass data from the ongoing Episode through Connectors and into RLModules for action computation. The user might configure a custom function that allows them to extract the "correct" data from the Episode given some timestep. This way, we can solve (and get rid of) the conundrum of the TrajectoryViewAPI via a simpler yet more powerful functional API. For example, should the user know that her model requires the last 10 rewards besides the observation, she can write a custom function to extract those data from the ongoing Episode object (and use 0-padding or any other solution for episode-edge cases). (<- this could be phase II)
  - The same happens on the way back to the env: the EnvRunner will use the EnvConnector to pass the computed action back to the environment.
  - Maybe: Should the module return something from its get_internal_state() method, the EnvRunner might automatically handle RNN-state passing into the module's forward methods as well as storing the most recent state for the next call. Again, see DreamerV3's EnvRunner for a working example of such behavior. (<- this could be phase II; phase I w/o LSTM support)
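To make the custom extraction function idea more concrete, here is a rough sketch of the "last 10 rewards plus observation" example with 0-padding at the episode start. ToyEpisode, last_n_rewards and extract_model_input are hypothetical names used only for illustration; this is not DreamerV3's Episode class or any existing RLlib API, and the indexing convention (rewards[i] is the reward received on the step leading to observations[i + 1]) is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ToyEpisode:
    """Hypothetical buffer for an ongoing episode (not DreamerV3's Episode)."""
    # observations[t] is the observation at timestep t.
    observations: List[Any] = field(default_factory=list)
    # Assumed convention: rewards[i] was received when stepping from
    # observations[i] to observations[i + 1].
    rewards: List[float] = field(default_factory=list)


def last_n_rewards(episode: ToyEpisode, t: int, n: int = 10) -> List[float]:
    """Return the n rewards received before timestep t, left-padded with 0.0."""
    window = episode.rewards[max(0, t - n):t]
    padding = [0.0] * (n - len(window))
    return padding + window


def extract_model_input(episode: ToyEpisode, t: int, n: int = 10) -> Dict[str, Any]:
    """User-defined extraction: observation at t plus the previous n rewards."""
    return {
        "obs": episode.observations[t],
        "prev_rewards": last_n_rewards(episode, t, n),
    }


if __name__ == "__main__":
    episode = ToyEpisode(
        observations=["o0", "o1", "o2", "o3"],
        rewards=[1.0, 2.0, 3.0],
    )
    # At the very first timestep no rewards exist yet, so all are padded.
    print(extract_model_input(episode, t=0, n=4))
    # At t=3 three rewards exist; one slot is still zero-padded.
    print(extract_model_input(episode, t=3, n=4))
```

An EnvRunner along the lines proposed here could call such a function on the ongoing episode at each timestep and hand the result to the connectors and the module's forward method, which keeps the choice of padding scheme in user code.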
