simonsays1980 opened 1 year ago
@gjoliver As you are fluent in the connectors, what could be the best solution here?
Hey @simonsays1980, thanks for opening this issue. This is a good one :) The broader take here should be, imo:
- `EnvRunner`s pass data from the ongoing Episode through Connectors and into RLModules for action computation (`forward_exploration/inference()`); see `algorithms.dreamerv3.utils.env_runner.py` for a working example.
- The user might configure a custom function that allows them to extract the "correct" data from the Episode given some timestep. This way, we can solve (and get rid of) the conundrum of the TrajectoryViewAPI via a simpler, yet more powerful, functional API. For example, if the user knows that her model requires the last 10 rewards besides the observation, she can write a custom function to extract those data from the ongoing Episode object (and use 0-padding or any other solution for episode-edge cases); see the first sketch after this list. (<- this could be phase II)
- The same happens on the way back to the env: the EnvRunner will use the EnvConnector to pass the computed action back to the environment.
- Maybe: should the module return something from its `get_internal_state()` method, the EnvRunner might automatically handle RNN-state passing into the module's forward methods, as well as storing the most recent state for the next call; see the second sketch after this list. Again, see DreamerV3's EnvRunner for a working example of such behavior. (<- this could be phase II; phase I w/o LSTM support)
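To make the custom-extractor idea concrete, here is a minimal sketch. The `Episode` attribute names (`observations`, `rewards`) and the returned keys are illustrative assumptions, not the actual RLlib API:

```python
import numpy as np

# Hypothetical user-defined extractor; `episode.observations` and
# `episode.rewards` are assumed attribute names for illustration only.
def extract_model_inputs(episode, t, num_rewards=10):
    """Builds the model input at timestep `t` from the ongoing episode."""
    obs = episode.observations[t]
    # Take the last `num_rewards` rewards, zero-padding at the episode edge.
    rewards = np.asarray(episode.rewards[max(0, t - num_rewards):t], dtype=np.float32)
    if rewards.shape[0] < num_rewards:
        pad = np.zeros(num_rewards - rewards.shape[0], dtype=np.float32)
        rewards = np.concatenate([pad, rewards])
    return {"obs": obs, "prev_rewards": rewards}
```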
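And a minimal sketch of the automatic state handling from the last point; all method and key names here (`get_initial_state`, `forward_inference`, `"state_in"`/`"state_out"`) are illustrative assumptions as well:

```python
# Sketch of an EnvRunner-style loop that stores the module's most recent
# state and feeds it back in on the next forward call (gymnasium-style env).
def run_episode(env, module):
    obs, _ = env.reset()
    state = module.get_initial_state()  # empty for stateless modules
    terminated = truncated = False
    while not (terminated or truncated):
        out = module.forward_inference({"obs": obs, "state_in": state})
        # Keep the returned state (if any) around for the next step.
        state = out.get("state_out", state)
        obs, _, terminated, truncated, _ = env.step(out["actions"])
```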
What happened + What you expected to happen
What happened
When running a simple `PPO` training and then doing inference, an error occurs. Debugging the code showed that the problem is that the default `view_requirements` of the `PPO` policy contain `"prev_rewards"` with `shift_arr=array([0])`:

- The `agent_collector` is empty and gets initialized; specifically, the `"rewards"` get added via `get_dummy_batch_for_space()` in the `add_init_obs()` method of the collector. This is what you see in the example code below in the first print-out of `buffers["rewards"]`.
- Then the `add_action_reward_next_obs()` method of the collector gets called and does not ensure that `"prev_rewards"` is present.
- Passing the `reward` in the `local_policy_inference()` function works around this (see comments), but then the same problem occurs with the `"agent_id"`.
- One could add the `"agent_index"` to the data batch in `local_policy_inference()` - it is already provided by the user in the call - but this needs a modification of the function.

Nevertheless, is this something we should provide in inference? Imo it would be easier to provide this by default, as the user usually does not consider the default view requirements of the policy.
What I expected to happen
That no more than the `obs` needs to be provided in inference of a non-stateful policy.

Versions / Dependencies
Linux Fedora 37, Python 3.9.12, Ray nightly (September 1st 2023, 10:15)
Reproduction script
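The original script is not reproduced here; the following is a minimal sketch of the described setup (a short `PPO` training, then `local_policy_inference()` with only the observation), assuming Ray 2.x; the exact `local_policy_inference()` signature may differ between nightlies:

```python
import gymnasium as gym

from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.utils.policy import local_policy_inference

# Train a simple PPO policy on CartPole with connectors enabled.
config = (
    PPOConfig()
    .environment("CartPole-v1")
    .framework("torch")
    .rollouts(enable_connectors=True)
)
algo = config.build()
algo.train()

policy = algo.get_policy()
env = gym.make("CartPole-v1")
obs, _ = env.reset()

# Providing only the observation triggers the "prev_rewards" problem
# described above, since the policy's default view requirements expect it.
actions = local_policy_inference(
    policy,
    env_id="env_0",
    agent_id="agent_0",
    obs=obs,
)
print(actions)
```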
Issue Severity
Medium: It is a significant difficulty but I can work around it.