Open ErwinLiYH opened 4 months ago
I have located the bug. The function `__process_resetted_obs_for_eval` of class `EnvRunnerV2` in `ray/rllib/evaluation/env_runner_v2.py` does not handle the raw info dict from the reset operation correctly. This function processes the obs and info returned by the reset operation, but it only extracts the agent obs from a structure like:
{
    env_id: {
        agent_id: agent_obs
        ......
    },
}
We need to extract the info dict as well. Should I create a pull request to fix it?
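To illustrate the idea, here is a minimal sketch of extracting the per-agent info alongside the per-agent obs. It is not the actual RLlib code: the helper name `split_reset_output` and its arguments are hypothetical, and it assumes the infos returned by reset are nested the same way as the obs (`{env_id: {agent_id: ...}}`).

```python
from typing import Any, Dict, Tuple

EnvID = int
AgentID = str


def split_reset_output(
    resetted_obs: Dict[EnvID, Dict[AgentID, Any]],
    resetted_infos: Dict[EnvID, Dict[AgentID, Any]],
) -> Dict[Tuple[EnvID, AgentID], Tuple[Any, Dict[str, Any]]]:
    """Pair each agent's reset obs with that agent's own info dict.

    Hypothetical helper, not part of RLlib. Assumes infos are nested
    like obs: {env_id: {agent_id: info}}.
    """
    per_agent = {}
    for env_id, agent_obs in resetted_obs.items():
        # Infos for this env; may be missing entirely.
        env_infos = resetted_infos.get(env_id, {})
        for agent_id, obs in agent_obs.items():
            # Take only this agent's info, not the whole nested dict.
            # Fall back to an empty dict if the agent has no info.
            info = env_infos.get(agent_id, {})
            per_agent[(env_id, agent_id)] = (obs, info)
    return per_agent


if __name__ == "__main__":
    obs = {0: {"agent0": [0.0, 1.0]}}
    infos = {0: {"agent0": {"step": 0}}}
    print(split_reset_output(obs, infos))
```

The point of the sketch is that the info stored for the first timestep should be the per-agent dict, unwrapped from the env and agent levels in the same way as the obs.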
What happened + What you expected to happen
I am trying to write a custom policy with a postprocess_trajectory method that post-processes infos. However, after one training iteration, the infos in the raw sample batch passed to postprocess_trajectory are abnormal: the first info dict becomes the following, while the info dicts from the second to the last are correct.
I wrote a dummy env that returns info like:
The infos of the sample batch passed to postprocess_trajectory after the first training iteration are:
This is unexpected behaviour. The code to reproduce this problem is as follows.
Versions / Dependencies
Ray: 2.32.0, Python: 3.10.14, OS: Ubuntu 20.04
Reproduction script
I also tested it in the MARL setting, where the first info dict becomes:
and the info dicts of the whole input sample batch are:
It seems the first info dict has not been transferred to the `SampleBatch`; it is still the `MultiAgentBatch`.
Issue Severity
High: It blocks me from completing my task.