Hey! Thanks a lot for this, it is also something that bothers me a lot.
Let's take navigation as an example, but these points should extend to other scenarios.
In Navigation, the reward function is designed such that the total reward is the same for any point that is equidistant from the goal. Since the reward is the difference between the agent's distance to the goal at the current timestep and at the prior one, any movement of $x$ meters in the direction of the goal is rewarded the same amount $\propto x$ (regardless of the absolute position). This means that you can draw concentric circumferences around the goal and you are guaranteed that any 2 points on the same circumference will have the same total reward, regardless of how long you take to get there.
Thus, 2 agents that start in the same position, arrive at the same goal, and have the same collisions, but take different amounts of time, will have the same total reward. This means that your case, where the total reward of B is greater than A because B is slower, should not happen.
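To make the telescoping concrete, here is a minimal sketch of such a difference-of-distances reward (my own simplification, not the actual VMAS code): the total reward depends only on the start and end distances, not on the path taken or the number of steps.

```python
import numpy as np

def shaping_reward(pos_prev, pos_curr, goal):
    """Reward = decrease in distance to the goal this step (sketch of the Navigation-style reward)."""
    d_prev = np.linalg.norm(goal - pos_prev)
    d_curr = np.linalg.norm(goal - pos_curr)
    return d_prev - d_curr

# Per-step rewards telescope: the total is (starting distance - final distance),
# so a direct path and a winding path that end in the same place earn the same total.
goal = np.array([0.0, 0.0])
direct = [np.array([3.0, 0.0]), np.array([1.5, 0.0]), np.array([0.0, 0.0])]
winding = [np.array([3.0, 0.0]), np.array([0.0, 3.0]), np.array([2.0, 0.0]), np.array([0.0, 0.0])]

for path in (direct, winding):
    total = sum(shaping_reward(p, q, goal) for p, q in zip(path, path[1:]))
    print(total)  # both print 3.0
```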
This is confirmed by looking at "collection/agents/reward/episode_reward_mean" in the VMAS Navigation benchmarks (https://wandb.ai/matteobettini/benchmarl-public?nw=nwusermatteobettini), where slow and fast agents all get the same total reward.
This is just to say that a good starting point is a good reward function. Functions that penalize the agent proportionally to its distance from the goal (like in MPE) might encode an implicit time component, but they are terrible for learning, as movement towards the goal is rewarded differently at different absolute positions.
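For contrast, here is a minimal sketch of an MPE-style penalty on the absolute distance (again a simplification, not the actual MPE code). It shows the implicit time component: every extra step spent away from the goal adds another penalty, and the same step towards the goal is rewarded differently at different absolute positions.

```python
import numpy as np

def mpe_style_reward(pos_curr, goal):
    """Penalty proportional to the current distance to the goal (MPE-like sketch)."""
    return -np.linalg.norm(goal - pos_curr)

goal = np.array([0.0, 0.0])
fast = [np.array([2.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 0.0])]
slow = [np.array([2.0, 0.0]), np.array([2.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 0.0])]

for path in (fast, slow):
    total = sum(mpe_style_reward(p, goal) for p in path[1:])
    print(total)  # fast: -1.0, slow: -3.0 -> the slower agent accumulates more penalty
```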
If you look at "collection/agents/reward/reward_mean" in the VMAS Navigation benchmarks (https://wandb.ai/matteobettini/benchmarl-public?nw=nwusermatteobettini), you can then see that, as you say, some approaches (the ones with discrete actions) get a smaller instantaneous reward. By looking at "eval/reward/episode_len_mean" you can see that this is because they take more steps.
The faster approaches take fewer steps and get a high reward at each step; the slower approaches take more steps and get a lower reward at each step.
Therefore, in this case, the mean reward is able to show the time efficiency (as you said in the comment).
The big question here is: why do we need time penalties?
Are they needed for better learning? To this question I would answer: not really. In my experience, penalising the agents over time does not really lead to better policies. Time efficiency in RL is already encoded in the discount factor. Time penalties would reinforce this... but then the discount factor would be applied to them too... making them smaller over time... which is the opposite of what they are meant to express.
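To spell this out: with a constant per-step penalty $-c$ and discount factor $\gamma$, the discounted contribution of the penalty over $T$ steps is

$$\sum_{t=0}^{T-1} \gamma^{t}(-c) = -c\,\frac{1-\gamma^{T}}{1-\gamma},$$

which stays bounded by $-\frac{c}{1-\gamma}$ no matter how long the episode lasts. The discounted cost of an extra step taken late in the episode is therefore almost zero, which is the opposite of what a time penalty is meant to express.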
I agree that they would make time inefficiency show up in the total reward, but it seems a bit convoluted to me to change the reward function (which is needed to distill behaviors) just to resolve a reporting issue.
In the current state, as I hope is clear in the wandb, the total reward, episode length, and mean reward are all needed to get a clear picture of task performance. They allow a separate look at task efficiency (episode_reward), time efficiency (episode length), and a mix of the two (mean reward). If we added time penalties, the time penalty would mix in with the task rewards and we would lose this clean separation of task efficiency.
In brief, I agree that currently the total reward does not show time inefficiencies and that episode length is needed to disentangle algorithms on that front. But adding time penalties would change the meaning of task success IMO and make us lose a good way to disentangle time and task performance. Plus, it would change the RL objective, and I am not sure it would be for the best.
This is something very interesting, so I am looking forward to hearing what you think and discussing it. By no means does this answer mean that we should not do it; it is just some thoughts I had.
PS: Also, note that this could be task dependent. There are scenarios in VMAS that have time as part of their reward (meaning that it is considered part of the task success).
Regarding your comment on the performance of different algorithms differing according to different metrics: this should not happen.
In the worst case, two algorithms would get the same total reward, and then you would have to look at episode length to disentangle them. But it should never happen that their ranking switches depending on which reward you plot (total or mean).
Thanks for the detailed response and interesting points, @matteobettini! :rocket:
If I understand correctly, agents taking longer, winding routes will get the same total reward as those taking a direct route. I see the reasoning here — introducing time could lead to inconsistent rewards and alter the objective, and the discount factor already partially addresses time.
However, since time often feels like part of task success to me, I think incorporating it in a dense reward signal could make sense. Given we're sampling across many environments and initial states, the agent should still maximize returns without being too biased by potential time inconsistencies. It may not be optimal, but it seems reasonable to me.
For now, as you suggest, looking at episode lengths and mean rewards could give a clearer picture. Logging discounted cumulative reward might be a better single metric to report.
I’ll double-check my results to ensure the rankings remain consistent across metrics. :+1:
I’ll definitely add logging of discounted rewards to BenchMARL.
When it comes to adding the time penalty, I think this is a subjective thing.
I would let scenario creators decide this themselves rather than having a centralized choice from VMAS (also because they might want to tune its magnitude in the task).
What do you think? Also, adding it in a centralized way now would change all the existing tasks, and I am not super keen on doing that.
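As a hypothetical sketch of what I mean (not an actual VMAS API), the scenario author would own both the choice and the magnitude:

```python
def scenario_reward(dist_prev, dist_curr, time_penalty=0.0):
    # Shaping term: progress towards the goal this step.
    progress = dist_prev - dist_curr
    # Optional per-step cost, chosen and tuned by the scenario author (0.0 by default).
    return progress - time_penalty
```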
Sounds good :+1:
In certain envs, like navigation, there is no time penalty.
This means comparing algorithms becomes quite tricky. For example, I have two algorithms A and B. When looking at the total reward (return) in an episode, B has slightly better performance, but when looking at mean reward, A has much better performance.
The better total reward (return) result is because B runs for longer and so accumulates more reward. Depending on the metric used, the better algorithm will be different, which is not ideal.
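As a toy illustration with made-up numbers (not my actual results):

```python
# B runs longer and accumulates a slightly higher return, while A is far better per step.
steps_A, mean_A = 50, 0.20     # total return A = 10.0
steps_B, mean_B = 200, 0.055   # total return B = 11.0
print(steps_A * mean_A, steps_B * mean_B)   # 10.0 vs 11.0 -> B wins on return
print(mean_A, mean_B)                       # 0.20 vs 0.055 -> A wins on mean reward
```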
A lot of papers use return, but some use mean reward. I think it would be better if these results were consistent across the two metrics, which adding a time penalty would likely achieve.