jurgisp / memory-maze

Evaluating long-term memory of reinforcement learning algorithms
MIT License

Mean Max Reward #31

Closed dirkmcpherson closed 5 months ago

dirkmcpherson commented 6 months ago

Hi! You report mean max reward in your paper for 1000 steps at 4 actions / sec (4 Hz). If I run this environment for 1000 steps at 1 action / sec (1 Hz), is it fair to assume the mean max rewards will be a fourth of what you've reported?

Great environment! Thanks!

jurgisp commented 5 months ago

Actually, I think it will be 4x, not a fourth.

The reason is that with 1000 steps at 4 Hz the episode runs for 250 sec, while with 1000 steps at 1 Hz it will run for 1000 sec. Since the speed at which the agent moves around is constant in "physics time", that gives the agent 4x more time to collect rewards.

I would suggest decreasing the episode length to 250 steps, so that it stays the same in seconds and you can compare with the baseline results.

Also note that the environment becomes easier at 1 Hz, because the agent needs a shorter memory in terms of steps, so perhaps you should get an even higher score than the current baselines (but not higher than the oracle :)
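
A quick sanity check of the timing arithmetic (plain Python, no memory-maze API involved):

```python
def episode_seconds(num_steps, control_freq_hz):
    """Episode duration in physics-seconds: steps divided by control frequency."""
    return num_steps / control_freq_hz

print(episode_seconds(1000, 4))  # 250.0 s  -- the paper's baseline setup
print(episode_seconds(1000, 1))  # 1000.0 s -- 4x more time to collect rewards
print(episode_seconds(250, 1))   # 250.0 s  -- matches the baseline duration again
```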

dirkmcpherson commented 5 months ago

Ah, I see. I misread the paper and interpreted 4 Hz as running with action repeat 4. I am not doing nearly as well as I thought.

So when I give an action to the default MemoryMaze environment, the agent follows that action for 1/4 of a physics-second?

jurgisp commented 5 months ago

Yes, 4Hz means 4 actions per physics-second, so each action (which translates into a torque in MuJoCo) is applied for 0.25s.
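
For concreteness, a minimal interaction sketch (the `MemoryMaze-9x9-v0` gym ID is the one from the repo README; the timing comments just restate the 4 Hz arithmetic above):

```python
import gym

# Default control frequency is 4 Hz: each env.step() advances physics by 0.25 s.
env = gym.make('memory_maze:MemoryMaze-9x9-v0')

obs = env.reset()
for _ in range(1000):  # 1000 steps at 4 Hz = 250 s of physics time
    action = env.action_space.sample()  # random policy, just to illustrate stepping
    obs, reward, done, info = env.step(action)  # this one action is applied for 0.25 s
    if done:
        obs = env.reset()
```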

jurgisp commented 4 months ago

I just realized a problem you might have at 1 Hz (i.e. executing each action for 1 sec): that's probably too big a step for turning with enough precision. Imagine pressing left or right: if 1/4 sec turns the agent by ~15 degrees (ballpark, I am not sure what it is exactly), then turning for 1 sec would turn it by ~60 degrees, which would make it hard for the agent to navigate.
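
To put rough numbers on that (the 15-degree figure is the ballpark from above, not a measured value):

```python
# Turn angle per action scales linearly with how long the action is held.
DEG_PER_QUARTER_SEC = 15.0  # ballpark turn per 0.25 s action, per the estimate above

def turn_per_action(control_freq_hz):
    action_duration = 1.0 / control_freq_hz  # seconds each action is applied
    return DEG_PER_QUARTER_SEC * action_duration / 0.25

print(turn_per_action(4))  # ~15 deg per action at 4 Hz
print(turn_per_action(1))  # ~60 deg per action at 1 Hz -- too coarse to aim precisely
```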

dirkmcpherson commented 4 months ago

Hi! I actually misunderstood your environment Hz as being action repeat. I thought your max reward was reported for the baseline environment with action_repeat=4, rather than with physics running at 4 Hz.

You're right, though: when I ran your environment with action_repeat=4 it was much harder to navigate, which probably should've tipped me off to my misunderstanding :)