kevslinger / DTQN

Deep Transformer Q-Networks for Partially Observable Reinforcement Learning
MIT License

Replacing `done` with `truncated` and `terminated` #10

Open ashok-arora opened 2 weeks ago

ashok-arora commented 2 weeks ago

Hey Kevin, I hope you are doing well. I noticed a small bug where the step function returns only `obs, reward, done, info` instead of `obs, reward, terminated, truncated, info`. I came across this article from Gymnasium that emphasised the need for both `terminated` and `truncated`. Can I help in updating the codebase?

kevslinger commented 2 weeks ago

Hi Ashok,

This is a somewhat complicated issue. In the older versions of gym, the `truncated` flag was stored inside `info` (see https://github.com/kevslinger/DTQN/blob/79ccf8b548a2f6263b770e051b42ced2932137ee/run.py#L371). The new gym version with five return values is a breaking change, and updating to it would mean one could no longer run DTQN on the original environments. At the same time, leaving the repo in its current state means you cannot run DTQN on "modern" (updated) environments.
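For reference, the difference between the two step APIs looks roughly like this (illustrative snippet, not code from this repo):

```python
def step_old_api(env, action):
    # Old gym (pre-0.26): step returns 4 values; truncation is tucked into info.
    obs, reward, done, info = env.step(action)
    truncated = info.get("TimeLimit.truncated", False)
    return obs, reward, done, truncated, info


def step_new_api(env, action):
    # New gym (0.26+) / Gymnasium: step returns 5 values; truncation is explicit.
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
    return obs, reward, done, truncated, info
```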

With that in mind, there are a few ways we can proceed with this:

1. Do nothing, and leave the repo in its current (outdated) state.
2. Update the repo to account for the new version of gym, while keeping the paper branch in its original, old form. Experiments could then be run on the main branch with new environments, and on the paper branch with old environments.
3. Create a new branch from the main branch and/or paper branch to accommodate the new gym environments, and let people switch between branches based on the version of their environment.
4. Come up with a gym wrapper to interface between the two styles of environments. We could either wrap the old environments so that they return `obs, reward, terminated, truncated, info` instead of `obs, reward, done, info`, and then update the codebase accordingly, OR we could wrap the new environments so that they return `obs, reward, done, info` and put `truncated` inside `info`, the same as the old method (see the sketch below).

Probably option 4 is the best way forward. Let me know what you think. I will, of course, welcome pull requests towards this issue.
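To make the second variant of option 4 concrete, a minimal (untested) sketch could look something like the following; the wrapper name is just a placeholder, and it assumes the new environments follow the Gymnasium API:

```python
import gymnasium as gym  # assuming the "new" environments are Gymnasium-based


class OldStepAPIWrapper(gym.Wrapper):
    """Placeholder wrapper: make a 5-tuple (new-style) env look like an
    old 4-tuple env by folding `truncated` back into `info`."""

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        # Mimic the old TimeLimit behaviour: stash truncation inside info.
        info["TimeLimit.truncated"] = truncated and not terminated
        return obs, reward, terminated or truncated, info

    def reset(self, **kwargs):
        # New-style reset returns (obs, info); old-style returns just obs.
        obs, _info = self.env.reset(**kwargs)
        return obs
```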

ashok-arora commented 1 week ago

Thank you for the detailed response.

Yeah, the fourth option seems like the most straightforward way, since splitting branches may lead to confusion for future users. To maintain compatibility, we could add an arg like `--gym old` (returns `obs, reward, done, info`) or `--gym new` (returns `obs, reward, terminated, truncated, info`) that handles it accordingly, with `--gym old` being the default to stay compatible with the paper branch.
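Something like this is what I have in mind for the flag (flag name and wiring are just a sketch, open to change):

```python
import argparse

parser = argparse.ArgumentParser()
# Hypothetical flag; the exact name/spelling is up for discussion.
parser.add_argument(
    "--gym",
    choices=["old", "new"],
    default="old",  # old API by default, so paper-branch behaviour is unchanged
    help="Whether the environment uses the old 4-tuple or new 5-tuple step API.",
)
args, _ = parser.parse_known_args()

# After the environment is created by the existing setup code, something like:
# if args.gym == "new":
#     env = OldStepAPIWrapper(env)  # wrapper sketched above
```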

On a side note, I am wondering how https://github.com/kevslinger/DTQN/blob/79ccf8b548a2f6263b770e051b42ced2932137ee/run.py#L371 determines the maximum time after which an episode is truncated. Would it make more sense to define the time limit ourselves, depending on the velocity and the distance between heaven and hell?

kevslinger commented 1 week ago

That sounds like a good idea; we can use the `gym-version` flag to determine whether we need to wrap the environment in a gym wrapper that shrinks the returned tuple from five values to four.

As for the `TimeLimit.truncated` part, that is a bit of old gym magic: by supplying the `max_episode_steps` argument when registering an environment with gym, the backend wraps the environment in its `TimeLimit` wrapper, which ends the episode and sets `TimeLimit.truncated: True` in `info` as part of the return of the step function. For example, https://github.com/kevslinger/DTQN/blob/79ccf8b548a2f6263b770e051b42ced2932137ee/envs/__init__.py#L47 means the episode will always end after 200 steps. I don't think we need to do anything to change that.
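Roughly, the registration looks like this (the id and entry point here are illustrative, not copied from `envs/__init__.py`):

```python
from gym.envs.registration import register

register(
    id="HeavenHell-v0",                          # illustrative id
    entry_point="envs.heaven_hell:HeavenHell",   # illustrative entry point
    max_episode_steps=200,
)

# With that cap, gym wraps the env in its TimeLimit wrapper: once step 200 is
# reached, the wrapper returns done=True and (in old gym) puts
# info["TimeLimit.truncated"] = True, so run.py can tell a time-out apart from
# a genuine terminal state.
```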

ashok-arora commented 1 week ago

> That sounds like a good idea; we can use the `gym-version` flag to determine whether we need to wrap the environment in a gym wrapper that shrinks the returned tuple from five values to four.

Okay, sure, I'll send in a PR with the `gym-version` flag.

> As for the `TimeLimit.truncated` part, that is a bit of old gym magic: by supplying the `max_episode_steps` argument when registering an environment with gym, the backend wraps the environment in its `TimeLimit` wrapper, which ends the episode and sets `TimeLimit.truncated: True` in `info` as part of the return of the step function. For example, https://github.com/kevslinger/DTQN/blob/79ccf8b548a2f6263b770e051b42ced2932137ee/envs/__init__.py#L47 means the episode will always end after 200 steps. I don't think we need to do anything to change that.

Also, what is the basis for the choice of 200 for `max_episode_steps`, and similarly for the values chosen for the other environments?

kevslinger commented 1 week ago

In general, we want to give the agent the ability to successfully end an episode while exploring, so `max_episode_steps` needs to be large enough that this is possible. However, we don't want episodes to go on forever, as that could fill the replay buffer with many useless experiences. This is not a value I spent a lot of time tuning.