AdityaGudimella opened 2 years ago
Hi @AdityaGudimella, thanks for your interest in contributing! We would definitely be interested in the TQC contribution.
Is it okay if I use the new `gym.Env` API, where `env.step` returns `obs, reward, terminated, truncated, info` instead of `obs, reward, done, info`?
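For reference, a minimal sketch of the two `step` signatures (toy classes with placeholder return values, no gym dependency; the names mirror the `gym.Env` API discussed above):

```python
class OldStyleEnv:
    def step(self, action):
        # old API: a single `done` flag conflates the two ways an episode ends
        obs, reward, done, info = 0.0, 1.0, True, {}
        return obs, reward, done, info


class NewStyleEnv:
    def step(self, action):
        # new API: `terminated` = the MDP reached a terminal state,
        # `truncated` = the episode was cut off (e.g. by a time limit)
        obs, reward, terminated, truncated, info = 0.0, 1.0, False, True, {}
        return obs, reward, terminated, truncated, info


obs, reward, terminated, truncated, info = NewStyleEnv().step(0)
done = terminated or truncated  # recover the old-style flag when needed
```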
It's ok, but note the merging timeline may be considerably delayed because of this. Because poetry only supports using a single version of gym, we would need to make sure the other scripts at least run with the latest gym. Judging by the list of things gym has recently broken, it's hard to estimate the amount of work involved... We might be able to merge it sooner with an ad-hoc instruction like `poetry run pip install gym==0.24.1` in the usage section as a temporary solution. See example here.
Is it okay if I deviate slightly from the style followed by the other algorithms in CleanRL? I would like to put the code for calculating the losses and updating the models into a separate class. Everything else will remain the same. This way, if people are familiar with the general RL procedure and only want to see how the core of the algorithm changes, they will only need to look at this class.
I still think it's worth it to adhere to the style of the other scripts for consistency, unless the loss function is re-used multiple times. The problems with putting the loss function in a class are 1) it adds extra lines of code, and 2) readers have to jump to where something like `update_agent` is defined (we don't really do this; we would just literally put a comment like `# update agent`). That said, I am happy to be convinced otherwise; feel free to prototype what you think is best and I will help review.
Hi, @vwxyzjn. Thank you for your response.
> It's ok but note the merging timeline may be considerably delayed because of this. Because poetry only supports using a single version for gym, we would need to make sure other scripts at least run with the latest gym. Judging by the list of things gym recently breaks, it's hard to estimate the amount of work involved... We might be able to merge it sooner with an ad-hoc instruction like `poetry run pip install gym==0.24.1` in the usage section as a temporary solution. See example here
That makes sense to me. I had to update gym from 0.23 to 0.25 in my own library, but it did not require too many changes. Out of curiosity, if I put up a PR (separately) just for upgrading the gym version to 0.25.1, what would be required to merge it in? Would you be running all the algorithms currently present to convergence?
For now, I can see that TQC performs quite well even without differentiating between terminated and truncated episodes. I can put up a PR first that uses the dones from the gym env as-is. Once the rest of the PR is approved, I can change it to differentiate between terminated and truncated. Is that okay?
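To be concrete about what the differentiation changes: only the bootstrap mask in the TD target should use `terminated`, since a truncated episode could have continued and the next state still has value. A minimal sketch with plain floats (`td_target` is an illustrative function, not the actual TQC code):

```python
def td_target(reward, next_value, terminated, gamma=0.99):
    """One-step TD target; bootstrap only when the episode did NOT
    reach a true terminal state."""
    bootstrap = 0.0 if terminated else 1.0
    return reward + gamma * bootstrap * next_value


# Using `done = terminated or truncated` as the mask (the old gym-style
# shortcut) would also zero the bootstrap term at time-limit cutoffs.
target_mid_episode = td_target(1.0, 2.0, terminated=False)  # bootstraps
target_terminal = td_target(1.0, 2.0, terminated=True)      # no bootstrap
```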
> I still think it's worth it to adhere to the styles in other scripts for consistency unless the loss function is re-used multiple times. The problems with putting loss function in the class are ...
For your point 1, I don't think the class will have any extra lines of code. If I pass the `args` SimpleNamespace to the class constructor, the number of lines should be exactly the same as in the other version. For point 2, while I can see some people having to jump between the two, for me it makes it much clearer to put the loss into a separate function / class. That being said, while I would prefer to put it in a separate class, I'm fine with following the format of the other algorithms too. I will first put up a version that has the loss in a separate class, and once you feel that there aren't any mistakes in it, I will refactor it to the format followed by the other algorithms. That way I don't have to worry that my current implementation itself is wrong (it does reproduce, and in some cases improve upon, the performance given in the paper, but you never know with RL). Is that okay?
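Concretely, the structure I have in mind looks something like this (all names here are hypothetical sketches, not CleanRL's or TQC's actual API):

```python
from types import SimpleNamespace


class TQCUpdater:
    """Keeps the loss computation and model updates in one place, so readers
    who already know the generic RL loop only need to read this class."""

    def __init__(self, args: SimpleNamespace):
        # the parsed CLI args come in as-is, so this adds no extra lines
        # compared to the flat-script version
        self.gamma = args.gamma
        self.tau = args.tau

    def critic_loss(self, batch):
        # quantile Huber loss over the truncated mixture of quantiles
        raise NotImplementedError

    def update(self, batch):
        # compute losses, backprop, and soft-update the target networks
        raise NotImplementedError


args = SimpleNamespace(gamma=0.99, tau=0.005)
updater = TQCUpdater(args)
```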
I might have a few more questions while I write the PR. Is it okay to ask them here?
> Out of curiosity, if I put up a PR (separately) just for upgrading the gym version to 0.25.1, what would be required to do to merge it in?
The main thing would be to ensure the scripts run without error.
> Would you be running all the algorithms currently present to convergence?
No, this would not be required unless we are highly suspicious about a particular environment or script.
> For now, I can see that TQC even without differentiating between terminated and truncated episodes, performs quite well. I can put up a PR first that uses the dones from the gym env as is. Once the rest of the PR is approved, I can change it to differentiate between terminated and truncated. Is that okay?
Yes, this is ok :)
> For your point 1., I don't think the class will have any extra lines of code. If I pass in the args SimpleNamespace to the class constructor, the number of lines should be exactly the same as the other version. For point number 2. while I can see some people having to differentiate between the two, for me it makes it much more clearer to put the loss into a separate function / class. That being said, while I would prefer to put it in a separate class, I'm fine with following the format followed by the other algorithms too. I will first put up a version that has the loss in a separate class, and once you feel that there aren't any mistakes in it, I will refactor it to the format followed by the other algorithms. That way I don't have to be worried that my current implementation itself is wrong (it does reproduce - and in some cases improve upon, the performance as given in the paper, but you never know with RL). Is that okay?
Perfectly reasonable.
> I might have a few more questions while I write the PR. Is it okay to ask them here?
Here would be a perfect place to ask. Discord is also good.
I put up an initial PR here. I've tested it on Hopper-v3 and Humanoid-v3. Please take a look.
I also created a new issue for the gym upgrade here.
Problem Description
Would you be interested in adding Truncated Quantile Critics to CleanRL? If so I can work on a PR.
If you are interested, then I have a few questions:
Is it okay if I use the new `gym.Env` API, where `env.step` returns `obs, reward, terminated, truncated, info` instead of `obs, reward, done, info`?

Checklist
`poetry install` (see CleanRL's installation guideline).