vwxyzjn / cleanrl

High-quality single file implementation of Deep Reinforcement Learning algorithms with research-friendly features (PPO, DQN, C51, DDPG, TD3, SAC, PPG)
http://docs.cleanrl.dev

Add TQC to CleanRL #258

Open AdityaGudimella opened 2 years ago

AdityaGudimella commented 2 years ago

Problem Description

Would you be interested in adding Truncated Quantile Critics to CleanRL? If so I can work on a PR.
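For context, the step that gives TQC its name: each critic in an ensemble predicts a set of quantile "atoms" of the return distribution; the atoms from all target critics are pooled, sorted, and the largest ones are dropped before building the TD target, which controls overestimation bias. A minimal sketch of just that step (shapes and the function name are illustrative, not from any existing CleanRL script):

```python
import torch


def truncated_target_atoms(next_atoms: torch.Tensor, drop_per_critic: int = 2) -> torch.Tensor:
    """Pool, sort, and truncate the quantile atoms of an ensemble of critics.

    next_atoms: (batch, n_critics, n_quantiles) atoms from the target critics.
    Returns the smallest n_critics * (n_quantiles - drop_per_critic) atoms.
    """
    batch_size, n_critics, n_quantiles = next_atoms.shape
    # Pool the atoms of all critics into one empirical distribution and sort.
    pooled = next_atoms.reshape(batch_size, -1).sort(dim=1).values
    # Drop the largest atoms; removing the most optimistic atoms reduces
    # overestimation in the TD target.
    keep = n_critics * (n_quantiles - drop_per_critic)
    return pooled[:, :keep]
```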

If you are interested, then I have a few questions:

  1. Is it okay if I use the new gym.Env API where env.step returns obs, reward, terminated, truncated, info instead of obs, reward, done, info? (See the sketch after this list.)
  2. Is it okay if I deviate slightly from the style followed by the other algorithms in CleanRL? I would like to put the code for calculating the losses and updating the models into a separate class. Everything else will remain the same. This way, if people are familiar with the general RL procedure, and only want to see how the core of the algorithm changes, they will only need to look at this class.
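For clarity, this is the API difference I mean in question 1 (a sketch assuming a gym version that ships the new step API, e.g. 0.26+):

```python
import gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)

# New API: termination (reaching a terminal state of the MDP) and
# truncation (e.g. a time limit cutting the episode short) are reported
# separately.
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())

# The old API returned a single flag that conflated the two cases:
#   obs, reward, done, info = env.step(action)
# The distinction matters for bootstrapping: a truncated episode should
# still bootstrap from the value of the next state.
done = terminated or truncated
```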


vwxyzjn commented 2 years ago

Hi, @AdityaGudimella, thanks for your interest in contributing! We would definitely be interested in a TQC contribution.

Is it okay if I use the new gym.Env API where env.step returns obs, reward, terminated, truncated, info instead of obs, reward, done, info?

It's ok, but note that the merging timeline may be considerably delayed because of this. Since poetry only supports a single version of gym, we would need to make sure the other scripts at least run with the latest gym. Judging by the list of things recent gym releases break, it's hard to estimate the amount of work involved... We might be able to merge it sooner with an ad-hoc instruction like poetry run pip install gym==0.24.1 in the usage section as a temporary solution. See example here

Is it okay if I deviate slightly from the style followed by the other algorithms in CleanRL? I would like to put the code for calculating the losses and updating the models into a separate class. ...

I still think it's worth adhering to the style of the other scripts for consistency, unless the loss function is re-used multiple times. The problems with putting the loss function in a class are:

  1. the class will have many more lines of code
  2. the overhead of jumping from the bottom of the script to see how update_agent is defined (we don't really do this; we would just literally put a comment like # update agent)

That said, I am happy to be convinced otherwise; feel free to prototype what you think is best and I will help review.

AdityaGudimella commented 2 years ago

Hi, @vwxyzjn. Thank you for your response.

It's ok, but note that the merging timeline may be considerably delayed because of this. Since poetry only supports a single version of gym, we would need to make sure the other scripts at least run with the latest gym. ...

That makes sense to me. I had to update gym from 0.23 to 0.25 in my own library, and it didn't require too many changes. Out of curiosity, if I put up a PR (separately) just for upgrading the gym version to 0.25.1, what would be required to merge it in? Would you be running all the algorithms currently present to convergence?

For now, I can see that TQC, even without differentiating between terminated and truncated episodes, performs quite well. I can put up a PR first that uses the dones from the gym env as is. Once the rest of the PR is approved, I can change it to differentiate between terminated and truncated. Is that okay?
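To make the difference concrete, here is a hedged sketch (not code from the PR) of how the Bellman target changes once the two flags are separated:

```python
import torch


def td_target(rewards, next_values, terminated, gamma=0.99):
    # Correct handling: only true termination zeroes the bootstrap term;
    # a time-limit truncation should still bootstrap from next_values.
    return rewards + gamma * next_values * (1.0 - terminated)


def td_target_from_done(rewards, next_values, done, gamma=0.99):
    # Interim plan above: done = terminated or truncated, so the bootstrap
    # is also dropped on truncation, a small bias TQC seems to tolerate.
    return rewards + gamma * next_values * (1.0 - done)
```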

I still think it's worth adhering to the style of the other scripts for consistency, unless the loss function is re-used multiple times. The problems with putting the loss function in a class are ...

For your point 1, I don't think the class will have any extra lines of code. If I pass the args SimpleNamespace to the class constructor, the number of lines should be exactly the same as in the other version. For point 2, while I can see the cost of jumping between the two locations for some people, to me it is much clearer to put the loss into a separate function / class. That being said, while I would prefer a separate class, I'm fine with following the format of the other algorithms too. I will first put up a version that has the loss in a separate class, and once you feel there aren't any mistakes in it, I will refactor it to the format the other algorithms follow. That way I don't have to worry that my current implementation itself is wrong (it does reproduce, and in some cases improve upon, the performance reported in the paper, but you never know with RL). Is that okay?
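For concreteness, the structure I have in mind looks roughly like the sketch below (the class and method names are hypothetical, not final PR code): the script body keeps the usual rollout and replay plumbing, and the algorithm-specific math lives in one place.

```python
from types import SimpleNamespace

import torch


class TQCLosses:
    """Hypothetical container for the algorithm-specific math."""

    def __init__(self, args: SimpleNamespace, actor, critic, critic_target):
        # Passing the parsed `args` namespace keeps the constructor short,
        # so the line count matches the inline version.
        self.args = args
        self.actor = actor
        self.critic = critic
        self.critic_target = critic_target

    def critic_loss(self, batch) -> torch.Tensor:
        ...  # quantile Huber loss against the truncated target atoms

    def actor_loss(self, batch) -> torch.Tensor:
        ...  # maximize the mean of predicted quantiles, plus entropy term
```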

I might have a few more questions while I write the PR. Is it okay to ask them here?

vwxyzjn commented 2 years ago

Out of curiosity, if I put up a PR (separately) just for upgrading the gym version to 0.25.1, what would be required to merge it in?

The main thing would be to ensure the scripts run without error.

Would you be running all the algorithms currently present to convergence?

No, this would not be required unless we are highly suspicious about a particular environment / script.

For now, I can see that TQC, even without differentiating between terminated and truncated episodes, performs quite well. I can put up a PR first that uses the dones from the gym env as is. Once the rest of the PR is approved, I can change it to differentiate between terminated and truncated. Is that okay?

Yes this is ok :)

For your point 1, I don't think the class will have any extra lines of code. If I pass the args SimpleNamespace to the class constructor, the number of lines should be exactly the same as in the other version. ...

Perfectly reasonable.

I might have a few more questions while I write the PR. Is it okay to ask them here?

Here would be a perfect place to ask. Discord is also good.

AdityaGudimella commented 2 years ago

I put up an initial PR here. I've tested it on Hopper-v3 and Humanoid-v3. Please take a look.

AdityaGudimella commented 2 years ago

I also created a new issue for the gym upgrade here.