vwxyzjn / cleanrl

High-quality single file implementation of Deep Reinforcement Learning algorithms with research-friendly features (PPO, DQN, C51, DDPG, TD3, SAC, PPG)
http://docs.cleanrl.dev

SAC-discrete implementation #270

Closed · timoklein closed this 1 year ago

timoklein commented 2 years ago

Description

Adds the SAC-discrete algorithm as discussed in https://github.com/vwxyzjn/cleanrl/issues/266.

Types of changes

Checklist:

If you are adding new algorithms or your change could result in performance difference, you may need to (re-)run tracked experiments. See https://github.com/vwxyzjn/cleanrl/pull/137 as an example PR.

vercel[bot] commented 2 years ago

The latest updates on your projects (Vercel for Git):

cleanrl: ✅ Ready, updated Jan 13, 2023 at 2:28PM (UTC)
dosssman commented 2 years ago

Is there any benchmark run for this? How does it perform? Anyway, great job @timoklein!

timoklein commented 2 years ago

The results of the original paper are in https://github.com/vwxyzjn/cleanrl/issues/266. There are quite a number of games where its performance at 100k steps doesn't differ much from a random agent's, e.g. Frostbite. I don't think it makes much sense to evaluate those.

Right now I just ran it a couple of times manually to see that the code actually works. A modified version of this codebase has been able to solve Catcher (PyGame) and some simple Minigrid environments.

In general, I'm going to try to find some good hyperparameters and then run it on a few environments where performance actually differs from a random agent. I don't know for sure when I'll have time for that though... Once that's done, I'll put a report in here. Maybe I'll also run it for 200k steps to verify the results.

After that, I plan on starting on some docs :)

Thanks for helping out with this @Howuhh and @dosssman !

timoklein commented 2 years ago

Posting an update: I'm currently running 1M-step experiments on Seaquest (takes a while) with two versions of the algorithm: one implemented as close as possible to CleanRL's continuous SAC, and one that stays closer to the paper.
The main difference is the update frequency: CleanRL does a critic update at every learning step and delays the actor updates, but compensates for the delay by doing several actor updates at once. As far as I understand, SAC-discrete does a single actor and critic update every four steps, without compensating for the delay.
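Roughly, the two schedules look like this (illustrative sketch only; update_critic/update_actor are hypothetical stand-ins, not cleanrl functions):

# illustrative sketch of the two update schedules being compared; the helpers are hypothetical stubs
def update_critic():
    pass

def update_actor():
    pass

total_timesteps, policy_frequency, update_frequency = 1000, 2, 4

# cleanrl sac_continuous_action.py style: critic at every learning step,
# actor delayed but compensated by running several actor updates at once
for global_step in range(total_timesteps):
    update_critic()
    if global_step % policy_frequency == 0:
        for _ in range(policy_frequency):
            update_actor()

# SAC-discrete (paper) style: one critic and one actor update every `update_frequency` steps,
# with no compensation for the delay
for global_step in range(total_timesteps):
    if global_step % update_frequency == 0:
        update_critic()
        update_actor()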

timoklein commented 2 years ago

I ran some experiments on MsPacman and Seaquest.

Here's a link to a report with some results. The entropy regularization coefficient $\alpha$ has a tendency to explode when training for longer, but I couldn't find other experiments at 1M steps to verify whether that's an issue only in my code. Maybe @Howuhh and @dosssman have an idea?

This implementation doesn't quite match the results of the paper which might be due to not using evaluation mode (i.e. deterministic policy). If it's desired I can implement a test loop evaluating a deterministic policy.
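For reference, such a test loop would be fairly small. A rough sketch, assuming a single non-vectorized eval_env with the gym>=0.26 step API and the actor's existing get_action interface (episode count and names are placeholders):

# hedged sketch: evaluate the current actor greedily (deterministically) for a few episodes
returns = []
for _ in range(10):  # number of evaluation episodes, arbitrary
    obs, _ = eval_env.reset()
    done, ep_return = False, 0.0
    while not done:
        with torch.no_grad():
            obs_t = torch.as_tensor(obs, dtype=torch.float32, device=device).unsqueeze(0)
            _, _, action_probs = actor.get_action(obs_t)
            action = action_probs.argmax(dim=1).item()  # take the mode instead of sampling
        obs, reward, terminated, truncated, _ = eval_env.step(action)
        done = terminated or truncated
        ep_return += float(reward)
    returns.append(ep_return)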

The performance does match the reported values in this other implementation I found.

Are there any other experiments you'd like me to run, e.g. specific environments or more seeds?

I have tracked applicable experiments in openrlbenchmark/cleanrl with --capture-video flag toggled on (required).

Are there ways I can do this without xvfb? It's not on our group's machines and I don't have sudo.

If everything's fine, I'm gonna start writing Docs :)

vwxyzjn commented 2 years ago

Hi @timoklein, thank you! The experiments look very interesting.

This implementation doesn't quite match the results of the paper which might be due to not using evaluation mode (i.e. deterministic policy). If it's desired I can implement a test loop evaluating a deterministic policy.

Unless it makes a huge difference in the reported results, I wouldn't worry about it.

Are there any other experiments you'd like me to run, e.g. specific environments or more seeds?

Thanks for running the environments for three random seeds, which is great. For my personal interest, I would like to see the results in Pong, Breakout, and BeamRider, which are the common three environments that we used to benchmark other algorithms (see example here).

Are there ways I can do this without xvfb? It's not on our group's machines and I don't have sudo.

If this is an issue, feel free to run the experiments without xvfb. The video recordings are nice to have, but I am not requiring every run to have them. For example, it's not practical to obtain videos in EnvPool, so I didn't do it.

That said, please save the model at the end of training so that we can visualize them later. Feel free to use the following code:

torch.save(agent.state_dict(), f"models/{run_name}/agent.pt")
if args.prod_mode:
    wandb.save(f"models/{run_name}/agent.pt", base_path=f"models/{run_name}", policy="now")
Howuhh commented 2 years ago

@timoklein Sometimes the target entropy may just be very high and hard to reach, and the loss can explode (as alpha will grow and grow), so I usually tune the coefficient a bit (which defaults to 0.98 in the code right now)

Some discussion about it: https://github.com/toshikwa/sac-discrete.pytorch/issues/15#issuecomment-729547166
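For context, a toy illustration (made-up numbers) of the dynamic described above; the loss only mirrors the structure used in the snippets later in this thread:

import torch

n_actions = 6
target_entropy = 0.98 * torch.log(torch.tensor(float(n_actions)))  # ~98% of the maximum entropy
log_alpha = torch.zeros(1, requires_grad=True)

# a fairly peaked policy whose entropy sits well below that target
action_probs = torch.tensor([[0.9, 0.02, 0.02, 0.02, 0.02, 0.02]])
log_pi = torch.log(action_probs)

# same structure as the alpha loss discussed below
alpha_loss = (action_probs * (-log_alpha * (log_pi + target_entropy))).mean()
alpha_loss.backward()
print(log_alpha.grad)  # negative -> gradient descent keeps pushing log_alpha (and thus alpha) up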

timoklein commented 2 years ago

@timoklein Sometimes the target entropy may just be very high and hard to reach, and the loss can explode (as alpha will grow and grow), so I usually tune the coefficient a bit (which defaults to 0.98 in the code right now)

Some discussion about it: toshikwa/sac-discrete.pytorch#15 (comment)

Thank you very much for the link. The 0.98 is the author's default. I'll make a parameter out of that and tune it a little bit.

Thanks for running the environments for three random seeds, which is great. For my personal interest, I would like to see the results in Pong, Breakout, and BeamRider, which are the common three environments that we used to benchmark other algorithms (see example here).

Sure, running three environments is doable. Currently our server's a little busy but once that's done I can start the experiments.

That said, please save the model at the end of training so that we can visualize them later. Feel free to use the following code:

Thanks, will do!

timoklein commented 2 years ago

I tried a small set of entropy scaling factors (0.7, 0.8, 0.9, 0.98), and the value from the original paper (0.98) seems rather suboptimal. See the report here.

Going forward with 0.8. After trying out the delayed update once more, I'm going to do three seeds on Pong, Breakout, and BeamRider up to 4M steps.

Edit: The original SAC delayed update loop does help on Pong and Breakout but learns very slowly on BeamRider. Considering it's a 4x wall-clock increase, I stopped running those experiments...

timoklein commented 1 year ago

Thank you very much for the detailed feedback. I'm going to work everything in.

The contribution is going a bit slowly because I also have other stuff to do, but the process is very valuable to me. I have some other things in mind that I want to contribute in the future, so those will hopefully go a bit faster once I'm used to everything :)

One quick edit: I've tried to minimize the diff (which has been possible in a lot of the code). One place where that's not really possible is the update rule (diff).

SAC-discrete does actor + critic updates every n steps, compared to the delayed actor updates in sac_continuous. That's a notable difference with performance implications: if you update the actor too much (e.g. with the delayed-and-compensated updates from sac_continuous), the policy degenerates after a while.

vwxyzjn commented 1 year ago

Everything looks good on my end. @dosssman would you mind redoing the review for the PR given the recent changes to the algorithm and the documentation?

Chulabhaya commented 1 year ago

Hi @timoklein! This is awesome work! I was wondering, have you tested this on a non-Atari discrete environment? I was attempting to run this on discrete Lunar Lander and it was struggling to learn, so I wasn't sure if that was an error on my part involving hyperparameters or model architectures when going from the convolutional to an MLP architecture. Thanks!

vwxyzjn commented 1 year ago

Executed some commands from #307

python -m cleanrl_utils.rlops_tags \
    --add pr-270 \
    --source-tag latest \
    --filters 'sac_atari' \
    --wandb-project-name cleanrl \
    --wandb-entity openrlbenchmark
python -m cleanrl_utils.rlops_tags \
    --remove rlops-pilot \
    --source-tag pr-270 \
    --filters 'sac_atari' \
    --wandb-project-name cleanrl \
    --wandb-entity openrlbenchmark
python -m cleanrl_utils.rlops_tags \
    --add v1.0.0b1-42-g4915e4c \
    --source-tag latest \
    --filters 'sac_atari' \
    --wandb-project-name cleanrl \
    --wandb-entity openrlbenchmark
python -m cleanrl_utils.rlops_tags \
    --add v1.0.0b1-43-g6f7251f \
    --source-tag latest \
    --filters 'sac_atari' \
    --wandb-project-name cleanrl \
    --wandb-entity openrlbenchmark

Then I did the following:

  1. manually removed the latest tag from the runs that use 0.8 as the entropy coefficient
  2. removed the v1.0.0b1-43-g6f7251f tag from the runs that use 0.8 as the entropy coefficient
  3. removed the v1.0.0b1-42-g4915e4c tag from the runs that use 0.88 as the entropy coefficient

As a result, the comparison can be generated with:

python -m cleanrl_utils.rlops --wandb-project-name cleanrl \
    --wandb-entity openrlbenchmark \
    --filters 'sac_atari?tag=pr-270&tag=v1.0.0b1-43-g6f7251f' \
    --env-ids BreakoutNoFrameskip-v4 PongNoFrameskip-v4 BeamRiderNoFrameskip-v4 \
    --output-filename compare.png --scan-history --report

I then cloned the report to the openrlbenchmark/openrlbenchmark namespace and manually overrode the runset names to be CleanRL's sac_atari.py:

https://wandb.ai/openrlbenchmark/openrlbenchmark/reports/Atari-CleanRL-s-sac_atari-py--VmlldzoyOTYzNjg3

timoklein commented 1 year ago

Hi @timoklein! This is awesome work! I was wondering, have you tested this on a non-Atari discrete environment? I was attempting to run this on discrete Lunar Lander and it was struggling to learn, so I wasn't sure if that was an error on my part involving hyperparameters or model architectures when going from the convolutional to an MLP architecture. Thanks!

Hey there! I've personally used it on two other environments: Minigrid and Catcher.

On Minigrid, I'm using the RIDE encoder with target_network_frequency: 20000, target_entropy_scale: 0.8 (for an action space where useless actions are removed) and update_frequency: 2.

I haven't used it in non-pixel environments yet. In general, I found the algorithm to be sensitive to the target_entropy_scale: if you enforce a more uniform policy (i.e. a high entropy scale) but the action space contains actions that are rarely or never needed, it won't learn properly.

I'll try a little bit as well.
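For anyone swapping the convolutional encoder for an MLP, a rough sketch of what a discrete actor could look like (class name, layer sizes, and the 1e-8 offset are illustrative choices, not tuned values; it only mirrors the (action, log_pi, action_probs) interface used in this PR):

import torch
import torch.nn as nn
from torch.distributions import Categorical


class MLPActor(nn.Module):
    """Hypothetical MLP actor for non-pixel discrete envs."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def get_action(self, x: torch.Tensor):
        logits = self.net(x)
        dist = Categorical(logits=logits)
        action = dist.sample()
        action_probs = dist.probs
        log_prob = torch.log(action_probs + 1e-8)  # small offset avoids log(0) for zero-probability actions
        return action, log_prob, action_probs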

araffin commented 1 year ago

tagging myself @araffin for review (later this week probably)

dosssman commented 1 year ago

@timoklein Here is a snippet that would address @araffin's comments: https://github.com/vwxyzjn/cleanrl/pull/270#discussion_r1031332675

why do you keep dim here as you flatten it in the next line?

https://github.com/vwxyzjn/cleanrl/pull/270#discussion_r1031337608

it's a bit weird to call it qf1_pi here as it doesn't depend on the actor.

To make it even more explicit, I guess you could even do a min_qf_pi.detach().

https://github.com/vwxyzjn/cleanrl/pull/270#discussion_r1031338611

you could re-use log_pi, action_probs from before, no? (don't forget the detach() in that case)

It essentially reduces the number of calls to the Q networks and the policy (especially when using entropy autotuning) by re-using the first call and leveraging .detach() to stop gradients where necessary. This will also require using a joint loss total_loss = actor_loss + qf_loss so that .backward() is called only once (otherwise it will probably throw a "gradient was at version XX but ..." error). However, it might make the code a bit harder to understand for beginners, so I'm not sure this is preferable to the current method. Any thoughts @vwxyzjn regarding the complexity of this factored variant?

if global_step > args.learning_starts and global_step % args.update_frequency == 0:
    data = rb.sample(args.batch_size)
    # CRITIC training
    with torch.no_grad():
        _, next_state_log_pi, next_state_action_probs = actor.get_action(data.next_observations)
        qf1_next_target = qf1_target(data.next_observations)
        qf2_next_target = qf2_target(data.next_observations)
        # we can use the action probabilities instead of MC sampling to estimate the expectation
        min_qf_next_target = next_state_action_probs * (
            torch.min(qf1_next_target, qf2_next_target) - alpha * next_state_log_pi
        )
        # adapt Q-target for discrete Q-function
        min_qf_next_target = min_qf_next_target.sum(dim=1)
        next_q_value = data.rewards + (1 - data.dones) * args.gamma * (
            min_qf_next_target
        ) # NOTE: double check the output dimension

    # use Q-values only for the taken actions
    qf1_values = qf1(data.observations)
    qf2_values = qf2(data.observations)
    qf1_a_values = qf1_values.gather(1, data.actions.long()).view(-1)
    qf2_a_values = qf2_values.gather(1, data.actions.long()).view(-1)
    qf1_loss = F.mse_loss(qf1_a_values, next_q_value)
    qf2_loss = F.mse_loss(qf2_a_values, next_q_value)
    qf_loss = qf1_loss + qf2_loss

    # ACTOR training
    _, log_pi, action_probs = actor.get_action(data.observations)
    min_qf_values = torch.min(qf1_values, qf2_values)
    # no need for reparameterization, the expectation can be calculated for discrete actions
    # NOTE regarding joint loss optim: detaching qf values.
    actor_loss = (action_probs * ((alpha * log_pi) - min_qf_values.detach())).mean()

    # Loss for joint actor and network training
    total_loss = qf_loss + actor_loss

    if args.autotune:
        # use action probabilities for temperature loss
        # NOTE regarding joint loss optim: detaching log_pi from the policy
        alpha_loss = (action_probs * (-log_alpha * (log_pi.detach() + target_entropy))).mean()

        total_loss += alpha_loss

        a_optimizer.zero_grad()

    # NOTE: call .backward() only once to compute gradients for all the components
    q_optimizer.zero_grad()
    actor_optimizer.zero_grad()
    total_loss.backward()
    q_optimizer.step()
    actor_optimizer.step()

    if args.autotune:
        a_optimizer.step()
        alpha = log_alpha.exp().item()

Hope it helps.

araffin commented 1 year ago

This will also require to use a joint loss total_loss = actor_loss + qf_loss so that .backward() is called only once

you could detach action_probs too and no joint loss is needed then?

dosssman commented 1 year ago

Indeed.

Unlike sac continuous, the output of the policy is not fed through the Q functions for the actor loss, so joint loss optimization will not be required.
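Concretely, something along these lines should do it (sketch only, re-using the names from the snippet above):

# re-use log_pi / action_probs from the actor update, but cut the graph for the alpha loss,
# so actor_loss.backward() and alpha_loss.backward() stay independent and no joint loss is needed
if args.autotune:
    alpha_loss = (action_probs.detach() * (-log_alpha * (log_pi.detach() + target_entropy))).mean()
    a_optimizer.zero_grad()
    alpha_loss.backward()
    a_optimizer.step()
    alpha = log_alpha.exp().item()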



timoklein commented 1 year ago

Thanks for the input @araffin !

I'm going to produce a version based on @dosssman 's snippet.

min_qf_next_target = min_qf_next_target.sum(dim=1)
next_q_value = data.rewards + (1 - data.dones) * args.gamma * (
    min_qf_next_target
) # NOTE: double check the output dimension

You either have to flatten or keep the dim in the sum to avoid an unintended outer product. I'll go with flatten because it minimizes the diff between this and sac_continuous. For the same reason I'm not going to opt for the joint loss.
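A tiny, self-contained illustration of the broadcasting pitfall (batch size 4 just for the example; shapes match what the replay buffer returns here):

import torch

rewards = torch.zeros(4, 1)           # replay buffer rewards have shape (batch, 1)
min_qf_next_target = torch.zeros(4)   # shape (batch,) after .sum(dim=1)
print((rewards + min_qf_next_target).shape)            # torch.Size([4, 4]) -> unintended outer-product broadcast
print((rewards.flatten() + min_qf_next_target).shape)  # torch.Size([4])    -> what we actually want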

Since there have been a bunch of changes now (e.g. diff minimization, https://github.com/vwxyzjn/cleanrl/pull/323 and now these), I'll also re-run the experiments to make sure we have proper results. That's going to take at least the weekend.

timoklein commented 1 year ago

@araffin @dosssman I've run some more experiments over the last week, and here are some findings:

Below are some examples. Blue gets new values for the alpha loss with

with torch.no_grad():
    _,  log_pi, action_probs = actor.get_action(data.observations)

Grey and yellow re-use the actor values with .detach(). Obviously that's not a thorough ablation, but yellow and blue otherwise use the exact same configuration/code with torch.deterministic, so the results should be exactly the same.

(plot: SAC_d_re_use_alpha)

timoklein commented 1 year ago

The codespell check complains about the name "Jimmy Ba". It doesn't seem to have inline ignores (issue), so I'm not quite sure how to handle it.

TODOS:

dosssman commented 1 year ago

eps=1e-4 for Adam is required. Without this, there are seeds where SAC-d doesn't learn at all on Pong. This setting is also used in the author's codebase. It took me a while to figure this one out...

Nice catch. That is quite an innocuous implementation detail which actually has a drastic impact.

  • Re-using log_pi and action_probs for the alpha loss results in slower learning (but increases throughput considerably). I would therefore like to keep the code as it is. This also keeps it closer to the continuous SAC implementation.

Fair enough. We can revisit this if we go over the SAC algorithms in general at a later point.

  • One could use no_grad/detach for the Q-values used in the actor loss. This yields a 5% speedup in terms of SPS with minuscule effects on learning on Pong. The original codebase doesn't use no_grad (see here). I don't think 5% is worth deviating from the reference implementation. Re-using Q-values from the critic loss is a bad idea and hurts learning speed a lot.

Interesting. Could you please share the code for the variant that re-uses Q networks ?

timoklein commented 1 year ago

Nice catch. That is quite an innocuous implementation detail which actually has a drastic impact.

Indeed. I know of some papers that tune it, but most don't. Rainbow also uses a non-standard setting afaik (1.5e-4). I found a discussion here, though it's not particularly aimed at RL.
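For reference, the change under discussion is just the optimizer epsilon; something like the following (learning rates are placeholders; qf1/qf2/actor are the networks from the snippets above):

import torch.optim as optim

# eps raised from PyTorch's default 1e-8 to 1e-4, as in the author's codebase
q_optimizer = optim.Adam(list(qf1.parameters()) + list(qf2.parameters()), lr=3e-4, eps=1e-4)
actor_optimizer = optim.Adam(list(actor.parameters()), lr=3e-4, eps=1e-4)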

Interesting. Could you please share the code for the variant that re-uses Q networks ?

Looks like this. Fetching new values with no_grad is not an issue.

# CRITIC training
with torch.no_grad():
    _, next_state_log_pi, next_state_action_probs = actor.get_action(data.next_observations)
    qf1_next_target = qf1_target(data.next_observations)
    qf2_next_target = qf2_target(data.next_observations)
    # we can use the action probabilities instead of MC sampling to estimate the expectation
    min_qf_next_target = next_state_action_probs * (
        torch.min(qf1_next_target, qf2_next_target) - alpha * next_state_log_pi
    )
    # adapt Q-target for discrete Q-function
    min_qf_next_target = min_qf_next_target.sum(dim=1)
    next_q_value = data.rewards.flatten() + (1 - data.dones.flatten()) * args.gamma * (min_qf_next_target)

# use Q-values only for the taken actions
qf1_values = qf1(data.observations)
qf2_values = qf2(data.observations)
qf1_a_values = qf1_values.gather(1, data.actions.long()).view(-1)
qf2_a_values = qf2_values.gather(1, data.actions.long()).view(-1)
qf1_loss = F.mse_loss(qf1_a_values, next_q_value)
qf2_loss = F.mse_loss(qf2_a_values, next_q_value)
qf_loss = qf1_loss + qf2_loss

q_optimizer.zero_grad()
qf_loss.backward()
q_optimizer.step()

# ACTOR training
_, log_pi, action_probs = actor.get_action(data.observations)
# NO NEW VALUES HERE!
# qf1_values = qf1(data.observations)
# qf2_values = qf2(data.observations)
min_qf_values = torch.min(qf1_values, qf2_values).detach()
# no need for reparameterization, the expectation can be calculated for discrete actions
actor_loss = (action_probs * ((alpha * log_pi) - min_qf_values)).mean()

It's a bit hand-wavy, but my best explanation is that not using up-to-date values will result in steps that are slightly off. Over time, these errors accumulate and throw off training. I would assume the effect is more pronounced in later stages on Pong due to the environment structure: No or slow progress at the start, then the agent figures out the environment and rapidly solves it.

dosssman commented 1 year ago

It's a bit hand-wavy, but my best explanation is that not using up-to-date values will result in steps that are slightly off. Over time, these errors accumulate and throw off training. I would assume the effect is more pronounced in later stages on Pong due to the environment structure: No or slow progress at the start, then the agent figures out the environment and rapidly solves it.

Well, that might be one reason. Although intuitively, it does not seem like just one Q-network update would affect the critic score used to update the policy; the difference in the outputs of qf1 and qf2 before and after one update step is probably marginal.

In any case, I concur that we can probably keep the default implementation to stay consistent with the reference implementation as well as the continuous SAC variant. Later, it might be interesting to investigate that aspect a bit more, albeit on a simpler task without pixel observations first, I think.

Great work. Overall looking good to me.

timoklein commented 1 year ago

@dosssman

Monday's results left me no peace because they were run on just a single seed. So I did run some more experiments covering the different update rules. (Not so) surprisingly, the results changed:

(plot: sac_d_update_ablations)

Right now I'm running the 5M-step experiments for the openrlbenchmark with nograd_Q_reuse_alpha. Let's see what happens on the other games, since I've only looked at Pong so far. It's probably also game-dependent, but since this version is also the fastest one, I'll keep it.

Thanks for all the input and ideas! Once the experiments are done, I'm going to update the results and remaining todos and we can get this wrapped up :)

Chulabhaya commented 1 year ago

Hey all! I was wondering, could anyone explain why exactly the target_entropy is calculated as -scaling * torch.log(1 / torch.tensor(envs.single_action_space.n))? In the original SAC paper they calculate target_entropy as -torch.tensor(envs.single_action_space.n). What is the mathematical significance of the difference between these two?

Just from my basic knowledge I assumed maximum entropy was calculated as either torch.log(torch.tensor(envs.single_action_space.n)) or -torch.log(1 / torch.tensor(envs.single_action_space.n)) (which is the formulation used in this implementation/discrete SAC paper).

Just looking to better understand how this value works, thanks in advance!

dosssman commented 1 year ago

@Chulabhaya

The original SAC implementation was developed for the continuous-action case, while this one is for discrete actions, hence the difference in computing the target entropy. Furthermore, in both cases the target entropy is a heuristic.

Maybe following down this thread can bring some clarity: https://github.com/vwxyzjn/cleanrl/pull/270#discussion_r1031334739
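To add to that, the two formulas measure different things. A small numeric illustration (the action count and the 0.8 scale are just example values):

import math

n = 6        # example discrete action count
scale = 0.8  # example target_entropy_scale

# maximum entropy of a uniform policy over n actions
max_entropy = math.log(n)                            # == -log(1/n)
# discrete-SAC heuristic used in this PR: a fraction of that maximum
target_entropy_discrete = -scale * math.log(1 / n)   # == scale * log(n)
# continuous SAC instead uses -dim(action_space), which is not derived from log(n),
# hence the two formulas look unrelated
target_entropy_continuous = -3                       # e.g. a 3-dimensional continuous action space

print(max_entropy, target_entropy_discrete, target_entropy_continuous)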

dosssman commented 1 year ago

@timoklein Thanks a lot for the detailed experiments. Indeed, running experiments with more than one seed is critical. Sometimes a seed can lead to "degenerate" results that are not really related to the algorithm itself.

reuse_Q: Re-use the Q-values used in the critic loss for the actor loss with .detach() (see the post above). Is even faster at ~96 long-run SPS. Problem here is that it destabilizes training: One seed doesn't learn at all.

From the plot, I notice that there is a large variance in the reuse_Q variant. I suspect there might be one or two seeds that are messing up the overall result, so it might be worth checking the seeds one by one.

Overall, I think re-using at least log_pi for alpha optimization (if applicable) is a reasonable trade-off (it should also be doable in the continuous variant).

timoklein commented 1 year ago

@dosssman

Yes, re-using log_pi and the action probs is what I'm doing now.

Indeed, most of the high variance and bad performance of the reuse_Q method comes from a single seed (seed=1). Removing it makes the method competitive, but arguably still worse than the version I'm currently using. As it's the only update variant failing to learn on this seed, we might also find stability issues on other seeds that I haven't evaluated.

(plot: cleanRL_sacd_removed_seed)

timoklein commented 1 year ago

From my point of view this is done now. I ran the new experiments and updated all plots and results. I also deleted all runs that aren't used for the final results from the openrlbenchmark project. The docs should also work.
Surely there will be some inevitable bugs, but just ping me in the issue and I'll try to fix them :)

dosssman commented 1 year ago

Great contribution! I think we can proceed with merging this, as it matches the baseline results while maintaining a relatively simple implementation. Other issues, if any exist, will come to light as other people try it out from a fresher perspective.

timoklein commented 1 year ago

minor: the title for the pong figure is not the right one

Thanks, good catch!

timoklein commented 1 year ago

Anything more for me to do here? If this is blocking:

The codespell check complains about the name "Jimmy Ba". It doesn't seem to have inline ignores (https://github.com/codespell-project/codespell/issues/1212), so I'm not quite sure how to handle it.

I can just remove the citation so that the CI error is gone.

timoklein commented 1 year ago

Everything LGTM. Feel free to merge after you have resolved the minor target-entropy-scale issue. You should already have contributor access. Thanks so much for this contribution, and sorry for the delay.

No problem. It's been a fun experience and I learned a lot. Looking forward to contributing more in the future!

timoklein commented 1 year ago

@vwxyzjn I'm not quite sure why this still fails https://github.com/vwxyzjn/cleanrl/actions/runs/3912194839/jobs/6686505080#step:4:92

"Ba" should be correctly added to the pre-commit https://github.com/vwxyzjn/cleanrl/blob/c3fc57db7d599fc0e531c932f2cc916eceedea4a/.pre-commit-config.yaml#L38

I'm going to fix it but it might be Sunday before I get around to doing it.

EDIT: Probably a capitalization issue https://github.com/codespell-project/codespell/issues/2137