medipixel / rl_algorithms

Structural implementation of RL key algorithms
https://www.medipixel.io/
MIT License

Loss jumps to a big number and then zero (Rainbow DQN) #270

Closed: pedrolob closed this issue 3 years ago

pedrolob commented 4 years ago

I've been working with the code in different environments like Pong and other NES games, but almost every time I see the same pattern: the loss goes down normally, but after some point it jumps to a very big value and then drops to zero. After that, it doesn't converge. Is it catastrophic forgetting or something else? Could it improve with hyperparameter tuning, or is it a bug in the code? I tried a lot of parameter settings for a week but nothing changed, neither improvement nor degradation. What else can I do?

jinPrelude commented 4 years ago

Thanks for using our repo! Unfortunately, we have only tested in the environments mentioned in the README. Based on several performance tests in those environments and in our own environment, we determined that our algorithms work pretty well. In certain environments C51 has higher performance than IQN, so various attempts are recommended, including hyperparameter tuning. We are sorry that we cannot help you further with your problem, and we look forward to your good news.

pedrolob commented 4 years ago

Thanks for the answer. Is there any config file for C51?

jinPrelude commented 4 years ago

We use C51 for training LunarLander (https://github.com/medipixel/rl_algorithms/blob/master/configs/lunarlander_v2/dqn.py). If you want to train C51 on Pong, you can do it by replacing pong_dqn's IQN head with lunarlander_dqn's C51 head and changing pong_dqn's loss_type from "IQNLoss" to "C51Loss". This is what a pong_c51 config looks like:

"""Config for C51 on Pong-No_FrameSkip-v4.
- Author: Kyunghwan Kim
- Contact: kh.kim@medipixel.io
"""
from rl_algorithms.common.helper_functions import identity

agent = dict(
    type="DQNAgent",
    hyper_params=dict(
        gamma=0.99,
        tau=5e-3,
        buffer_size=int(1e4),  # openai baselines: int(1e4)
        batch_size=32,  # openai baselines: 32
        update_starts_from=int(1e4),  # openai baselines: int(1e4)
        multiple_update=1,  # multiple learning updates
        train_freq=4,  # in openai baselines, train_freq = 4
        gradient_clip=10.0,  # dueling: 10.0
        n_step=3,
        w_n_step=1.0,
        w_q_reg=0.0,
        per_alpha=0.6,  # openai baselines: 0.6
        per_beta=0.4,
        per_eps=1e-6,
        loss_type=dict(type="C51Loss"),
        # Epsilon Greedy
        max_epsilon=0.0,
        min_epsilon=0.0,  # openai baselines: 0.01
        epsilon_decay=1e-6,  # openai baselines: 1e-7 / 1e-1
        # grad_cam
        grad_cam_layer_list=[
            "backbone.cnn.cnn_0.cnn",
            "backbone.cnn.cnn_1.cnn",
            "backbone.cnn.cnn_2.cnn",
        ],
    ),
    learner_cfg=dict(
        type="DQNLearner",
        backbone=dict(
            type="CNN",
            configs=dict(
                input_sizes=[4, 32, 64],
                output_sizes=[32, 64, 64],
                kernel_sizes=[8, 4, 3],
                strides=[4, 2, 1],
                paddings=[1, 0, 0],
            ),
        ),
        head=dict(
            type="C51DuelingMLP",
            configs=dict(
                hidden_sizes=[512],
                v_min=-10,
                v_max=10,
                atom_size=51,
                output_activation=identity,
                # NoisyNet
                use_noisy_net=True,
                std_init=0.5,
            ),
        ),
        optim_cfg=dict(
            lr_dqn=1e-4,  # dueling: 6.25e-5, openai baselines: 1e-4
            weight_decay=0.0,  # this makes saturation in cnn weights
            adam_eps=1e-8,  # rainbow: 1.5e-4, openai baselines: 1e-8
        ),
    ),
)

pedrolob commented 4 years ago

I did some testing with C51 but I didn't see any improvement. What I found is that the problem is in the Adam optimizer:

The "amsgrad=True" parameter makes the training slower, but at the end the same thing happened. One of the solutions that someone proposed is gradient clipping (that I see the librar has) and batch normalization ( I don't know how to test that). Also, I wonder if "adam_eps" is for decrease the learning rate during the training (one of the solutions proposed).

jinPrelude commented 3 years ago

We are currently using gradient clipping for training; you can edit the gradient clipping parameter in hyper_params in the config file. Batch normalization and learning-rate decay are not applied yet. Maybe you can use torch.optim.lr_scheduler to adjust the learning rate (https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate). It might be hard for us to implement those methods in the master branch soon, since we have other primary tasks, but we'll try it when we can afford the time. Applying lr_scheduler wouldn't be hard, so maybe you could try it yourself. Thanks for your feedback, it really helped. And if you don't mind, please let me know which environment you are training in, so I can run our repo in that environment and check what the problems are in my personal time :)
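
For example, attaching a scheduler to the optimizer could look like this (a standalone sketch with a placeholder network, dummy loss, and illustrative numbers; the learner in this repo builds its optimizer internally, so the actual wiring would differ):

import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # placeholder Q-network
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)

# Halve the learning rate every 1000 optimizer steps (values are illustrative only).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1000, gamma=0.5)

for update in range(5000):                                  # stand-in for the training loop
    dummy_loss = q_net(torch.randn(32, 4)).pow(2).mean()    # stand-in for the DQN loss
    optimizer.zero_grad()
    dummy_loss.backward()
    optimizer.step()
    scheduler.step()                                        # decay the LR on the given schedule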

pedrolob commented 3 years ago

My environment is Mario using Gym Retro. What I tried, and what seems to improve the situation, is using a bigger batch_size of 64 or 128. That helps with training stability, and in my tests I didn't see the problem again. I have to test it more, but I like what I saw. I am also going to give the learning-rate adjustment a try; when I have some results I will share them.
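
Concretely, the only change is the batch_size entry in hyper_params of the config shown above:

agent = dict(
    type="DQNAgent",
    hyper_params=dict(
        # ... all other entries as in the config above ...
        batch_size=64,  # was 32; 64 or 128 seemed to stabilize training in my runs
    ),
)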

pedrolob commented 3 years ago

I've been doing more tests but the problem continues. Increasing the batch size works sometimes, but not always. That's the problem with this issue: it might occur 5 minutes after training starts or 5 hours in, and sometimes it never happens; it's completely random. I didn't give the learning-rate decay a chance because, for the reasons I mentioned before, I don't think it can help. I don't know what else to try; I've tested every parameter and none of them seems to improve or aggravate the situation (the last one was a very low LR). Maybe it's some weird error with the version of numpy or pytorch, but I really don't know. I also thought it might be an issue with my hardware, some kind of failure, but unfortunately I don't have a way to verify that.

jinPrelude commented 3 years ago

I tried training IQN in the Mario environment (https://pypi.org/project/gym-super-mario-bros/). The performance seems to require additional tuning, but training was stable (though our environments could be different). The loss was definitely bigger than I usually see in other environments, but it didn't diverge or cause a fatal error. I made a branch for my Mario environment test here, so give it a try: https://github.com/medipixel/rl_algorithms/tree/mario. This is what the training graphs look like:

[training graphs: blue is CNN IQN with grayscale state, e-greedy, and NoisyNet; green is ResNet IQN with grayscale state, NoisyNet, and batch size 32]

I have only tried these two setups, but I think there are more attempts worth trying.

If you have any other questions, feel free to contact us; you can also mail me directly.

pedrolob commented 3 years ago

I think I found the problem. When I train a game like Mario, I use the change in position to reward the agent: every positive change in x gives a reward of 1. The issue seems to be in the n-step algorithm, which joins the rewards of 3 steps; from what I saw, you end up with 4 different rewards: 0, 1, 2, and 2.97 (I don't know why). That seems to cause instability in the network, and then the loss explosion occurs. What I tried was to remove n-step (n_step = 1) and also clip the reward to 0.1. For now I don't see any issue, but I need to do more tests in case I'm just getting lucky. It seems like having so many different reward values causes some kind of instability. Also, I saw your test @jinPrelude, and I noticed that you used gym-super-mario-bros; in my case I use Gym Retro, and maybe that is the problem. When I train using only the score, the problem doesn't occur either, but it doesn't converge as fast as when using the position. If disabling n-step solves the problem, maybe clipping the reward at the point where the experiences are joined is the key.
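
For what it's worth, with the gamma=0.99 from the config above, the standard discounted 3-step sum reproduces those values, so that is presumably where the 2.97 comes from (a sketch of the usual n-step return, not necessarily the repo's exact buffer code):

# Standard n-step return with gamma = 0.99 (from the config above); this is my reading
# of why the joined rewards come out as 0, 1, ~2 (1.99), and 2.97.
GAMMA = 0.99

def n_step_return(rewards):
    """Discounted sum of up to n consecutive rewards."""
    return sum((GAMMA ** i) * r for i, r in enumerate(rewards))

print(n_step_return([0, 0, 0]))  # 0.0
print(n_step_return([1, 0, 0]))  # 1.0
print(n_step_return([1, 1, 0]))  # 1.99    -> the "2" I was seeing
print(n_step_return([1, 1, 1]))  # 2.9701  -> the "2.97" I was seeing

If that reading is right, clipping each step reward to 0.1 would keep the joined reward below about 0.3, which might be why it seems to help.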