kevslinger / DTQN

Deep Transformer Q-Networks for Partially Observable Reinforcement Learning

Reproduction of GridVerse results #2

Closed MarcoMeter closed 1 year ago

MarcoMeter commented 1 year ago

Hi!

Thanks for sharing your intriguing ideas on how to set up a transformer-based memory DRL algorithm. I'm interested in how the transformer's interface works during inference and optimization, so I started by simply trying to reproduce the GridVerse results stated in your readme.

I'm currently running 3 repetitions of this experiment:

python run.py --env gv_memory.7x7.yaml --inembed 128 --disable-wandb --verbose

The success rate has stayed at zero for the entire training run so far.

[ December 15, 14:00:03 ] Training Steps: 699000, Success Rate: 0.00, Return: -25.00, Episode Length: 500.00, Hours: 3.77
[ December 15, 14:00:23 ] Training Steps: 700000, Success Rate: 0.00, Return: -25.00, Episode Length: 500.00, Hours: 3.78
[ December 15, 14:00:42 ] Training Steps: 701000, Success Rate: 0.00, Return: -25.00, Episode Length: 500.00, Hours: 3.78
[ December 15, 14:01:02 ] Training Steps: 702000, Success Rate: 0.00, Return: -25.00, Episode Length: 500.00, Hours: 3.79

I'm pretty sure I missed something. It would be great if you could help.

edit: Training on 5x5 looks pretty volatile in comparison to the reported results.

[ December 15, 14:10:52 ] Training Steps: 891000, Success Rate: 0.30, Return: -2.24, Episode Length: 4.90, Hours: 3.92
[ December 15, 14:11:04 ] Training Steps: 892000, Success Rate: 0.70, Return: 1.78, Episode Length: 4.40, Hours: 3.92
[ December 15, 14:11:17 ] Training Steps: 893000, Success Rate: 0.40, Return: -1.20, Episode Length: 4.00, Hours: 3.92
[ December 15, 14:11:30 ] Training Steps: 894000, Success Rate: 0.70, Return: -0.42, Episode Length: 48.40, Hours: 3.93
kevslinger commented 1 year ago

Hi Marco,

Thanks for your interest in DTQN! I'm sorry you're having issues. Can you let me know which branch you're using to run these experiments? I recently merged some experiment-breaking changes into the main branch and have been working this week on fixing that. The paper branch is the most stable -- I have intentionally kept that branch frozen to stay consistent with my paper.

As for the volatility you see, the output from --verbose is the result of a single evaluation step (10 episodes). All the result figures in my paper are smoothed across 10 evaluation steps, which I find more readable while still preserving the overall performance.
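For reference, the smoothing is essentially a rolling mean over evaluation steps. Here is a minimal sketch of the idea (not the exact code used for the paper figures; the numbers are made up):

```python
import numpy as np

def smooth(values, window=10):
    """Rolling mean over the last `window` evaluation steps (illustrative only)."""
    values = np.asarray(values, dtype=float)
    kernel = np.ones(window) / window
    # 'valid' drops the first window-1 points instead of padding them
    return np.convolve(values, kernel, mode="valid")

# e.g. success rates from consecutive --verbose evaluation steps
success_rates = [0.3, 0.7, 0.4, 0.7, 0.6, 0.8, 0.5, 0.9, 0.7, 0.8, 0.6, 0.9]
print(smooth(success_rates))  # much less jumpy than the raw values
```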

MarcoMeter commented 1 year ago

I knew I missed something. Thanks for the heads up. I'll re-run the experiments using the paper branch.

kevslinger commented 1 year ago

Sounds good!

MarcoMeter commented 1 year ago

dtqn_7x730.log dtqn_5x510.log dtqn_5x520.log dtqn_5x530.log dtqn_7x710.log dtqn_7x720.log

The results look more plausible now, thanks again. During evaluation, does each evaluation iteration utilize the same set of seeds to run the episodes?

edit: I just took some time to step through your code with the debugger. One detail puzzles me. Your context vectors are initialized with distinct constant values; for instance, every item of the observation context vector is initialized to [22, 22, 22, 22, 22, 22] (GridVerse 5x5). To my understanding, one buffer item stores the entire context for one timestep. If the episode length is shorter than the context length, some observations in the context are set, while the others keep their initial values. I take these values to be padding, correct? In that case, shouldn't you add a key_padding_mask in the attention forward pass, or mask out these paddings during loss computation? Otherwise the padding could hurt the optimization process.
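For clarity, this is the kind of key padding masking I have in mind; a minimal sketch using torch.nn.MultiheadAttention with purely illustrative shapes and tensors (not taken from your code):

```python
import torch
import torch.nn as nn

batch_size, context_len, embed_dim = 2, 5, 8
attn = nn.MultiheadAttention(embed_dim, num_heads=2, batch_first=True)

x = torch.randn(batch_size, context_len, embed_dim)

# True marks padded timesteps; here the last steps of each context are padding
key_padding_mask = torch.tensor([[False, False, False, True, True],
                                 [False, False, False, False, True]])

# Padded keys are excluded when the attention weights are computed
out, _ = attn(x, x, x, key_padding_mask=key_padding_mask)
print(out.shape)  # torch.Size([2, 5, 8])
```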

kevslinger commented 1 year ago

Glad to hear you're able to get things working!

When we start a run, we create two environments: one for data collection and one for evaluation. Both are initialised with the same seed.

I initialise the observation mask (padding) to be some value that won't be seen in the environment's natural observations. During data collection and evaluation, I trim the agent's context to remove the padding. But you're right, during training, this padding could hurt the optimisation process.

We use (what I call) intermediate Q-value prediction (IQP), which is, I think, one technique that helps alleviate this potential issue. Here's how it works: suppose, for simplicity, we have a context length of 5, but our episode only took 3 steps. Then our context in the replay buffer would be [obs1, obs2, obs3, pad, pad]. The output of DTQN for a batch of shape [batch_size, context_len, obs_len] is [batch_size, context_len, num_actions], so each subhistory in the context gets a vector of Q-values for taking each action given that subhistory. In our simple example, that would be:

[obs1] -> Q-values
[obs1, obs2] -> Q-values
[obs1, obs2, obs3] -> Q-values
[obs1, obs2, obs3, pad] -> Q-values
[obs1, obs2, obs3, pad, pad] -> Q-values

We train on all of those generated Q-values (using a DQN loss), so we still get the "correct" training signal (namely, [obs1, obs2, obs3] -> Q-values). But I agree that this is not perfect and could be improved. A while back, I played around with key padding masks but found them very unstable (my loss and gradient norms were skyrocketing). I definitely think this is worth taking a look at again, though. Masking out the padding during loss computation could be the right way to do it. Do you have any suggestions or pointers? Thanks!
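To make that concrete, here is a rough sketch of the kind of per-subhistory DQN loss I mean, with toy random tensors standing in for the network outputs (not the actual implementation):

```python
import torch
import torch.nn.functional as F

# Toy shapes only -- random tensors stand in for the online and target networks.
batch_size, context_len, num_actions = 32, 5, 4
gamma = 0.99

q_values = torch.randn(batch_size, context_len, num_actions, requires_grad=True)  # online net
next_q_values = torch.randn(batch_size, context_len, num_actions)                 # target net
actions = torch.randint(num_actions, (batch_size, context_len))
rewards = torch.randn(batch_size, context_len)
dones = torch.zeros(batch_size, context_len)

# One TD target per subhistory in the context -- padded subhistories included
q_taken = q_values.gather(2, actions.unsqueeze(-1)).squeeze(-1)            # [batch, context]
targets = rewards + gamma * (1 - dones) * next_q_values.max(dim=2).values
loss = F.mse_loss(q_taken, targets.detach())
loss.backward()
```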

MarcoMeter commented 1 year ago

So I assume the environment "levels" are different for each evaluation pass. It would be beneficial to have a fixed set of environment seeds for evaluation. That would make it easier to compare the agent's performance at different training steps, since the agent would always be evaluated under the same conditions.
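As a hypothetical sketch of what I mean (assuming a Gymnasium-style reset(seed=...) API and an agent with an act() method, not your actual evaluation code):

```python
# Reuse the same seeds on every evaluation pass so the levels stay fixed.
EVAL_SEEDS = list(range(10))

def evaluate(agent, env):
    successes = 0
    for seed in EVAL_SEEDS:
        obs, _ = env.reset(seed=seed)   # same level for this seed every time
        done, reward = False, 0.0
        while not done:
            action = agent.act(obs)
            obs, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
        successes += int(reward > 0)    # illustrative success criterion
    return successes / len(EVAL_SEEDS)
```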

Concerning the removal of padding during loss computation, you could add a loss mask (a bool tensor) to the experience tuple that is added to the ReplayBuffer. This tensor can then be used to index the actual data and thus drop the padding, for example:

loss_mask = torch.tensor([True, True, True, True, False, False])
padded = torch.tensor([1, 2, 3, 4, 0, 0])
pads_removed = padded[loss_mask]  # -> tensor([1, 2, 3, 4])
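Alternatively, instead of indexing, the same mask could weight the squared TD error so that padded timesteps contribute nothing to the gradient (illustrative sketch with made-up tensors):

```python
import torch

# Illustrative only: average the squared TD error over the unpadded timesteps.
td_error = torch.randn(32, 5, requires_grad=True)   # [batch_size, context_len]
loss_mask = torch.rand(32, 5) > 0.3                 # True where the timestep is real data
loss = (td_error.pow(2) * loss_mask).sum() / loss_mask.sum()
loss.backward()
```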

Based on my experience with recurrent PPO, padding leaking into the loss computation becomes an issue on more complex environments. When training on CartPole (masked velocity), the padding issue is not really apparent, but once training moves to MiniGrid-Memory (84x84x3 visual observations), the agent gets into trouble.

kevslinger commented 1 year ago

Yes, the "levels" are randomised on each reset. We use the same seeds when comparing across algorithms, so each algorithm is evaluated on the same environments. I'm not quite sure I understand how beneficial fixing the seeds for environments would actually be. We want to test our agent's to generally solve the environment, not solve a fixed sequence of episodes. That sounds like it could promote overfitting. Maybe I am misunderstanding.

Thanks for the padding example, I will take a look at using this soon! Padding is a big pain to deal with when trying to solve POMDPs.

MarcoMeter commented 1 year ago

Concerning environment levels, I'm just referring to the evaluation environment and not the training one. In your case, the first evaluation pass evaluates on different levels compared to later passes within one training run.

kevslinger commented 1 year ago

It might be interesting to compare how an agent performs on the same 10 levels at each evaluation step vs. how it performs on 10 random levels at each evaluation step. On something like GridVerse, the number of levels is probably small enough that it wouldn't make a huge difference, but perhaps it would for other domains with more randomisable features.