MicheleMusacchio opened 4 months ago
I tried running for 1000 episodes, updating the agents every 100 steps and with a lower embedding dimension (256), but no good results.
There is no upward trend in the rewards, just some spikes here and there.
We always converge very quickly and then don't improve.
Same for the actor loss: quick convergence and then no improvement.
Some questions I have:
> Policy Network: The policy network has a similar structure as the observation-action encoder, which uses an attention module over the entities of each type in the observation o_i to adapt to the changing population during training. The only difference in this network is that the action is not included in the input. Notably, we do not share parameters between the Q function and the policy.
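For what it's worth, this is how I read that description in code; a minimal PyTorch sketch, not the paper's actual implementation, and `entity_dim`, `n_heads`, `n_actions` and the single pooling query are all my assumptions:

```python
import torch
import torch.nn as nn

class AttentionPolicy(nn.Module):
    def __init__(self, entity_dim, embed_dim=256, n_heads=4, n_actions=5):
        super().__init__()
        # Per-entity encoder; unlike the observation-action encoder,
        # the action is NOT part of the input (observation o_i only).
        self.encoder = nn.Linear(entity_dim, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)
        # Learned query that pools a variable number of entity embeddings
        # into one vector, which is what lets the network adapt to a
        # changing population during training.
        self.query = nn.Parameter(torch.randn(1, 1, embed_dim))
        self.head = nn.Linear(embed_dim, n_actions)

    def forward(self, entities):
        # entities: (batch, n_entities, entity_dim); n_entities may vary.
        h = torch.relu(self.encoder(entities))
        q = self.query.expand(h.size(0), -1, -1)
        pooled, _ = self.attn(q, h, h)       # (batch, 1, embed_dim)
        return self.head(pooled.squeeze(1))  # action logits
```

Since the paper says parameters are not shared, the critic would be a separate instance of a similar module (with the action included in its input), not a reuse of these weights.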
I got my first ALL DONES (i.e. the last agent arrived at its plate).
cc: @jonasbarth
Even after a bigger run, the agents don't learn. According to pressureplate, the reward is in [-0.9, 0] if the agent is in the same room as its assigned plate and in [-N, -1] otherwise. I tried to implement the rendering, but something was going wrong with the pressureplate repo; still, we can tell where an agent is from its reward, and in a big run the agents stay stuck in the first room. It might be an incorrect implementation on our side, or an issue at the theory level, since the actions are discrete and the Gumbel-Softmax may not be enough to handle them. We should investigate further.
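To make the Gumbel-Softmax part concrete, here is roughly what I mean; a sketch assuming PyTorch's built-in `F.gumbel_softmax` (the helper name and shapes below are mine, not the repo's API):

```python
import torch
import torch.nn.functional as F

def sample_discrete_action(logits, tau=1.0):
    # Straight-through Gumbel-Softmax: the forward pass yields a one-hot
    # action, while gradients flow through the soft relaxation (hard=True).
    one_hot = F.gumbel_softmax(logits, tau=tau, hard=True)
    return one_hot, one_hot.argmax(dim=-1)

# Example: 2 agents, 5 discrete actions each.
logits = torch.randn(2, 5, requires_grad=True)
one_hot, actions = sample_discrete_action(logits, tau=1.0)
```

If the relaxation is the problem, the temperature `tau` is one thing to check: a fixed low `tau` makes sampling near-deterministic and kills exploration early on, so it is usually annealed from around 1.0 downward over training rather than held constant.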