mimoralea / gdrl

Grokking Deep Reinforcement Learning
https://www.manning.com/books/grokking-deep-reinforcement-learning
BSD 3-Clause "New" or "Revised" License

Chapter 9 load experiences issue #19

Closed SubaruSpirit closed 2 years ago

SubaruSpirit commented 2 years ago

In your DQN class's optimize_model method, you unpack the experiences in the following order:

states, actions, rewards, next_states, is_terminals = experiences

However, in the FCQ class's load method, you unpack them in this order:

states, actions, new_states, rewards, is_terminals = experiences

So new_states and rewards are mixed up. My advice is to swap new_states and rewards in the FCQ class's load method to keep it consistent with the rest of the code. I found this while trying to inspect new_states, only to discover that my new_states were actually rewards, which is how I noticed the inconsistency.
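For clarity, the swap I'm suggesting would look roughly like this (just a sketch of the reordered unpacking, keeping the rest of the method as it appears in the book's code):

    def load(self, experiences):
        # unpack in the same order that optimize_model expects:
        # states, actions, rewards, next_states, is_terminals
        states, actions, rewards, new_states, is_terminals = experiences
        states = torch.from_numpy(states).float().to(self.device)
        actions = torch.from_numpy(actions).long().to(self.device)
        rewards = torch.from_numpy(rewards).float().to(self.device)
        new_states = torch.from_numpy(new_states).float().to(self.device)
        is_terminals = torch.from_numpy(is_terminals).float().to(self.device)
        return states, actions, rewards, new_states, is_terminals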

mimoralea commented 2 years ago

Great find. Will fix this weekend.

mimoralea commented 2 years ago

The super interesting thing about RL is: why in the world does this agent still learn, then?

Outstanding catch, BTW.

SubaruSpirit commented 2 years ago

Wow, I never expected you to reply. I'll leave a good review for your book now, since it's valuable for readers to see that issues get addressed when something in the book is wrong or fails. I do have one complaint about the Kindle version, though: the code is rendered in a strange format that is very hard to read.

Back to the question. I'm honestly surprised this agent learns. I actually implemented it in the MountainCar environment with your code as the base, and it works as well (though I had to change the reward so the agent doesn't get stuck at the bottom of the valley). Later, when I tried DQN on MountainCarContinuous-v0, it threw an error, and while tracking down what went wrong I spotted this mismatch. The funny part is that the mismatch is not the cause of that error, and my agent actually learns well in MountainCar-v0 even with it. It would be great if you could figure out why it still works despite the mismatch; I'll post an update here if I find the reason as well.
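In case it's useful to anyone reading, the kind of reward change I mean looks roughly like this (an illustrative sketch only, not my exact modification, and it assumes the old gym step API that the book's code uses):

    import gym

    env = gym.make('MountainCar-v0')
    state = env.reset()
    done = False
    while not done:
        action = env.action_space.sample()  # stand-in for the agent's epsilon-greedy action
        next_state, reward, done, _ = env.step(action)
        # MountainCar's observation is [position, velocity]; a bonus proportional
        # to speed rewards building momentum instead of idling at the bottom.
        shaped_reward = reward + 10.0 * abs(next_state[1])
        # shaped_reward (not the raw reward) is what goes into the replay buffer
        state = next_state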

mimoralea commented 2 years ago

Okay, so I think it makes sense. It learns because I do nothing in that function other than load the experiences into the device (either CPU or GPU).

    def load(self, experiences):
        states, actions, new_states, rewards, is_terminals = experiences
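        # NOTE: with this ordering, 'new_states' actually receives the rewards
        # array and 'rewards' receives the next states; only the names are off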
        states = torch.from_numpy(states).float().to(self.device)
        actions = torch.from_numpy(actions).long().to(self.device)
        new_states = torch.from_numpy(new_states).float().to(self.device)
        rewards = torch.from_numpy(rewards).float().to(self.device)
        is_terminals = torch.from_numpy(is_terminals).float().to(self.device)
        return states, actions, new_states, rewards, is_terminals

So, in other words, I name the rewards "new_states" and the new_states "rewards," move them onto the device (which doesn't care what each tensor is called), and finally return the variables in the same positional order they came in. It also helps that both mixed-up columns get the .float() cast, so even the dtype conversions still line up; only actions, which sits in the correct position, gets .long(). Later, in the optimize_model method, we unpack them correctly: states, actions, rewards, next_states, is_terminals = experiences.
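A toy way to see the same thing (not the book's code, just a demonstration that only the positional order matters here):

    def mislabeled_passthrough(experiences):
        # the two middle names are swapped on purpose, mimicking the bug
        states, actions, new_states, rewards, is_terminals = experiences
        # ...device transfers would happen here, and they are name-agnostic...
        return states, actions, new_states, rewards, is_terminals  # same positions go back out

    batch = ('s', 'a', 'r', "s'", 'd')
    states, actions, rewards, next_states, is_terminals = mislabeled_passthrough(batch)
    assert rewards == 'r' and next_states == "s'"  # the caller still gets the right values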

Right?

Either way, I will clean that up to avoid confusion.

SubaruSpirit commented 2 years ago

That's correct! Great find :) Thanks for fixing it

mimoralea commented 2 years ago

Thanks to you for reporting the issue. Will merge the fix soon!