jsphon closed 7 years ago
The defect is here:
for action in range(num_actions):
    next_int_ext_state = self.rl_system.model.apply_action(int_ext_state, action)
    reward = self.rl_system.reward_function(int_ext_state, action, next_int_ext_state)
    targets[action] = self.get_target(next_int_ext_state, action, reward)
The problem is that the next state has a changed internal state, but the reward function only looks at the final state. So we need to fix the reward function.
The symptom is that the ant world example doesn't work.
Make some tests for it.
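A rough sketch of what such a test could look like, under one possible reading of the bug: a reward that only inspects the next state cannot tell whether the transition itself changed the internal state. Everything here is a hypothetical stand-in (`IntExtState`, `apply_action`, `final_state_only_reward`, `transition_reward`); none of it is taken from the real model or reward function, which would need to be substituted in.

```python
from dataclasses import dataclass


# Hypothetical stand-in for the real state class; the field meanings are assumed.
@dataclass(frozen=True)
class IntExtState:
    internal: int  # e.g. 1 if the ant is carrying food, else 0
    external: int  # e.g. the ant's position


def apply_action(state, action):
    """Toy model: action 0 moves right, action 1 toggles the internal flag."""
    if action == 0:
        return IntExtState(state.internal, state.external + 1)
    return IntExtState(1 - state.internal, state.external)


def final_state_only_reward(state, action, next_state):
    """Looks only at next_state, so it cannot see what the transition changed."""
    return float(next_state.internal)


def transition_reward(state, action, next_state):
    """Rewards the change in internal state caused by the transition."""
    return float(next_state.internal - state.internal)


def test_reward_depends_on_transition():
    carrying = IntExtState(internal=1, external=0)
    moved = apply_action(carrying, 0)
    # Moving while already carrying changes nothing internally,
    # so a transition-aware reward pays nothing...
    assert transition_reward(carrying, 0, moved) == 0.0
    # ...whereas a final-state-only reward still pays out.
    assert final_state_only_reward(carrying, 0, moved) == 1.0

    empty = IntExtState(internal=0, external=0)
    picked = apply_action(empty, 1)
    # The transition that actually changes the internal state is rewarded.
    assert transition_reward(empty, 1, picked) == 1.0
```

The same shape of test should work against the real `reward_function(int_ext_state, action, next_int_ext_state)`, once a concrete ant world state where the two readings disagree has been picked.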