Open blumu opened 2 years ago
@sherdencooper These are great points. Regarding 1: that's correct, clipping negative rewards was a recommendation made by an RL researcher who advised the project, and it did help with learning, as you pointed out. That said, your proposed changes make sense and we'd be happy to integrate them if you submit a PR. Question: you are seeing a clear improvement for the epsilon-greedy agent with your changes; do you also see an improvement for the D-QL agent?
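To make the point being discussed concrete, here is a minimal sketch of the two reward treatments. The function name `shape_reward` and the `clip_negative` flag are purely illustrative and are not part of CyberBattleSim's actual code:

```python
def shape_reward(raw_reward: float, clip_negative: bool = True) -> float:
    """Illustrative only: contrast the two reward treatments under discussion."""
    if clip_negative:
        # Original design: negative rewards are clipped to zero, so the
        # agent never sees a penalty signal for costly actions.
        return max(raw_reward, 0.0)
    # Proposed change: pass the raw (possibly negative) reward through,
    # letting penalties shape the learned policy.
    return raw_reward
```

The sketch is only meant to anchor the discussion; the actual reward logic lives in the CyberBattleSim environment code.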
In my experiment, I trained an agent with the original reward design in the chain environment. The agent perfectly takes ownership of the network during training, but when I saved the model and evaluated it with epsilon-greedy, the success rate was only about 90%. When I patched the two points I proposed above and trained an agent with the same parameters, the evaluation success rate was about 100%. I think the original reward design makes the agent overfit.
Could you please take a look at the two points and give some feedback? Anyway, thanks again for your code; it helps with my research, and I would even like to use it in my next research project on online learning :)
Originally posted by @sherdencooper in https://github.com/microsoft/CyberBattleSim/issues/46#issuecomment-1136458981
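For readers unfamiliar with the evaluation setup described in the quoted comment, here is a minimal sketch of an epsilon-greedy evaluation loop. It assumes a gym-style environment and a hypothetical `agent.best_action(obs)` helper; neither the names nor the success criterion are taken from CyberBattleSim's actual API:

```python
import random

def evaluate_epsilon_greedy(env, agent, episodes=100, epsilon=0.05, max_steps=2000):
    """Run a trained agent with a small exploration rate and report the
    fraction of episodes in which it reaches the goal (e.g. owns the network)."""
    successes = 0
    for _ in range(episodes):
        obs = env.reset()
        for _ in range(max_steps):
            if random.random() < epsilon:
                action = env.action_space.sample()  # occasional random exploration
            else:
                action = agent.best_action(obs)     # hypothetical greedy-policy helper
            obs, _, done, _ = env.step(action)
            if done:
                # Assumption: the episode terminates early only when the
                # attacker has taken ownership of the full network.
                successes += 1
                break
    return successes / episodes
```

A success-rate gap like the 90% vs. 100% reported above would show up directly in the fraction this loop returns.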