Closed: nosound2 closed this issue 2 years ago
No good reason. I've changed it in my personal training scripts to not square the reward like this. I will fix this logic to be a better example. The reward function should also be a delta to the reward, not the total reward of the current state.
I was under the false impression that if I didn't do that, I would be rewarding fewer units/cities. That reward logic gives a reward only at the start of each turn; as the number of actions per turn grows with more units and cities, the reward becomes sparser, meaning less reward on average per action. I think what matters is the mean episode total reward instead, so it doesn't need this scaling.
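A minimal sketch of the delta-style reward described above. All names here (`DeltaReward`, `score`, the weights) are hypothetical illustrations, not the actual code in the repo:

```python
class DeltaReward:
    """Reward the change in a state evaluation rather than its absolute value.

    Returning the raw (or squared) state score every turn re-rewards progress
    that was already rewarded on earlier turns; returning the delta makes the
    episode's total reward sum to the final state score.
    """

    def __init__(self):
        self.prev_score = 0.0

    def score(self, unit_count, city_tile_count):
        # Hypothetical state evaluation: a weighted count, with no squaring.
        return unit_count * 0.5 + city_tile_count * 1.0

    def get_reward(self, unit_count, city_tile_count):
        current = self.score(unit_count, city_tile_count)
        reward = current - self.prev_score  # delta to the reward, not the total
        self.prev_score = current
        return reward
```

With this shape, a turn where nothing changes yields zero reward instead of repeatedly paying out the full state score.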
Looking forward to seeing how the current run does (as of right now the Kaggle notebook indicates that it's running with the updated reward function).
Thinking about your comment and the initial question: the example code is more of a state evaluation used as a reward than a direct reward for the turn's actions.
With that reward function, a ~12.0 episode mean reward corresponds to roughly an ~800 public score when inference is set to `deterministic=False`. Honestly, I've been spending a ton of time on vision CNN RL designs, but that super-basic example miner is still doing better. I wonder if the way to go is instead to expand on a simple design like that with advanced action spaces and customized non-vision observations. Still continuing down the vision track for now, just some food for thought!
An argument in favor of the CNN approach is the success of the imitation code that has been posted. I'm going down that path too at the moment.
@glmcdona I feel so frustrated right now. I thought I made some clever changes to `agent_policy`, but I forgot to use all the "clever changes" during the long training. So it's good, but thanks to a double mistake or something. Let's call it `v1`. `v2` largely outperforms `v1` (win rate > 60%), but on the leaderboard `v1` is still above. I'm sure you both had similar issues, sorry for venting :-)
@royerk, I can only dream of having your problems; my best bot is doing 540 =)
Probably a >60% win rate is not yet big enough to be reflected in the LB quickly. I saw a discussion on the forum where some people complain that their agents don't reach their expected scores for days. Anyway, good luck, don't look at the LB too much ;)
I added an update to the example training script to the new reward function in #86, along with a couple other small updates.
@royerk, a mean reward of ~16 is really good. Did you set it to deterministic or non-deterministic inference? My best one, at around a reward of 12 mapping to an 800 rank, is running with `deterministic=False`. You may want to check that?
It's set to `False`; I adopted that since your comment in a PR (thanks again).
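For context on what the flag changes: with a categorical policy, deterministic inference takes the argmax of the action distribution, while `deterministic=False` samples from it (in Stable-Baselines3 this is the `deterministic` argument to `model.predict`). A self-contained sketch, with illustrative names only:

```python
import numpy as np

def select_action(action_probs, deterministic):
    """Pick an action from a categorical policy distribution.

    deterministic=True  -> always take the highest-probability action.
    deterministic=False -> sample from the distribution, which keeps some
    exploration-style randomness at inference time and can change how an
    agent scores on the leaderboard.
    """
    action_probs = np.asarray(action_probs, dtype=float)
    if deterministic:
        return int(np.argmax(action_probs))
    rng = np.random.default_rng()
    return int(rng.choice(len(action_probs), p=action_probs))
```

Sampling can outperform the argmax when the policy's top action is only marginally preferred, which may explain the gap the posters are seeing between the two modes.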
In the `get_reward` function the calculation is this:
What is the reason for having it squared?