Closed: nosound2 closed this issue 2 years ago
No good reason. I've changed it in my personal training scripts to not square the reward like this. I will fix this logic to be a better example. The reward function should also be a delta to the reward, not the total reward of the current state.
I was under the false impression that if I didn't do that, I would be rewarding fewer units/cities. That reward logic gives a reward only at the start of each turn; as the number of actions per turn grows with more units and cities, the reward becomes sparser, meaning less reward on average per action. I think what matters is the mean episode total reward instead, so it doesn't need this scaling.
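A minimal sketch of the delta-style reward described above. All names here (`DeltaReward`, `score`, the weights) are hypothetical illustrations, not the actual code in the repo:

```python
class DeltaReward:
    """Reward the change in a state evaluation rather than its absolute value.

    Returning the raw (or squared) state score every turn re-rewards progress
    that was already rewarded on earlier turns; returning the delta makes the
    episode's total reward sum to the final state score.
    """

    def __init__(self):
        self.prev_score = 0.0

    def score(self, unit_count, city_tile_count):
        # Hypothetical state evaluation: a weighted count, with no squaring.
        return unit_count * 0.5 + city_tile_count * 1.0

    def get_reward(self, unit_count, city_tile_count):
        current = self.score(unit_count, city_tile_count)
        reward = current - self.prev_score  # delta to the reward, not the total
        self.prev_score = current
        return reward
```

With this shape, a turn where nothing changes yields zero reward instead of repeatedly paying out the full state score.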
Looking forward to seeing how the current run does (as of right now the Kaggle notebook indicates that it's running with the updated reward function).
Thinking about your comment and the initial question: the example code is more of a state evaluation used as a reward than a direct reward for the turn's actions.
With that reward function, a ~12.0 episode mean reward corresponds to roughly an ~800 public score when inference is set to `deterministic=False`. Honestly, I've been spending a ton of time on vision CNN RL designs, but that super-basic example miner is still doing better. I wonder if the way to go is instead to expand on a simple design like that with advanced action spaces and customized non-vision observations. Still continuing down the vision track for now, just some food for thought!
An argument in favor of the CNN approach is the success of the imitation code that has been posted. I'm going down that path too at the moment.
@glmcdona I feel so frustrated right now. I thought I made some clever changes to `agent_policy`, but I forgot to use all the "clever changes" during the long training. So it's good, but thanks to a double mistake or something. Let's call it `v1`. `v2` largely outperforms `v1` (win rate > 60%), but on the leaderboard `v1` is still above. I'm sure you both had similar issues, sorry for venting :-)
@royerk, I can only dream of having your problems; my best bot is doing 540 =)
Probably a >60% win rate is not yet big enough to be reflected in the LB quickly. I saw a discussion on the forum where some people complain that their agents don't reach their expected scores for days. Anyway, good luck, don't look at the LB too much ;)
I added an update to the example training script to the new reward function in #86, along with a couple other small updates.
@royerk, a mean reward of ~16 is really good. Did you set it to deterministic or non-deterministic inference? My best one, at around a reward of 12 mapping to an 800 rank, is running with `deterministic=False`. You may want to check that?
It's set to `False`; I adopted that since your comment in a PR (thanks again).
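For context on what the flag changes: with a categorical policy, deterministic inference takes the argmax of the action distribution, while `deterministic=False` samples from it (in Stable-Baselines3 this is the `deterministic` argument to `model.predict`). A self-contained sketch, with illustrative names only:

```python
import numpy as np

def select_action(action_probs, deterministic):
    """Pick an action from a categorical policy distribution.

    deterministic=True  -> always take the highest-probability action.
    deterministic=False -> sample from the distribution, which keeps some
    exploration-style randomness at inference time and can change how an
    agent scores on the leaderboard.
    """
    action_probs = np.asarray(action_probs, dtype=float)
    if deterministic:
        return int(np.argmax(action_probs))
    rng = np.random.default_rng()
    return int(rng.choice(len(action_probs), p=action_probs))
```

Sampling can outperform the argmax when the policy's top action is only marginally preferred, which may explain the gap the posters are seeing between the two modes.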
In the `get_reward` function the calculation is this:
What is the reason for having it squared?