glmcdona / LuxPythonEnvGym

Matching python environment code for Lux AI 2021 Kaggle competition, and a gym interface for RL models.
MIT License

[Discussion] The learning design #88

Open · nosound2 opened this issue 2 years ago

nosound2 commented 2 years ago

I have been thinking about the learning design implemented here, and there are two questions I can't resolve for myself. The core function for learning is the environment step function. The chain of learning is [OBS_UNIT1 -> ACTION1 -> REWARD -> OBS_UNIT2 -> ACTION2 -> OBS_UNIT3 -> ACTION3 ... -> ALL TURN ACTIONS ARE ACTUALLY TAKEN] -> [THE SAME FOR THE NEXT TURN ...]. The questions are:

  1. Less important. Only the first action of each turn gets the reward. Doesn't that create significant problems, especially when the number of units per turn is big? Especially if the discount factor gamma is small, but also in general, since even this intermediate reward is delayed for most actions. I wonder how much harder this makes life for the model. One thing: the order in which the units act can matter. I can imagine that the model can handle it, but is there an example of a multi-unit problem designed like this?

  2. More important. Algorithms like TD(0) and Q-learning, and more involved ones like PPO, all base the model update not only on the current state (or state-action pair) but also on the next one. But the next step belongs to a different unit: its observation is unit-dependent, and its value function is completely different and barely related. The process is basically not Markovian; the states carry heavily incomplete information, and different incomplete information each time. Isn't that a no-go? Or am I misunderstanding something major?

Please share your thoughts!
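To make the stepping scheme above concrete: a toy, self-contained sketch of "one gym step per unit, with all queued actions applied at the turn boundary". This is purely illustrative (the class and attribute names are made up; the real LuxEnvironment step differs in detail, including exactly where the turn reward lands, which is what question 1 is about):

import random

class ToyPerUnitEnv:
    """One env.step() call per unit; all queued actions are applied together
    when the turn rolls over, and the turn-level reward is only returned on
    the first step of the following turn."""

    def __init__(self, units_per_turn=3, turns=3):
        self.units_per_turn = units_per_turn
        self.turns = turns
        self.turn = 0
        self.unit_index = 0
        self.queued_actions = []
        self.pending_reward = 0.0

    def step(self, action):
        # Only the first action of a turn sees the previous turn's reward.
        reward = self.pending_reward if self.unit_index == 0 else 0.0
        self.pending_reward = 0.0

        # Queue the chosen action for the unit observed at the previous step.
        self.queued_actions.append(action)
        self.unit_index += 1

        if self.unit_index == self.units_per_turn:
            # All of this turn's actions are actually taken now; the resulting
            # reward is held until the next turn starts.
            self.pending_reward = float(sum(self.queued_actions))  # dummy reward
            self.queued_actions = []
            self.unit_index = 0
            self.turn += 1

        obs = (self.turn, self.unit_index)  # unit-dependent observation
        done = self.turn >= self.turns
        return obs, reward, done, {}

env = ToyPerUnitEnv()
done = False
while not done:
    obs, reward, done, info = env.step(random.randint(0, 1))
    print(obs, reward, done)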

glmcdona commented 2 years ago

Some personal thoughts (I am not an expert on this):

  1. Personally I think this part is OK because of the large gamma. In many cases it's quite common for the reward to arrive many steps after the action that caused it. The default gamma is 0.995, so if my understanding is correct an action will receive a 0.995^100 ≈ 0.61 factor of a reward given 100 steps later. This matches up pretty closely with the OpenAI Five setup, where micro-rewards are sparse and gammas are large: https://openai.com/blog/openai-five/
  2. In the case of PPO there are two learned heads on top of the shared network. The actor head scores each possible action given the current state; I think that part makes sense for our scenario, since it takes the current unit's observation and estimates how each action will affect the discounted reward. But the advantage calculation also relies on a critic head that predicts just the value of a state. This state value function is where, like you said, I think it's more of a problem: it compares the value of the current state against the next step's state, and extends that into the future with the gamma discount factor. Part of the example agent's observation purposely included values like num_units, num_cities, etc. to help with this state value estimate, but those only change each turn (not each action), like you said.

     So let's say we had an action that was good. Its advantage is approximately the discounted sum of one-step TD errors, something like `(reward + gamma*value(state 1 step ahead) - value(current state)) + gamma*(reward + gamma*value(state 2 steps ahead) - value(state 1 step ahead)) + gamma^2*(...) + ...`. Imagine the action was building a city, and we get the reward for building the city ~4 steps later once the next turn starts: it will still be included in the advantage calculation of that action despite being a bunch of states earlier.

     So maybe the key here is just to make sure to include lots of good game-level observations alongside the unit-level observations, so that the state value critic can perform better? I had been hoping my CNN experiments, where the whole map drives the observation, would work better, since they can fully observe the state of the game at each decision; unfortunately I couldn't get them to outperform the simple models. Also worth noting: the OpenAI Five model had incomplete state information, but not the swapping of the observed unit at each step (they had one head of the model per Dota hero, and mapping that to our problem would be very hard since we have a variable number of heads to attach and detach as units get created and die).
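For reference, that discounted sum of one-step TD errors is essentially generalized advantage estimation (GAE), which stable-baselines3 computes internally in its rollout buffer. A minimal self-contained sketch, with illustrative array names (gamma matches the default mentioned above; gae_lambda is SB3's usual default):

import numpy as np

def gae_advantages(rewards, values, last_value, dones, gamma=0.995, gae_lambda=0.95):
    """Generalized advantage estimation over one collected rollout.

    rewards, values, dones: per-step arrays from the rollout.
    last_value: critic estimate for the state following the final step.
    """
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = last_value if t == len(rewards) - 1 else values[t + 1]
        next_non_terminal = 1.0 - dones[t]
        # One-step TD error: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value * next_non_terminal - values[t]
        # A_t = delta_t + gamma * lambda * A_{t+1}
        gae = delta + gamma * gae_lambda * next_non_terminal * gae
        advantages[t] = gae
    return advantages

With gae_lambda = 1 this reduces exactly to the telescoping sum written above, so a reward that only arrives a few states after the action still flows back into that action's advantage.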
nosound2 commented 2 years ago

Hi @glmcdona, thanks for the feedback. Regarding 1, I agree that it should be OK. On the second point, including more global game observations is an interesting idea. It seems too risky though; I wouldn't want to try to make it work.

The CNN approach is what I also want to test. Have you just used CnnPolicy, or something else? I want to take the network from that famous imitation learning CNN notebook and plug it in instead. If I checked correctly, CnnPolicy is only 4 CNN layers. Additionally, I want the value part to get the same input per turn, and to give the additional input (unit location) only to the action part. I will let you know if it works for me!

glmcdona commented 2 years ago

@nosound2. Yeah, I think the built-in default CnnPolicy isn't a good fit. You can define your own layers: https://stable-baselines3.readthedocs.io/en/master/guide/custom_policy.html
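As a rough sketch of what that docs page describes, a custom feature extractor plugs in like this (the class name, layer sizes, and the channels-first Box map observation are illustrative assumptions, not the notebook's actual setup):

import gym
import torch as th
import torch.nn as nn
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor

class MapCnnExtractor(BaseFeaturesExtractor):
    """Small CNN over a channels-first Box observation of map layers."""

    def __init__(self, observation_space: gym.spaces.Box, features_dim: int = 128):
        super().__init__(observation_space, features_dim)
        n_channels = observation_space.shape[0]
        self.cnn = nn.Sequential(
            nn.Conv2d(n_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.linear = nn.Sequential(nn.Linear(64, features_dim), nn.ReLU())

    def forward(self, observations: th.Tensor) -> th.Tensor:
        return self.linear(self.cnn(observations))

# Wire it in through policy_kwargs, e.g.:
# policy_kwargs = dict(features_extractor_class=MapCnnExtractor,
#                      features_extractor_kwargs=dict(features_dim=128))
# model = PPO("CnnPolicy", env, policy_kwargs=policy_kwargs)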

I just shared an example notebook with you on Kaggle that I've been using. I've tried a few architectures, and the latest one is inspired by that imitation learning notebook's model layout. Note that although it did work, it didn't reach as high a reward as the simple non-CNN example in this repo. I'm personally working on implementing a solution more similar to the OpenAI Five observation setup now.

nosound2 commented 2 years ago

OK, very interesting, I am reading your notebook now. Just a small remark: I believe technically this is called "private sharing", which is not allowed. Let's refrain from it in the future (as long as we are not on the same team!).

glmcdona commented 2 years ago

Oh wow, I didn't know we couldn't share code with each other if we aren't on the same team! Thanks for the heads up.

I'll get a proper run of that notebook done and share it publicly.

nosound2 commented 2 years ago

A few comments on that notebook:

  1. Good that you didn't end up using compressed_map_observation ;). It is probably not a good idea, because of what we discussed.
  2. I like how you build the observation layers.
  3. The CNN model that you built is very strange: three nn.Conv2d layers in a row without activations in between, no batch norm, two max pools with neither of them at the end, and no skip connections. It is far from any design I know.
  4. It is nice how you allow passing different types of observations; I will try to use that too. But do you use only self.obs['map']? For example, all the global arguments, like night/day etc., could be concatenated to the output of the CNN instead of being added as separate input layers. It seems like everything is ready for this too.
royerk commented 2 years ago

Are you on the competition Discord server, @nosound2?

Regarding the architecture and whether or not to use skip/residual elements: the current "miner state" has ~100 values (order of magnitude), while the output of a CNN feature extractor is likely to be >10k values. Fancy architectures are great, but the training time (and hyperparameter selection) quickly gets out of hand (at least in my attempts).

I'm currently trying to inject as much human knowledge as is reasonable into the observation, to reduce what has to be learned from scratch and improve training speed.

glmcdona commented 2 years ago
> 3. The CNN model that you built is very strange: three nn.Conv2d layers in a row without activations in between, no batch norm, two max pools with neither of them at the end, and no skip connections. It is far from any design I know.

This is similar to a basic VGG16 model architecture, though it looks like it should run a ReLU after every 3x3 conv, e.g.: https://neurohive.io/en/popular-networks/vgg16/
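In other words, roughly this pattern (channel counts are illustrative, not the notebook's exact layers):

import torch.nn as nn

# A ReLU after every 3x3 convolution, with pooling between blocks, VGG-style.
vgg_style_block = nn.Sequential(
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
)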

> 4. It is nice how you allow passing different types of observations; I will try to use that too. But do you use only self.obs['map']? For example, all the global arguments, like night/day etc., could be concatenated to the output of the CNN instead of being added as separate input layers. It seems like everything is ready for this too.

Yup, you are describing an earlier version of that notebook! I modified it to incorporate everything into the CNN input layers to more closely match the imitation learning setup, in case it helped. The original design added them at the FC layer instead of as extra layers of the CNN input.
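For anyone who wants to try the add-the-globals-at-the-FC-layer variant discussed here, it would look roughly like this in stable-baselines3 terms. The Dict observation keys ('map', 'globals'), the layer sizes, and the class name are assumptions for illustration, not the notebook's actual names:

import gym
import torch as th
import torch.nn as nn
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor

class MapPlusGlobalsExtractor(BaseFeaturesExtractor):
    """CNN over the map layers, with the per-turn global scalars (day/night,
    research, unit counts, ...) concatenated at the fully connected layer."""

    def __init__(self, observation_space: gym.spaces.Dict, features_dim: int = 128):
        super().__init__(observation_space, features_dim)
        map_channels = observation_space["map"].shape[0]
        globals_dim = observation_space["globals"].shape[0]
        self.cnn = nn.Sequential(
            nn.Conv2d(map_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.fc = nn.Sequential(nn.Linear(64 + globals_dim, features_dim), nn.ReLU())

    def forward(self, obs) -> th.Tensor:
        map_features = self.cnn(obs["map"])
        return self.fc(th.cat([map_features, obs["globals"]], dim=1))

# Dict observation spaces go with the multi-input policy, e.g.:
# model = PPO("MultiInputPolicy", env,
#             policy_kwargs=dict(features_extractor_class=MapPlusGlobalsExtractor))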

glmcdona commented 2 years ago

Here is the example notebook, shared publicly now: https://www.kaggle.com/glmcdona/python-environment-ppo-cnn-rl-example

Note that for Kaggle submission, main_lux-ai-2021.py needs to be edited to specify the feature extractor in the model load operation, e.g. something like this:

from agent_policy import AgentPolicy, CustomCombinedExtractor
...
policy_kwargs = dict(
    features_extractor_class=CustomCombinedExtractor
)
model = PPO.load("model.zip", policy_kwargs=policy_kwargs)
goforks12 commented 2 years ago

The default MLP policy only has 4 layers: 2 layers of 64 for the actor and 2 for the critic.

The CnnPolicy only works well on images. The API gives us all of the information without any of the noise. A CNN approach would never be able to determine if there were multiple workers on a city tile, for example.

goforks12 commented 2 years ago

Geoff, btw, do you have any idea how to get rid of the runtime stacking error? At around 40-50 million steps, too many of the games stop early because the model hasn't quite learned to save fuel during the night, and this causes errors if too many games end early.

glmcdona commented 2 years ago

Not sure what would cause this. Do you have a copy of the error by any chance? Is it a memory leak or an out-of-memory error?

royerk commented 2 years ago

Fun fact: [image]

All of them have the same reward function. I still have to benchmark them.

goforks12 commented 2 years ago

Process SpawnProcess-32:
Traceback (most recent call last):
  File "C:\Users\18176\anaconda3\envs\pythree\lib\multiprocessing\process.py", line 297, in _bootstrap
    self.run()
  File "C:\Users\18176\anaconda3\envs\pythree\lib\multiprocessing\process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\18176\Desktop\luxlux20\examples\stable_baselines3\common\vec_env\subproc_vec_env.py", line 29, in _worker
    observation, reward, done, info = env.step(data)
  File "C:\Users\18176\Desktop\luxlux20\examples\luxai2021\env\lux_env.py", line 64, in step
    obs = self.learning_agent.get_observation(self.game, unit, city_tile, team, is_new_turn)
  File "C:\Users\18176\Desktop\luxlux20\examples\agent_policy.py", line 369, in get_observation
    c = game.cities[game.map.get_cell_by_pos(closest_position).city_tile.city_id]
AttributeError: 'NoneType' object has no attribute 'city_id'

Traceback (most recent call last):
  File "C:\Users\18176\anaconda3\envs\pythree\lib\multiprocessing\connection.py", line 312, in _recv_bytes
    nread, err = ov.GetOverlappedResult(True)
BrokenPipeError: [WinError 109] The pipe has been ended

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "Other_train.py", line 191, in <module>
    train(local_args)
  File "Other_train.py", line 163, in train
    model.learn(total_timesteps=args.step_count, reset_num_timesteps=True)
  File "C:\Users\18176\Desktop\luxlux20\examples\stable_baselines3\ppo\ppo.py", line 310, in learn
    reset_num_timesteps=reset_num_timesteps,
  File "C:\Users\18176\Desktop\luxlux20\examples\stable_baselines3\common\on_policy_algorithm.py", line 237, in learn
    continue_training = self.collect_rollouts(self.env, callback, self.rollout_buffer, n_rollout_steps=self.n_steps)
  File "C:\Users\18176\Desktop\luxlux20\examples\stable_baselines3\common\on_policy_algorithm.py", line 178, in collect_rollouts
    new_obs, rewards, dones, infos = env.step(clipped_actions)
  File "C:\Users\18176\Desktop\luxlux20\examples\stable_baselines3\common\vec_env\base_vec_env.py", line 162, in step
    return self.step_wait()
  File "C:\Users\18176\Desktop\luxlux20\examples\stable_baselines3\common\vec_env\subproc_vec_env.py", line 120, in step_wait
    results = [remote.recv() for remote in self.remotes]
  File "C:\Users\18176\Desktop\luxlux20\examples\stable_baselines3\common\vec_env\subproc_vec_env.py", line 120, in <listcomp>
    results = [remote.recv() for remote in self.remotes]
  File "C:\Users\18176\anaconda3\envs\pythree\lib\multiprocessing\connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "C:\Users\18176\anaconda3\envs\pythree\lib\multiprocessing\connection.py", line 321, in _recv_bytes
    raise EOFError

goforks12 commented 2 years ago

Hate to nag, but the command for recording gameplay does not seem to work, and the newly updated files don't run on the Kaggle server for submissions.

nosound2 commented 2 years ago

Hi @goforks12, is this a different issue now? If so, can you please open a separate issue for each problem? Also, more details on the second problem would be helpful, I think.

nosound2 commented 2 years ago

It seems to be a problem in your custom code, in this line specifically: `c = game.cities[game.map.get_cell_by_pos(closest_position).city_tile.city_id]`. Are there any changes in the agent that you run, compared with the git version?

goforks12 commented 2 years ago

I didn't mess with the game engine at all; I didn't change anything within the LuxAI computations. I was, however, using 16 CPU cores, and the MLP I was training had much larger layers.

goforks12 commented 2 years ago

lux-ai-2021 --seed=100 ./kaggle_submissions/main_lux-ai-2021.py ./kaggle_submissions/main_lux-ai-2021.py --maxtime 100000

I try to run this command in bash with my Model.zip and my Agent_policy.py in the kaggle_submissions folder. Should lux-ai-2021 be a Python file? Or is it the folder we cd into to run the evaluation?

glmcdona commented 2 years ago

lux-ai-2021 is a command added by the official Lux AI repo; check out the installation instructions here if the command isn't found in your environment: https://github.com/Lux-AI-Challenge/Lux-Design-2021

glmcdona commented 2 years ago

If you didn't modify agent_policy.py to create your own agent yet, then I suspect there must be a rare game engine bug where the Game.cities list is somehow not accurate, pointing to a City that doesn't actually belong to its cell anymore. I'll have a quick look through the code to see if I can spot anything. As a workaround, you can add a try/except to the get_observation() function in agent_policy.py to ignore and log the errors.
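A minimal sketch of that workaround, around the line from the traceback inside get_observation() (the fallback is an assumption; downstream code would then need to handle c being None, e.g. by writing zeros into the city-related observation entries):

try:
    c = game.cities[game.map.get_cell_by_pos(closest_position).city_tile.city_id]
except AttributeError as e:
    # Rare engine edge case: the closest cell no longer holds a city_tile.
    # Log it and carry on with "no city found" instead of crashing the worker.
    print(f"get_observation() warning, ignoring stale city reference: {e}")
    c = None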

goforks12 commented 2 years ago

I was doing an obscenely long training period. I will use shorter runs now.

glmcdona commented 2 years ago

Here is an example training run from an 'okay' RL personal agent I've built. Notes:

Learning curve for a few batch sizes (n_steps is set to batch_size for each one): [image]

Here are a couple of replay files of the trained agent from the batch_size==10000 run; it's not great: replays.zip. Unzip it and you can view the replays here: https://2021vis.lux-ai.org/