nosound2 opened this issue 3 years ago
Some personal thoughts, I am not an expert on this:
Hi @glmcdona , thanks for the feedback. Regarding 1, I agree that it should be OK. For the second point, you raise an interesting point about including more global game observations. Too risky though, I wouldn't want to try to make it work.
The CNN approach is what I also want to test. Have you just used `CnnPolicy` or something else? I want to take the network from that famous imitation learning CNN notebook and plug it in instead. If I checked correctly, `CnnPolicy` is only 4 CNN layers. Additionally, I want the value part to have the same input per turn, and give the additional input (unit location) only to the action part. I will let you know if it works for me!
@nosound2, yeah, I think the built-in default `CnnPolicy` isn't a good fit. You can define your own layers: https://stable-baselines3.readthedocs.io/en/master/guide/custom_policy.html
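For reference, the general shape of that route looks something like this. It's a minimal sketch following the SB3 custom-policy docs pattern, not the notebook's actual code; the layer sizes and the assumption of a CxHxW `Box` map observation are placeholders:

```python
# Minimal sketch of a custom CNN feature extractor for stable-baselines3,
# following the custom_policy docs linked above. Layer widths and the CxHxW
# map observation are placeholder assumptions, not the notebook's code.
import gym
import torch as th
import torch.nn as nn
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor


class MapCnnExtractor(BaseFeaturesExtractor):
    def __init__(self, observation_space: gym.spaces.Box, features_dim: int = 128):
        super().__init__(observation_space, features_dim)
        n_channels = observation_space.shape[0]
        self.cnn = nn.Sequential(
            nn.Conv2d(n_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        # infer the flattened size by pushing one dummy observation through the CNN
        with th.no_grad():
            sample = th.as_tensor(observation_space.sample()[None]).float()
            n_flatten = self.cnn(sample).shape[1]
        self.linear = nn.Sequential(nn.Linear(n_flatten, features_dim), nn.ReLU())

    def forward(self, observations: th.Tensor) -> th.Tensor:
        return self.linear(self.cnn(observations))


# usage sketch:
# model = PPO("CnnPolicy", env,
#             policy_kwargs=dict(features_extractor_class=MapCnnExtractor,
#                                features_extractor_kwargs=dict(features_dim=128)))
```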
I just shared an example notebook with you on Kaggle that I've been using. I've tried a few architectures, and the latest one is inspired by that imitation learning notebook's model layout. Note that although it did work, it didn't reach as high a reward as the simple non-CNN example in this repo. I'm personally working on implementing a solution more similar to the OpenAI Five observation setup now.
Ok, very interesting, I am reading your notebook now. Just a small remark: I believe this technically counts as "private sharing", which is not allowed. Let's refrain from it in the future (as long as we are not on the same team!).
Oh wow, I didn't know we couldn't share code with each other if we aren't on the same team! Thanks for the heads up.
I'll get a proper run of that notebook done and share it public.
A few comments on that notebook:

- `compressed_map_observation` at the end ;). It is probably not good because of what we discussed.
- The CNN model that you built is very strange: three `nn.Conv2d` layers in a row without activations in between, no batch norm, two max pools (and none of them at the end), no skip connections. It is far from any design that I know.
- It is nice how you allow passing different types of observations, I will try to use it too. But do you use only `self.obs['map']`? For example, all these global arguments, like night/day etc., could be concatenated to the output of the CNN instead of creating a separate layer. It seems like everything is ready for this too.

Are you on the competition Discord server @nosound2?
Regarding the architecture and whether or not to use skip/residual elements: the current "miner state" has ~100 values (order of magnitude), while the output of a CNN feature extractor is likely to be >10k values. Fancy architectures are great, but the training time (and hyperparameter selection) quickly gets out of hand (at least in my attempts).
I'm currently trying to inject as much human knowledge as is reasonable into the observation, to reduce what has to be learned from scratch and improve training speed.
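For example, something along these lines (a toy illustration only; the helper name and the normalization are hypothetical):

```python
# Toy illustration of "injecting human knowledge" into the observation:
# precompute a feature (Manhattan distance to the nearest resource) instead of
# making the network learn it from the raw map. The helper and the /32.0
# normalization are hypothetical, purely for illustration.
def nearest_resource_distance(unit_pos, resource_positions):
    ux, uy = unit_pos
    return min(abs(ux - x) + abs(uy - y) for x, y in resource_positions)

# appended to the per-unit observation vector, roughly normalized by map size
extra_feature = nearest_resource_distance((3, 4), [(0, 0), (5, 5)]) / 32.0
print(extra_feature)  # 0.09375
```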
> The CNN model that you built is very strange. Three `nn.Conv2d` layers in a row without activations in between, no batch norm, two max pools, and none of them is at the end, no skip connections. It is far away from all designs that I know.
This is similar to a basic VGG16 architecture, though it looks like it should run a ReLU after every 3x3 conv, e.g.: https://neurohive.io/en/popular-networks/vgg16/
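Roughly this pattern (channel counts here are arbitrary placeholders, not the notebook's values):

```python
# Sketch of the VGG-style pattern referenced above: a ReLU after every 3x3 conv,
# with pooling between stages. The input channel count and widths are arbitrary.
import torch.nn as nn

vgg_style_block = nn.Sequential(
    nn.Conv2d(17, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
)
```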
> It is nice how you allow passing different types of observations, I will try to use it too. But do you use only `self.obs['map']`? For example all these global arguments, like night/day etc., can be good to concatenate to the output of CNN, instead of creating a separate layer. It seems like everything is ready for this too.
Yup, you are describing an earlier version of that notebook! I modified it to incorporate everything into the CNN layers to more closely match the imitation learning setup in case it helped. The original design had them added at the FC layer instead of adding them as layers to the CNN input.
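For anyone following along, the FC-layer variant looks roughly like this. It's a sketch with a Dict observation space; the keys, sizes, and layer widths are placeholders, not the notebook's actual `CustomCombinedExtractor`:

```python
# Rough sketch of the "concatenate globals at the FC layer" variant: a Dict
# observation with a 'map' image and a 'globals' vector (night/day, turn, etc.),
# where the CNN only sees the map and the globals are concatenated afterwards.
# Keys, sizes, and layer widths are placeholders, not the notebook's exact code.
import gym
import torch as th
import torch.nn as nn
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor


class MapPlusGlobalsExtractor(BaseFeaturesExtractor):
    def __init__(self, observation_space: gym.spaces.Dict, features_dim: int = 256):
        super().__init__(observation_space, features_dim)
        n_channels = observation_space["map"].shape[0]
        n_globals = observation_space["globals"].shape[0]
        self.cnn = nn.Sequential(
            nn.Conv2d(n_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        with th.no_grad():
            sample = th.as_tensor(observation_space["map"].sample()[None]).float()
            n_flatten = self.cnn(sample).shape[1]
        # the globals skip the CNN and are concatenated to its flattened output here
        self.linear = nn.Sequential(nn.Linear(n_flatten + n_globals, features_dim), nn.ReLU())

    def forward(self, observations) -> th.Tensor:
        map_features = self.cnn(observations["map"])
        return self.linear(th.cat([map_features, observations["globals"]], dim=1))
```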
Here is the example notebook shared now: https://www.kaggle.com/glmcdona/python-environment-ppo-cnn-rl-example
Note that for Kaggle submission, main_lux-ai-2021.py needs to be edited to specify the feature extractor in the model load operation, e.g. something like this:
```python
from agent_policy import AgentPolicy, CustomCombinedExtractor
from stable_baselines3 import PPO  # needed for PPO.load, if not already imported
...
policy_kwargs = dict(
    features_extractor_class=CustomCombinedExtractor
)
model = PPO.load("model.zip", policy_kwargs=policy_kwargs)
```
> Have you just used `CnnPolicy` or something else?
The MLP only has 4 layers: 2 layers of 64 for both the actor and the critic.
The `CnnPolicy` only works well on images. The API gives us all of the information without any of the noise. A CNN approach would never be able to determine if there were multiple workers on a city tile, for example.
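For reference, the default actor/critic widths can be overridden via `policy_kwargs` if you want larger layers (the sizes below are just an example, assuming the SB3 1.x `net_arch` format):

```python
# Example of overriding the default 2x64 actor/critic MLP in stable-baselines3
# (SB3 1.x net_arch format); the 256-wide layers are an arbitrary example.
from stable_baselines3 import PPO

policy_kwargs = dict(net_arch=[dict(pi=[256, 256], vf=[256, 256])])
# model = PPO("MlpPolicy", env, policy_kwargs=policy_kwargs)  # env: your training env
```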
Geoff, btw, do you have any idea how to get rid of the runtime stacking error? At around 40-50 million steps, too many of the games stop early because the model hasn't quite learned to save fuel during the night, and this causes errors if too many games end early.
Not sure what would cause this. Do you have a copy of the error by any chance? Is it a memory leak or an out-of-memory error?
Fun fact: all have the same reward function.

- `gamma_0`: higher episode length, lower reward
- `gamma_1`: lower episode length, higher reward

I still have to benchmark them.
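One rough way to think about it: the discount factor sets an effective horizon of about 1/(1-gamma), which is one lens on why runs with different gammas can trade episode length against reward. A toy illustration (values are arbitrary, not the actual gammas used):

```python
# Back-of-the-envelope illustration (not from the repo): the discount factor
# gives an effective horizon of roughly 1/(1-gamma) steps.
for gamma in (0.99, 0.995, 0.999):
    print(f"gamma={gamma}: effective horizon ~ {1.0 / (1.0 - gamma):.0f} steps")
```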
```
Process SpawnProcess-32:
Traceback (most recent call last):
  File "C:\Users\18176\anaconda3\envs\pythree\lib\multiprocessing\process.py", line 297, in _bootstrap
    self.run()
  File "C:\Users\18176\anaconda3\envs\pythree\lib\multiprocessing\process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\18176\Desktop\luxlux20\examples\stable_baselines3\common\vec_env\subproc_vec_env.py", line 29, in _worker
    observation, reward, done, info = env.step(data)
  File "C:\Users\18176\Desktop\luxlux20\examples\luxai2021\env\lux_env.py", line 64, in step
    obs = self.learning_agent.get_observation(self.game, unit, city_tile, team, is_new_turn)
  File "C:\Users\18176\Desktop\luxlux20\examples\agent_policy.py", line 369, in get_observation
    c = game.cities[game.map.get_cell_by_pos(closest_position).city_tile.city_id]
AttributeError: 'NoneType' object has no attribute 'city_id'

Traceback (most recent call last):
  File "C:\Users\18176\anaconda3\envs\pythree\lib\multiprocessing\connection.py", line 312, in _recv_bytes
    nread, err = ov.GetOverlappedResult(True)
BrokenPipeError: [WinError 109] The pipe has been ended

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "Other_train.py", line 191, in <module>
    train(local_args)
  File "Other_train.py", line 163, in train
    model.learn(total_timesteps=args.step_count, reset_num_timesteps=True)
  File "C:\Users\18176\Desktop\luxlux20\examples\stable_baselines3\ppo\ppo.py", line 310, in learn
    reset_num_timesteps=reset_num_timesteps,
  File "C:\Users\18176\Desktop\luxlux20\examples\stable_baselines3\common\on_policy_algorithm.py", line 237, in learn
    continue_training = self.collect_rollouts(self.env, callback, self.rollout_buffer, n_rollout_steps=self.n_steps)
  File "C:\Users\18176\Desktop\luxlux20\examples\stable_baselines3\common\on_policy_algorithm.py", line 178, in collect_rollouts
    new_obs, rewards, dones, infos = env.step(clipped_actions)
  File "C:\Users\18176\Desktop\luxlux20\examples\stable_baselines3\common\vec_env\base_vec_env.py", line 162, in step
    return self.step_wait()
  File "C:\Users\18176\Desktop\luxlux20\examples\stable_baselines3\common\vec_env\subproc_vec_env.py", line 120, in step_wait
    results = [remote.recv() for remote in self.remotes]
  File "C:\Users\18176\Desktop\luxlux20\examples\stable_baselines3\common\vec_env\subproc_vec_env.py", line 120, in <listcomp>
    results = [remote.recv() for remote in self.remotes]
  File "C:\Users\18176\anaconda3\envs\pythree\lib\multiprocessing\connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "C:\Users\18176\anaconda3\envs\pythree\lib\multiprocessing\connection.py", line 321, in _recv_bytes
    raise EOFError
```
Hate to nag, but the command for recording replays does not seem to work, and the new updated files don't compile on the Kaggle server for submissions.
Hi @goforks12, is this a different issue now? If so, can you please open a separate issue per problem? Also, more details on the second problem would be helpful, I think.
It seems to be a problem in your custom code, in this line specifically:

```python
c = game.cities[game.map.get_cell_by_pos(closest_position).city_tile.city_id]
```

Are there any changes in the agent that you run, in comparison with the git version?
I didn't mess with any of the game engine. I didn't change anything within the LuxAI computations. I was, however, using 16 CPU cores, and the MLP I was training had much larger layers.
```
lux-ai-2021 --seed=100 ./kaggle_submissions/main_lux-ai-2021.py ./kaggle_submissions/main_lux-ai-2021.py --maxtime 100000
```

I try to run this command in bash with my model.zip and my agent_policy.py in the kaggle_submissions folder. Should `lux-ai-2021` be a Python file? Or should it be the folder we cd into to run the evaluation?
`lux-ai-2021` is a command added by the official Lux AI repo; check out the installation instructions here if the command isn't found in your environment:
https://github.com/Lux-AI-Challenge/Lux-Design-2021
If you didn't modify `agent_policy.py` to create your own agent yet, then I suspect there must be a rare game-engine bug where the `Game.cities` list is somehow not accurate, pointing to a City that doesn't actually belong to its cell anymore. I'll have a quick look through the code to see if I can spot anything. As a workaround, you can add a try/except to the `get_observation()` function in `agent_policy.py` to ignore and log errors.
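Something along these lines, for example (a sketch only, not a tested patch; the names come from the traceback above):

```python
# Sketch of the suggested workaround: wrap the fragile city lookup inside
# get_observation() in agent_policy.py so a rare engine inconsistency doesn't
# kill the whole SubprocVecEnv worker. Variable names follow the traceback above.
try:
    cell = game.map.get_cell_by_pos(closest_position)
    c = game.cities[cell.city_tile.city_id]
except (AttributeError, KeyError) as err:
    # cell.city_tile was None, or the city id was stale; log it and fall back
    print(f"get_observation(): skipping stale city lookup ({err})")
    c = None
```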
I was doing an obscenely long training period. Will use shorter times now.
Here is an example training run from an 'okay' RL personal agent I've built. Notes:

- Regarding the `ep_len_mean` plot below: because it makes heavy use of action sequences, they aren't very comparable.

Learning curve for a few batch sizes (n_steps is set to batch_size for each one):
Here are a couple of replay files of the trained agent from the `batch_size==10000` run; it's not great:
replays.zip
Unzip and you can view the replays here:
https://2021vis.lux-ai.org/
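For reference, those runs map onto PPO settings roughly like this (a sketch; `env` and the exact values are placeholders, not the real training config):

```python
# Rough sketch of how the batch-size sweep above maps onto stable-baselines3 PPO
# settings; `env` and every value here are placeholders, not the exact config.
from stable_baselines3 import PPO

batch_size = 10000  # the replays above came from the batch_size == 10000 run
# model = PPO("MlpPolicy", env,
#             n_steps=batch_size,   # n_steps set equal to batch_size, as in the plot
#             batch_size=batch_size)
# model.learn(total_timesteps=20_000_000)
```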
I've been thinking about the learning design that is implemented here, and I just can't resolve two questions for myself. The core function for the learning is the environment step function. The chain of learning is:

`[OBS_UNIT1 -> ACTION1 -> REWARD -> OBS_UNIT2 -> ACTION2 -> OBS_UNIT3 -> ACTION3 ... -> ALL TURN ACTIONS ARE ACTUALLY TAKEN] -> [THE SAME FOR THE NEXT TURN ...]`

The questions are:

1. Less important: only the first action gets the reward. Doesn't that create significant problems, especially when the number of units per turn is big? Especially if the discount factor `gamma` is small, but also in general; even this intermediate reward is delayed for most actions. I wonder how much harder life is for the model because of this. Also, the ordering in which the units act can be important. I can imagine that the model can handle it, but is there an example of multi-unit problems designed like this?
2. More important: algorithms like TD(0) and Q-learning, and more involved ones like PPO, all depend for the model update not only on the current state (or state-action pair) but also on the next one. But the next step is a different unit, its observation is unit-dependent, and its value function is completely different and barely related. The process is basically not Markovian; the states are heavily incomplete information, and each time a different incomplete information. Isn't that a no-go? Or am I misunderstanding something major?

Please share your thoughts!