glmcdona / LuxPythonEnvGym

Matching python environment code for Lux AI 2021 Kaggle competition, and a gym interface for RL models.
MIT License

Error in Kaggle submission #89

Closed hokhay closed 2 years ago

hokhay commented 2 years ago

Hi,

I encountered an error after a Kaggle submission. Below is the error log from the game on Kaggle. The game plays for only 1 turn and then stops. I used Python 3.7 to train the model.

[[{"duration": 9.627871, "stdout": "", "stderr": "Traceback (most recent call last):\n  
File \"./main_lux-ai-2021.py\", line 23, in <module>\n    
model = PPO.load(f\"model.zip\")\n  
File \"/kaggle_simulations/agent/stable_baselines3/common/base_class.py\", line 651, in load\n    
data, params, pytorch_variables = load_from_zip_file(path, device=device, custom_objects=custom_objects)\n  
File \"/kaggle_simulations/agent/stable_baselines3/common/save_util.py\", line 402, in load_from_zip_file\n    
data = json_to_data(json_data, custom_objects=custom_objects)\n  
File \"/kaggle_simulations/agent/stable_baselines3/common/save_util.py\", line 164, in json_to_data\n    
deserialized_object = cloudpickle.loads(base64_object)\n
ValueError: unsupported pickle protocol: 5\n"}],
 [{"duration": 0.004913, "stdout": "", "stderr": "Traceback (most recent call last):\n  
File \"/opt/conda/lib/python3.7/site-packages/kaggle_environments/agent.py\", line 157, in act\n    
action = self.agent(*args)\n  
File \"/opt/conda/lib/python3.7/site-packages/kaggle_environments/agent.py\", line 129, in callable_agent\n    
if callable(agent) \\\n  
File \"/kaggle_simulations/agent/main.py\", line 76, in python_policy_agent\n    
agent_process.stdin.flush()\n
BrokenPipeError: [Errno 32] Broken pipe\n"}]]

Has anyone else encountered this error?

Thanks, Jason

glmcdona commented 2 years ago

I think this is the key part of the error message: ValueError: unsupported pickle protocol: 5

What has happened is that model.zip was trained in a Python environment with, I believe, a different cloudpickle version from the one on Kaggle. Which version of Python are you using for training?

This might help: https://stackoverflow.com/questions/63329657/python-3-7-error-unsupported-pickle-protocol-5
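To make the mismatch concrete, here is a minimal sketch (standard library only) that reports the highest pickle protocol the current interpreter supports and peeks at the protocol number recorded in a pickle payload. It relies on the fact that payloads written with protocol 2 or later begin with the PROTO opcode byte 0x80 followed by the protocol number. The helper names pickle_protocol_of and can_load are made up for this illustration and are not part of stable_baselines3 or cloudpickle.

```python
import pickle
import sys
from typing import Optional


def pickle_protocol_of(blob: bytes) -> Optional[int]:
    """Return the protocol recorded in a pickle payload, or None for protocols 0 and 1."""
    # Protocol 2+ payloads start with the PROTO opcode (0x80) and one version byte.
    if len(blob) >= 2 and blob[0] == 0x80:
        return blob[1]
    return None


def can_load(blob: bytes) -> bool:
    """True if this interpreter's pickle module supports the payload's protocol."""
    protocol = pickle_protocol_of(blob)
    return protocol is None or protocol <= pickle.HIGHEST_PROTOCOL


if __name__ == "__main__":
    payload = pickle.dumps({"example": 1})  # uses this interpreter's default protocol
    print("Python", sys.version_info[:2], "highest protocol:", pickle.HIGHEST_PROTOCOL)
    print("payload protocol:", pickle_protocol_of(payload), "loadable here:", can_load(payload))
```

Python 3.7 supports pickle protocols up to 4, and protocol 5 was added in Python 3.8, so a payload written with protocol 5 cannot be read by a 3.7 interpreter.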

hokhay commented 2 years ago

Sorry, I am not sure what cloudpickle is, but I wonder whether the problem is caused by my using a PyCharm virtual environment for training instead of the system Python.

glmcdona commented 2 years ago

Cloudpickle is the library that stable_baselines3 uses to serialize the resulting model into model.zip. Changing your PyCharm virtual environment to one that uses Python 3.7.* will likely fix it. Alternatively, the Stack Overflow link above may show ways to convert your model.zip format.
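As a rough illustration of what such a conversion involves, the sketch below re-serializes an object with pickle protocol 4, which Python 3.7 can read. It uses only the standard pickle module on a plain dictionary. It does not unpack or rewrite a stable_baselines3 model.zip, and the function name repickle_with_protocol is hypothetical. It has to run under an interpreter that can read the original payload (for protocol 5, Python 3.8 or newer).

```python
import pickle


def repickle_with_protocol(blob: bytes, protocol: int = 4) -> bytes:
    """Load a pickle payload and dump it again with an explicit, older protocol."""
    obj = pickle.loads(blob)  # only unpickle data you trust
    return pickle.dumps(obj, protocol=protocol)


if __name__ == "__main__":
    original = pickle.dumps({"weights": [0.1, 0.2]})  # interpreter's default protocol
    converted = repickle_with_protocol(original, protocol=4)
    print("converted payload starts with:", converted[:2])
```

Retraining under Python 3.7 is usually simpler than converting, because a saved model may contain more than one serialized object.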

royerk commented 2 years ago

I had this exact issue, and it came from using a Python version higher than the 3.7 that the Kaggle environment uses. The Python version determines which pickle protocol is used, and there are some incompatibilities.

PyCharm isn't the issue. You should be able to change the interpreter from your current Python version to Python 3.7 (you may have to install Python 3.7 on your machine). If you have multiple versions of Python on your machine, PyCharm can create a virtual environment for you based on a specific version.
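One way to catch a mismatched interpreter early is to check the version at the top of the training script. The sketch below assumes the target is Python 3.7, as stated in this thread. The names EXPECTED_PYTHON and python_matches are hypothetical.

```python
import sys

EXPECTED_PYTHON = (3, 7)  # assumption: the version Kaggle runs, per this thread


def python_matches(version_info=None, expected=EXPECTED_PYTHON) -> bool:
    """Compare the (major, minor) part of an interpreter version with the expected one."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == tuple(expected)


if __name__ == "__main__":
    if python_matches():
        print("Interpreter matches the expected Python %d.%d" % EXPECTED_PYTHON)
    else:
        print(
            "Warning: training with Python %d.%d, but the submission is expected to run on %d.%d"
            % (sys.version_info[0], sys.version_info[1], EXPECTED_PYTHON[0], EXPECTED_PYTHON[1])
        )
```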

hokhay commented 2 years ago

Thank you guys for helping me out. I am going to re-run the training with Python 3.7 then.

hokhay commented 2 years ago

Hey guys,

After changing to Python 3.7, I think the Python version issue is gone, but I got two other error messages. This looks more like an issue with the program.

[[{"duration": 8.578081, "stdout": "", "stderr": "Traceback (most recent call last):\n  File \"./main_lux-ai-2021.py\", line 23, in <module>\n    
model = PPO.load(f\"model.zip\")\n  
File \"/kaggle_simulations/agent/stable_baselines3/common/base_class.py\", line 688, in load\n    
model._setup_model()\n  File \"/kaggle_simulations/agent/stable_baselines3/ppo/ppo.py\", line 155, in _setup_model\n    super(PPO, self)._setup_model()\n  
File \"/kaggle_simulations/agent/stable_baselines3/common/on_policy_algorithm.py\", line 118, in _setup_model\n    n_envs=self.n_envs,\n  
File \"/kaggle_simulations/agent/stable_baselines3/common/buffers.py\", line 328, in __init__\n    
super(RolloutBuffer, self).__init__(buffer_size, observation_space, action_space, device, n_envs=n_envs)\n  File \"/kaggle_simulations/agent/stable_baselines3/common/buffers.py\", line 49, in __init__\n    
self.obs_shape = get_obs_shape(observation_space)\n  
File \"/kaggle_simulations/agent/stable_baselines3/common/preprocessing.py\", line 144, in get_obs_shape\n    
return observation_space.shape\nAttributeError: 'Box' object has"}],
 [{"duration": 0.004079, "stdout": "", "stderr": "Traceback (most recent call last):\n  
File \"/opt/conda/lib/python3.7/site-packages/kaggle_environments/agent.py\", line 157, in act\n    action = self.agent(*args)\n  File \"/opt/conda/lib/python3.7/site-packages/kaggle_environments/agent.py\", line 129, in callable_agent\n    
if callable(agent) \\\n  
File \"/kaggle_simulations/agent/main.py\", line 76, in python_policy_agent\n    
agent_process.stdin.flush()\nBrokenPipeError: [Errno 32] Broken pipe\n"}]]

Jason

glmcdona commented 2 years ago

Hrm, I think I have encountered this error before, and I think I remember what it is. Are you using a dictionary observation space? If I recall, training worked fine with extra dictionary keys in the observation, but inference failed with an error like this. The solution was to remove any dictionary keys in the observation that aren't used.
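A minimal sketch of that fix, assuming the observation is a plain Python dict and that the set of keys the model actually consumes is known. The key names and the helper keep_used_keys are invented for illustration and are not taken from LuxPythonEnvGym.

```python
from typing import Any, Dict, Iterable


def keep_used_keys(observation: Dict[str, Any], used_keys: Iterable[str]) -> Dict[str, Any]:
    """Return a copy of the observation containing only the keys the model uses."""
    used = list(used_keys)
    missing = [key for key in used if key not in observation]
    if missing:
        raise KeyError("observation is missing expected keys: %s" % missing)
    return {key: observation[key] for key in used}


if __name__ == "__main__":
    # Hypothetical observation with one key that is never fed to the model.
    raw = {"map": [[0, 1], [1, 0]], "units": [3], "debug_info": "not fed to the model"}
    print(keep_used_keys(raw, ["map", "units"]))
```

Applying the same filter in both training and inference keeps the two observation layouts identical.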

Also, try to get it working locally first. For example, does this work? https://github.com/glmcdona/LuxPythonEnvGym#creating-and-viewing-a-replay

lux-ai-2021 ./kaggle_submissions/main_lux-ai-2021.py ./kaggle_submissions/main_lux-ai-2021.py --maxtime 100000

hokhay commented 2 years ago

I was using a version of the code from this GitHub repo from 2 weeks ago, and there was no error when I ran it locally. Now that I have downloaded the latest version of the code, I can replicate the same error as on Kaggle. Is there any change to the code that could produce this error?

I am now running into this error even with the original code from here. Could you give me a clue about what I need to modify in the observation?

Thanks a lot, Jason

royerk commented 2 years ago

Two things come to mind:

I have been using the repo without any issue for submissions. My guess is that something in your environment is not set up as Kaggle expects.

hokhay commented 2 years ago

One thing I found out is that when I run lux-ai-2021 kaggle_submissions/main_lux-ai-2021.py kaggle_submissions/main_lux-ai-2021.py --maxtime 100000 locally, I get the same AttributeError: 'Box' object has no attribute 'shape' as on Kaggle. I suppose this is caused by my default Python version being 3.8.

However, when I run lux-ai-2021 --python python3.7 kaggle_submissions/main_lux-ai-2021.py kaggle_submissions/main_lux-ai-2021.py --maxtime 100000, the simulation runs successfully. This shows that my model was trained in the Python 3.7 environment, so it can only run in a 3.7 environment.

Therefore, I am confused about why I get the same error on Kaggle as the one I get locally with Python 3.8.

Thanks a lot, Jason

hokhay commented 2 years ago

I have figured it out. It turns out that my Gym version was 0.20, while the code requires a version <0.20. The issue was solved when I reinstalled Gym 0.19.

This issue might be avoided if the Gym version were specified in setup.py.
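A sketch of what such a pin could look like, together with a runtime guard. It assumes that Gym 0.19 works and that the following release line does not, as reported in this thread. The repository's actual setup.py is not shown here, so INSTALL_REQUIRES and the helper names are assumptions.

```python
import re
from typing import Tuple

# Hypothetical requirement list, e.g. passed as setup(install_requires=INSTALL_REQUIRES).
INSTALL_REQUIRES = ["gym<0.20"]


def version_tuple(version: str) -> Tuple[int, ...]:
    """Turn a version string like '0.19.0' into (0, 19, 0) for numeric comparison."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first component without a leading number
        parts.append(int(match.group()))
    return tuple(parts)


def gym_version_supported(version: str, upper_bound: Tuple[int, ...] = (0, 20)) -> bool:
    """True if the version is strictly below the assumed upper bound."""
    return version_tuple(version) < upper_bound


if __name__ == "__main__":
    for candidate in ("0.19.0", "0.20.0"):
        print(candidate, "supported:", gym_version_supported(candidate))
```

In a real script the version string would come from the installed package, for example gym.__version__. Version components compare as integers, so 0.19 sorts below 0.20 even though 0.19 is larger than 0.2 as a decimal number.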

Thanks a lot for the help, guys.