[Help Request] Failed to reset when using multiprocessing

JiyuanTHU commented 7 months ago

High Level Description

When I use multiprocessing to create multiple envs, the following error occurs intermittently (not on every run). The error log is attached below. I hope someone can help me with this.

Version

smarts 1.4.0, sumo 1.19.0

Operating System

Ubuntu 20.04

Problems

ERROR:SMARTS:Failed to successfully reset after 1 tries.
Process Process-64:
Traceback (most recent call last):
  File "/home/yuan/anaconda3/envs/safe-smarts/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/yuan/anaconda3/envs/safe-smarts/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yuan/Ji_ws/Learning-from-Intervention/tianshou/env/worker/subproc.py", line 96, in _worker
    obs, info = env.reset(**data)
  File "/home/yuan/Ji_ws/Learning-from-Intervention/utils/env_wrapper.py", line 155, in reset
    obs, info = self.env.reset(seed=seed, options=options)
  File "/home/yuan/anaconda3/envs/safe-smarts/lib/python3.8/site-packages/smarts/env/gymnasium/wrappers/single_agent.py", line 78, in reset
    obs, info = self.env.reset(seed=seed, options=options)
  File "/home/yuan/anaconda3/envs/safe-smarts/lib/python3.8/site-packages/gymnasium/wrappers/order_enforcing.py", line 61, in reset
    return self.env.reset(**kwargs)
  File "/home/yuan/anaconda3/envs/safe-smarts/lib/python3.8/site-packages/smarts/env/gymnasium/hiway_env_v1.py", line 348, in reset
    observations = self._smarts.reset(
  File "/home/yuan/anaconda3/envs/safe-smarts/lib/python3.8/site-packages/smarts/core/smarts.py", line 471, in reset
    raise first_exception
  File "/home/yuan/anaconda3/envs/safe-smarts/lib/python3.8/site-packages/smarts/core/smarts.py", line 464, in reset
    return self._reset(scenario, start_time)
  File "/home/yuan/anaconda3/envs/safe-smarts/lib/python3.8/site-packages/smarts/core/smarts.py", line 501, in _reset
    self.setup(scenario)
  File "/home/yuan/anaconda3/envs/safe-smarts/lib/python3.8/site-packages/smarts/core/smarts.py", line 539, in setup
    provider_state = self._setup_providers(self._scenario)
  File "/home/yuan/anaconda3/envs/safe-smarts/lib/python3.8/site-packages/smarts/core/smarts.py", line 1230, in _setup_providers
    new_provider_state = self._handle_provider(provider, provider_error)
  File "/home/yuan/anaconda3/envs/safe-smarts/lib/python3.8/site-packages/smarts/core/smarts.py", line 1265, in _handle_provider
    provider_state, recovered = provider.recover(
  File "/home/yuan/anaconda3/envs/safe-smarts/lib/python3.8/site-packages/smarts/core/sumo_traffic_simulation.py", line 463, in recover
    raise error
  File "/home/yuan/anaconda3/envs/safe-smarts/lib/python3.8/site-packages/smarts/core/smarts.py", line 1228, in _setup_providers
    new_provider_state = provider.setup(scenario)
  File "/home/yuan/anaconda3/envs/safe-smarts/lib/python3.8/site-packages/smarts/core/sumo_traffic_simulation.py", line 324, in setup
    self._initialize_traci_conn()
  File "/home/yuan/anaconda3/envs/safe-smarts/lib/python3.8/site-packages/smarts/core/sumo_traffic_simulation.py", line 239, in _initialize_traci_conn
    self._traci_conn.setOrder(0)
TypeError: 'NoneType' object is not callable

Traceback (most recent call last):
  File "main_smarts_tianshou_safe.py", line 409, in <module>
    test_discrete_sac()
  File "main_smarts_tianshou_safe.py", line 209, in test_discrete_sac
    result = offpolicy_trainer(
  File "/home/yuan/Ji_ws/Learning-from-Intervention/tianshou/trainer/offpolicy.py", line 133, in offpolicy_trainer
    return OffpolicyTrainer(*args, **kwargs).run()
  File "/home/yuan/Ji_ws/Learning-from-Intervention/tianshou/trainer/base.py", line 441, in run
    deque(self, maxlen=0)  # feed the entire iterator into a zero-length deque
  File "/home/yuan/Ji_ws/Learning-from-Intervention/tianshou/trainer/base.py", line 315, in __next__
    test_stat, self.stop_fn_flag = self.test_step()
  File "/home/yuan/Ji_ws/Learning-from-Intervention/tianshou/trainer/base.py", line 344, in test_step
    test_result = test_episode(
  File "/home/yuan/Ji_ws/Learning-from-Intervention/tianshou/trainer/utils.py", line 27, in test_episode
    result = collector.collect(n_episode=n_episode)
  File "/home/yuan/Ji_ws/Learning-from-Intervention/tianshou/data/collector.py", line 344, in collect
    self._reset_env_with_ids(
  File "/home/yuan/Ji_ws/Learning-from-Intervention/tianshou/data/collector.py", line 174, in _reset_env_with_ids
    obs_reset, info = self.env.reset(global_ids, **gym_reset_kwargs)
  File "/home/yuan/Ji_ws/Learning-from-Intervention/tianshou/env/venvs.py", line 282, in reset
    ret_list = [self.workers[i].recv() for i in id]
  File "/home/yuan/Ji_ws/Learning-from-Intervention/tianshou/env/venvs.py", line 282, in <listcomp>
    ret_list = [self.workers[i].recv() for i in id]
  File "/home/yuan/Ji_ws/Learning-from-Intervention/tianshou/env/worker/subproc.py", line 204, in recv
    result = self.parent_remote.recv()
  File "/home/yuan/anaconda3/envs/safe-smarts/lib/python3.8/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/yuan/anaconda3/envs/safe-smarts/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/home/yuan/anaconda3/envs/safe-smarts/lib/python3.8/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError

Gamenot commented 7 months ago

Hello, with multiprocessing you have to use centralized TraCI server management, because the conventional way of acquiring a port from the OS is subject to race conditions. See https://smarts.readthedocs.io/en/latest/ecosystem/sumo.html#centralized-traci-management

Full context of why this is the case is here: #2139
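
For readers unfamiliar with the port race mentioned above, here is a minimal sketch in plain Python. It is not SMARTS code: `find_free_port` and `demonstrate_port_race` are hypothetical names, and the assumption is that the "conventional means" refers to the common pattern of binding to port 0, reading the assigned port, and releasing it before the real server starts. Nothing reserves the port in between, so a second worker can take it first.

```python
import socket

HOST = "127.0.0.1"


def find_free_port() -> int:
    """Ask the OS for an unused port, then release it immediately.

    Once the probe socket closes, nothing reserves the port for the caller.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.bind((HOST, 0))  # port 0 lets the OS choose a free port
        return probe.getsockname()[1]


def demonstrate_port_race() -> bool:
    """Return True if the 'free' port is taken before its owner can use it."""
    # Worker A picks a port but has not started its server yet.
    port = find_free_port()

    # Worker B (simulated in the same process) binds that port first.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as worker_b:
        worker_b.bind((HOST, port))
        worker_b.listen(1)

        # Worker A now tries to start its server on the port it was given.
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as worker_a:
            try:
                worker_a.bind((HOST, port))
            except OSError:
                return True  # address already in use: the race was lost
    return False


if __name__ == "__main__":
    print("port stolen before use:", demonstrate_port_race())
```

With many worker processes each picking a port this way, the window between choosing the port and using it is where collisions can occur. A single central process that hands out ports or servers removes that window, which appears to be the idea behind the centralized TraCI management linked above.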

JiyuanTHU commented 7 months ago

Thanks! This helps

huawei-noah / SMARTS