huawei-noah / SMARTS

Scalable Multi-Agent RL Training School for Autonomous Driving
MIT License

Couldn't run baseline with ultra/train.py, Envision terminated #1125

Closed hz3014 closed 2 years ago

hz3014 commented 2 years ago

BUG REPORT

High Level Description Trying to run the ULTRA baseline. python ultra/train.py does not seem to interact with Envision, and terminates with ERROR:Client:Connection to Envision terminated with: [Errno 111] Connection refused

SMARTS version 0.4.18 master (tried develop branch almost the same)

Steps to reproduce the bug
  1. Under SMARTS, run pip install -e .[train]; the sanity test passed successfully.
  2. Switch to SMARTS/ultra and run pip install -e . which built with no error.
  3. Run the following commands in order:
scl scenario build-all ultra/scenarios/pool/experiment_pool/
python ultra/scenarios/interface.py generate --task 1 --level easy
./ultra/env/envision_base.sh
python ultra/train.py --task 1 --level easy --episodes 10 --eval-episodes 2 --eval-rate 5 --policy dqn

Resulting and expected behaviour The program terminates within a few minutes. The visualizer at http://localhost:8081/#/ does not seem to load anything and stays all black.
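
A quick way to tell whether anything is listening on the Envision port is a plain TCP probe. The following is a minimal sketch, assuming Envision serves on localhost:8081 as in the URL above; the function name envision_reachable is hypothetical and not part of SMARTS.

```python
import socket


def envision_reachable(host: str = "localhost", port: int = 8081, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection raises if nothing accepts the connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except ConnectionRefusedError:
        # Errno 111 on Linux: no process is listening on that port.
        return False
    except OSError:
        # Timeouts, unreachable hosts, name resolution failures, etc.
        return False


if __name__ == "__main__":
    print("Envision reachable:", envision_reachable())
```

If this prints False before training starts, the Connection refused error is likely to follow, since the client would have no server to connect to.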

Error logs and screenshots tried pytest -v tests/test_train.py under SMARTS/ultra as @Adaickalavan suggested.

tests/test_train.py::TrainTest::test_agent_is_instance_policy PASSED [ 14%]
tests/test_train.py::TrainTest::test_check_agents_from_pool PASSED [ 28%]
tests/test_train.py::TrainTest::test_spec_is_instance_agentspec PASSED [ 42%]
tests/test_train.py::TrainTest::test_train_cli PASSED [ 57%]
tests/test_train.py::TrainTest::test_train_cli_multiagent PASSED [ 71%]
tests/test_train.py::TrainTest::test_train_multiagent pybullet build time: Oct 8 2020 00:10:46 FAILED [ 85%]
tests/test_train.py::TrainTest::test_train_single_agent FAILED [100%]

error after ultra/train.py

python ultra/train.py --task 1 --level easy --episodes 10 --eval-episodes 2 --eval-rate 5 --policy dqn
WARNING:tensorflow:From /home/avt/SMARTS/ultra/.venv/lib/python3.7/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
2021-11-18 22:18:22,651 INFO services.py:1092 -- View the Ray dashboard at http://127.0.0.1:8265
pybullet build time: Oct 8 2020 00:10:46
/home/avt/SMARTS/ultra/.venv/lib/python3.7/site-packages/gym/logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
  warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
ERROR:Client:Connection to Envision terminated with: [Errno 111] Connection refused
ERROR:websocket:error from callback <function Client._connect.<locals>.on_close at 0x7fe0dd9d6c20>: on_close() takes 1 positional argument but 3 were given
ERROR:Client:Connection to Envision terminated with: on_close() takes 1 positional argument but 3 were given
╭────────────────────┬────────────────────┬────────────────────┬────────────────────┬────────────────────┬────────────────────╮
│ Episode │ Sim/Wall │ Total Steps │ Steps/Sec │ Score │ Goal Completed │
├────────────────────┼────────────────────┼────────────────────┼────────────────────┼────────────────────┼────────────────────┤
╰────────────────────┴────────────────────┴────────────────────┴────────────────────┴────────────────────┴────────────────────╯
Traceback (most recent call last):
  File "ultra/train.py", line 337, in <module>
    policy_ids=policy_ids,
  File "ultra/train.py", line 211, in train
    observations = env.reset()
  File "/home/avt/SMARTS/ultra/ultra/env/ultra_env.py", line 130, in reset
    scenario = next(self._scenarios_iterator)
StopIteration
pybullet build time: Oct 8 2020 00:10:46
pybullet build time: Oct 8 2020 00:10:46
pybullet build time: Oct 8 2020 00:10:46
ERROR:Client:Connection to Envision terminated with: [Errno 111] Connection refused
ERROR:websocket:error from callback <function Client._connect.<locals>.on_close at 0x7fe0dd9d6c20>: on_close() takes 1 positional argument but 3 were given
ERROR:Client:Connection to Envision terminated with: on_close() takes 1 positional argument but 3 were given
ERROR:Client:Connection to Envision terminated with: [Errno 111] Connection refused
ERROR:websocket:error from callback <function Client._connect.<locals>.on_close at 0x7fe0dd9d6c20>: on_close() takes 1 positional argument but 3 were given
ERROR:Client:Connection to Envision terminated with: on_close() takes 1 positional argument but 3 were given
ERROR:Client:Connection to Envision terminated with: [Errno 111] Connection refused
ERROR:websocket:error from callback <function Client._connect.<locals>.on_close at 0x7fe0dd9d6c20>: on_close() takes 1 positional argument but 3 were given
ERROR:Client:Connection to Envision terminated with: on_close() takes 1 positional argument but 3 were given
ERROR:Client:Connection to Envision terminated with: [Errno 111] Connection refused
ERROR:websocket:error from callback <function Client._connect.<locals>.on_close at 0x7fe0dd9d6c20>: on_close() takes 1 positional argument but 3 were given
ERROR:Client:Connection to Envision terminated with: on_close() takes 1 positional argument but 3 were given
ERROR:Client:Connection to Envision terminated with: [Errno 111] Connection refused
ERROR:websocket:error from callback <function Client._connect.<locals>.on_close at 0x7fe0dd9d6c20>: on_close() takes 1 positional argument but 3 were given
ERROR:Client:Connection to Envision terminated with: on_close() takes 1 positional argument but 3 were given
ERROR:Client:Connection to Envision terminated with: [Errno 111] Connection refused
ERROR:websocket:error from callback <function Client._connect.<locals>.on_close at 0x7fe0dd9d6c20>: on_close() takes 1 positional argument but 3 were given
ERROR:Client:Connection to Envision terminated with: on_close() takes 1 positional argument but 3 were given
ERROR:Client:Connection to Envision terminated with: [Errno 111] Connection refused
ERROR:websocket:error from callback <function Client._connect.<locals>.on_close at 0x7fe0dd9d6c20>: on_close() takes 1 positional argument but 3 were given
ERROR:Client:Connection to Envision terminated with: on_close() takes 1 positional argument but 3 were given
ERROR:Client:Connection to Envision terminated with: [Errno 111] Connection refused
ERROR:websocket:error from callback <function Client._connect.<locals>.on_close at 0x7fe0dd9d6c20>: on_close() takes 1 positional argument but 3 were given
ERROR:Client:Connection to Envision terminated with: on_close() takes 1 positional argument but 3 were given
ERROR:Client:Connection to Envision terminated with: [Errno 111] Connection refused
ERROR:websocket:error from callback <function Client._connect.<locals>.on_close at 0x7fe0dd9d6c20>: on_close() takes 1 positional argument but 3 were given
ERROR:Client:Connection to Envision terminated with: on_close() takes 1 positional argument but 3 were given
ERROR:RemoteAgentBuffer:Exception while tearing down buffered remote agent.
ValueError('Cannot invoke RPC on closed channel!')
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/avt/SMARTS/ultra/.venv/lib/python3.7/site-packages/smarts/core/remote_agent_buffer.py", line 108, in destroy
    raise e
  File "/home/avt/SMARTS/ultra/.venv/lib/python3.7/site-packages/smarts/core/remote_agent_buffer.py", line 103, in destroy
    remote_agent.terminate()
  File "/home/avt/SMARTS/ultra/.venv/lib/python3.7/site-packages/smarts/core/remote_agent.py", line 85, in terminate
    manager_pb2.Port(num=self._worker_address[1])
  File "/home/avt/SMARTS/ultra/.venv/lib/python3.7/site-packages/grpc/_channel.py", line 825, in __call__
    wait_for_ready, compression)
  File "/home/avt/SMARTS/ultra/.venv/lib/python3.7/site-packages/grpc/_channel.py", line 812, in _blocking
    ),), self._context)
  File "src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi", line 498, in grpc._cython.cygrpc.Channel.segregated_call
  File "src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi", line 353, in grpc._cython.cygrpc._segregated_call
  File "src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi", line 357, in grpc._cython.cygrpc._segregated_call
ValueError: Cannot invoke RPC on closed channel!
Aborted at 1637245428 (unix time) try "date -d @1637245428" if you are using GNU date PC: @ 0x0 (unknown) SIGTERM (@0x3e800005cd3) received by PID 24298 (TID 0x7fe4251af740) from PID 23763; stack trace: @ 0x7fe424c24040 (unknown) @ 0x7fe424cfbe1f __select @ 0x4e8732 (unknown) @ 0x5d8c81 _PyMethodDef_RawFastCallKeywords @ 0x54bbe0 (unknown) @ 0x552d2d _PyEval_EvalFrameDefault @ 0x54cb89 _PyEval_EvalCodeWithName @ 0x5d9a82 _PyFunction_FastCallKeywords @ 0x54ee40 _PyEval_EvalFrameDefault @ 0x54cb89 _PyEval_EvalCodeWithName @ 0x5dac6e _PyFunction_FastCallDict @ 0x4d9482 (unknown) @ 0x5dd066 PyObject_Call @ 0x550267 _PyEval_EvalFrameDefault @ 0x5d977c _PyFunction_FastCallKeywords @ 0x54efdc _PyEval_EvalFrameDefault @ 0x5d977c _PyFunction_FastCallKeywords @ 0x54efdc _PyEval_EvalFrameDefault @ 0x5d977c _PyFunction_FastCallKeywords @ 0x54efdc _PyEval_EvalFrameDefault @ 0x5daaa6 _PyFunction_FastCallDict @ 0x590713 (unknown) @ 0x5da1c9 _PyObject_FastCallKeywords @ 0x552fb7 _PyEval_EvalFrameDefault @ 0x5d977c _PyFunction_FastCallKeywords @ 0x54baa0 (unknown) @ 0x552d2d _PyEval_EvalFrameDefault @ 0x5d977c _PyFunction_FastCallKeywords @ 0x54baa0 (unknown) @ 0x552d2d _PyEval_EvalFrameDefault @ 0x5d977c _PyFunction_FastCallKeywords @ 0x54efdc _PyEval_EvalFrameDefault
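
The repeated "on_close() takes 1 positional argument but 3 were given" lines look consistent with a close callback written for one argument being invoked with three. Newer websocket-client releases pass a status code and a message in addition to the socket object. The sketch below is only an illustration of a callback that tolerates both call styles. It uses plain function calls rather than the library, and it is not the SMARTS Client code.

```python
from typing import Any, Optional


def on_close(
    ws: Any,
    close_status_code: Optional[int] = None,
    close_msg: Optional[str] = None,
) -> str:
    """Close callback that accepts either one or three positional arguments."""
    return f"connection closed (code={close_status_code}, msg={close_msg})"


# Old-style invocation: only the socket object is passed.
print(on_close(object()))

# New-style invocation: socket object, status code, and message are passed.
print(on_close(object(), 1000, "normal closure"))
```

Giving the extra parameters default values keeps the same callback usable regardless of which calling convention the installed library follows.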

System information

Impact [If known]

Adaickalavan commented 2 years ago
  1. For all ultra matters, please use the ultra-develop branch.
  2. Instructions for getting started with ultra are indexed at SMARTS/ultra/README.md in the ultra-develop branch.
  3. Let us know if errors still occur after following the instructions in the ultra-develop branch.
hz3014 commented 2 years ago

Thank you for the reply. I followed the exact process in SMARTS/ultra/README.md. Unfortunately, I still get a similar error after switching to the ultra-develop branch. pip install -e . succeeds. However, Envision is empty, pytest -v tests/test_train.py does not pass, and ultra/train.py gives the following:

python ultra/train.py --task 1 --level easy --episodes 10 --eval-episodes 2 --eval-rate 5 --policy sac
WARNING:tensorflow:From /home/avt/SMARTS/ultra/.venv/lib/python3.7/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
2021-11-19 12:02:04,865 INFO services.py:1092 -- View the Ray dashboard at http://127.0.0.1:8265
/home/avt/SMARTS/ultra/.venv/lib/python3.7/site-packages/gym/logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
  warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
/home/avt/SMARTS/ultra/.venv/lib/python3.7/site-packages/smarts/env/hiway_env.py:101: DeprecationWarning: timestep_sec has been deprecated in favor of fixed_timestep_sec. Please update your code.
  category=DeprecationWarning,
╭─────────────────┬─────────────────┬─────────────────┬─────────────────┬────────────────────┬────────────────────┬──────────────────────────────────────────────────────────────╮
│ Episode │ Sim/Wall │ Total Steps │ Steps/Sec │ Score │ Goal Completed │ Scenario Name │
├─────────────────┼─────────────────┼─────────────────┼─────────────────┼────────────────────┼────────────────────┼──────────────────────────────────────────────────────────────┤
<<<<<<< MODEL SAVED >>>>>>>>> logs/experiment-2021.11.19-12:8:2-sac-v0/models/000/0
╰─────────────────┴─────────────────┴─────────────────┴─────────────────┴────────────────────┴────────────────────┴──────────────────────────────────────────────────────────────╯
/usr/lib/python3.7/subprocess.py:883: ResourceWarning: subprocess 32140 is still running
  ResourceWarning, source=self)
ResourceWarning: Enable tracemalloc to get the object allocation traceback
Traceback (most recent call last):
  File "ultra/train.py", line 395, in <module>
    policy_ids=policy_ids,
  File "ultra/train.py", line 314, in train
    for agent_id, observation in observations.items()
  File "ultra/train.py", line 314, in
    for agent_id, observation in observations.items()
  File "/home/avt/SMARTS/ultra/ultra/baselines/sac/sac/policy.py", line 175, in act
    action, _, mean = self.sac_net.sample(state)
  File "/home/avt/SMARTS/ultra/ultra/baselines/sac/sac/network.py", line 96, in sample
    return self.actor(state, training=training)
  File "/home/avt/SMARTS/ultra/.venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/avt/SMARTS/ultra/ultra/baselines/sac/sac/network.py", line 217, in forward
    social_vehicles_state, training
  File "/home/avt/SMARTS/ultra/.venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/avt/SMARTS/ultra/ultra/baselines/common/social_vehicles_encoders/precog_encoder.py", line 63, in forward
    social_embeddings = self.social_net(social_features)
  File "/home/avt/SMARTS/ultra/.venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/avt/SMARTS/ultra/.venv/lib/python3.7/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/home/avt/SMARTS/ultra/.venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/avt/SMARTS/ultra/.venv/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 87, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/avt/SMARTS/ultra/.venv/lib/python3.7/site-packages/torch/nn/functional.py", line 1370, in linear
    ret = torch.addmm(bias, input, weight.t())
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)
Aborted at 1637294885 (unix time) try "date -d @1637294885" if you are using GNU date
PC: @ 0x0 (unknown)
SIGTERM (@0x3e800007d6f) received by PID 32698 (TID 0x7f5f33ba8740) from PID 32111; stack trace: ***
@ 0x7f5f3361d040 (unknown) @ 0x7f5f336ee184 __read @ 0x529307 _Py_read @ 0x4f4902 (unknown) @ 0x5d8da3 _PyMethodDef_RawFastCallKeywords @ 0x552bd3 _PyEval_EvalFrameDefault @ 0x54c522 _PyEval_EvalCodeWithName @ 0x5d9a82 _PyFunction_FastCallKeywords @ 0x54efdc _PyEval_EvalFrameDefault @ 0x54c522 _PyEval_EvalCodeWithName @ 0x5d9a82 _PyFunction_FastCallKeywords @ 0x54efdc _PyEval_EvalFrameDefault @ 0x54c522 _PyEval_EvalCodeWithName @ 0x5d9a82 _PyFunction_FastCallKeywords @ 0x54baa0 (unknown) @ 0x552d2d _PyEval_EvalFrameDefault @ 0x54c522 _PyEval_EvalCodeWithName @ 0x5d9a82 _PyFunction_FastCallKeywords @ 0x54efdc _PyEval_EvalFrameDefault @ 0x54cb89 _PyEval_EvalCodeWithName @ 0x5dac6e _PyFunction_FastCallDict @ 0x550267 _PyEval_EvalFrameDefault @ 0x54c522 _PyEval_EvalCodeWithName @ 0x5d9a82 _PyFunction_FastCallKeywords @ 0x54efdc _PyEval_EvalFrameDefault @ 0x54cb89 _PyEval_EvalCodeWithName @ 0x5d9a82 _PyFunction_FastCallKeywords @ 0x54efdc _PyEval_EvalFrameDefault @ 0x54cb89 _PyEval_EvalCodeWithName @ 0x5d9a82 _PyFunction_FastCallKeywords @ 0x54ee40 _PyEval_EvalFrameDefault @ 0x54cb89 _PyEval_EvalCodeWithName

Adaickalavan commented 2 years ago
  1. Given that we see the error "CUDA error: CUBLAS_STATUS_EXECUTION_FAILED", are you using GPU?
  2. Does the error occur while using only CPU?
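
One way to test the CPU-only case without editing policy code is to hide the GPUs from the process before PyTorch initialises CUDA. The following is a minimal sketch under that assumption. The helper pick_device is hypothetical, and whether the ULTRA baselines select their device this way has not been checked here.

```python
import os

# Hide all CUDA devices from this process. This must be set before the
# first CUDA call, so it is done before importing torch.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

try:
    import torch
except ImportError:
    torch = None  # PyTorch is not installed; the sketch falls back to "cpu".


def pick_device() -> str:
    """Return "cuda" only if PyTorch is present and can see a GPU."""
    if torch is not None and torch.cuda.is_available():
        return "cuda"
    return "cpu"


print(pick_device())
```

The same effect can usually be had from the shell by prefixing the training command with CUDA_VISIBLE_DEVICES="" so that no source file needs to change.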
hz3014 commented 2 years ago
  1. I will come back later to resolve the environment dependencies for GPU training.
  2. I changed the policy code to run purely on CPU, but I still get an error (log below).
  3. Envision does not seem to load anything, although it was loading when I ran pytest -v tests/test_train.py.

python ultra/train.py --task 1 --level easy --episodes 10 --eval-episodes 2 --eval-rate 5 --policy sac
WARNING:tensorflow:From /home/avt/SMARTS/ultra/.venv/lib/python3.7/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
2021-11-19 14:21:24,949 INFO services.py:1092 -- View the Ray dashboard at http://127.0.0.1:8265
/home/avt/SMARTS/ultra/.venv/lib/python3.7/site-packages/gym/logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
  warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
/home/avt/SMARTS/ultra/.venv/lib/python3.7/site-packages/smarts/env/hiway_env.py:101: DeprecationWarning: timestep_sec has been deprecated in favor of fixed_timestep_sec. Please update your code.
  category=DeprecationWarning,
╭─────────────────┬─────────────────┬─────────────────┬─────────────────┬────────────────────┬────────────────────┬──────────────────────────────────────────────────────────────╮
│ Episode │ Sim/Wall │ Total Steps │ Steps/Sec │ Score │ Goal Completed │ Scenario Name │
├─────────────────┼─────────────────┼─────────────────┼─────────────────┼────────────────────┼────────────────────┼──────────────────────────────────────────────────────────────┤
<<<<<<< MODEL SAVED >>>>>>>>> logs/experiment-2021.11.19-14:21:32-sac-v0/models/000/0
ERROR:SMARTS:Simulation crashed with exception. Attempting to cleanly shutdown.
ERROR:SMARTS:shape() takes from 1 to 2 positional arguments but 3 were given
Traceback (most recent call last):
  File "/home/avt/SMARTS/ultra/.venv/lib/python3.7/site-packages/smarts/core/smarts.py", line 164, in step
    return self._step(agent_actions, time_delta_since_last_step)
  File "/home/avt/SMARTS/ultra/.venv/lib/python3.7/site-packages/smarts/core/smarts.py", line 267, in _step
    self._try_emit_envision_state(provider_state, observations, scores)
  File "/home/avt/SMARTS/ultra/.venv/lib/python3.7/site-packages/smarts/core/smarts.py", line 1059, in _try_emit_envision_state
    v.vehicle_id
  File "/home/avt/SMARTS/ultra/.venv/lib/python3.7/site-packages/cached_property.py", line 36, in __get__
    value = obj.__dict__[self.func.__name__] = self.func(obj)
  File "/home/avt/SMARTS/ultra/.venv/lib/python3.7/site-packages/smarts/core/sumo_road_network.py", line 839, in geometry
    for road in self.roads
  File "/home/avt/SMARTS/ultra/.venv/lib/python3.7/site-packages/smarts/core/sumo_road_network.py", line 839, in
    for road in self.roads
TypeError: shape() takes from 1 to 2 positional arguments but 3 were given
╰─────────────────┴─────────────────┴─────────────────┴─────────────────┴────────────────────┴────────────────────┴──────────────────────────────────────────────────────────────╯
Traceback (most recent call last):
  File "ultra/train.py", line 395, in <module>
    policy_ids=policy_ids,
  File "ultra/train.py", line 316, in train
    next_observations, rewards, dones, infos = env.step(actions)
  File "/home/avt/SMARTS/ultra/ultra/env/ultra_env.py", line 95, in step
    agent_actions
  File "/home/avt/SMARTS/ultra/.venv/lib/python3.7/site-packages/smarts/core/smarts.py", line 164, in step
    return self._step(agent_actions, time_delta_since_last_step)
  File "/home/avt/SMARTS/ultra/.venv/lib/python3.7/site-packages/smarts/core/smarts.py", line 267, in _step
    self._try_emit_envision_state(provider_state, observations, scores)
  File "/home/avt/SMARTS/ultra/.venv/lib/python3.7/site-packages/smarts/core/smarts.py", line 1059, in _try_emit_envision_state
    v.vehicle_id
  File "/home/avt/SMARTS/ultra/.venv/lib/python3.7/site-packages/cached_property.py", line 36, in __get__
    value = obj.__dict__[self.func.__name__] = self.func(obj)
  File "/home/avt/SMARTS/ultra/.venv/lib/python3.7/site-packages/smarts/core/sumo_road_network.py", line 839, in geometry
    for road in self.roads
  File "/home/avt/SMARTS/ultra/.venv/lib/python3.7/site-packages/smarts/core/sumo_road_network.py", line 839, in
    for road in self.roads
TypeError: shape() takes from 1 to 2 positional arguments but 3 were given
Exception ignored in: <function SMARTS.__del__ at 0x7f7f00162170>
Traceback (most recent call last):
  File "/home/avt/SMARTS/ultra/.venv/lib/python3.7/site-packages/smarts/core/smarts.py", line 505, in __del__
  File "/home/avt/SMARTS/ultra/.venv/lib/python3.7/site-packages/smarts/core/smarts.py", line 483, in destroy
  File "/home/avt/SMARTS/ultra/.venv/lib/python3.7/site-packages/smarts/core/smarts.py", line 462, in teardown
  File "/home/avt/SMARTS/ultra/.venv/lib/python3.7/site-packages/smarts/core/utils/cache.py", line 130, in wrapper
  File "/home/avt/SMARTS/ultra/.venv/lib/python3.7/site-packages/smarts/core/vehicle_index.py", line 318, in teardown
  File "/home/avt/SMARTS/ultra/.venv/lib/python3.7/site-packages/smarts/core/vehicle_index.py", line 702, in _build_empty_controlled_by
TypeError: 'NoneType' object is not callable

Adaickalavan commented 2 years ago

The following shape() error is fixed by pull request #1134. You may consider trying again and letting us know.

File "/home/avt/SMARTS/ultra/.venv/lib/python3.7/site-packages/smarts/core/sumo_road_network.py", line 839, in geometry for road in self.roads
File "/home/avt/SMARTS/ultra/.venv/lib/python3.7/site-packages/smarts/core/sumo_road_network.py", line 839, in for road in self.roads
TypeError: shape() takes from 1 to 2 positional arguments but 3 were given
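
For readers unfamiliar with this kind of message, the sketch below reproduces the same wording with a hypothetical Road class. It only illustrates how a caller and callee can disagree on a method signature (for example across library versions). It does not show the SMARTS code or what pull request #1134 changed.

```python
class Road:
    """Hypothetical stand-in used only to reproduce the error wording."""

    def shape(self, width=None):
        # Accepts self plus at most one optional positional argument.
        return ("polygon", width)


road = Road()

# One explicit argument matches the signature and works.
print(road.shape(1.5))

try:
    # Two explicit arguments plus self make three positional arguments.
    road.shape(1.5, 3.0)
except TypeError as err:
    print(err)
```

The count in the message includes self, which is why a call with two explicit arguments is reported as three.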
Adaickalavan commented 2 years ago

In summary,

  1. A shape() error in SMARTS was fixed by pull request #1134.
  2. Following the instructions in the ultra-develop branch runs the training, connects to Envision, and displays the simulation in Envision.
    • Setup: /SMARTS/ultra/docs/setup.md
    • Getting started: /SMARTS/ultra/docs/getting_started.md
  3. Going forward, for all ultra matters, please use the ultra-develop branch.

Given that the problem appears resolved, this issue is being closed.

Feel free to reopen this issue if the problem persists, or open a new issue for other problems.