Closed by jayanthnair 1 year ago
@jayanthnair this seems to be a networking issue. I wasn't able to build a Docker image using the file you provided, but perhaps you can try dropping "-h", "0.0.0.0" when starting Ray Serve?
@GeneDer I tried that. Still getting the same issue. Curiously, the moment I stop the container, it says Deployed Serve App successfully.
2023-07-05 16:35:30 2023-07-05 21:35:30,069 WARNING deprecation.py:50 -- DeprecationWarning: `DirectStepOptimizer` has been deprecated. This will raise an error in the future!
2023-07-05 16:35:30 /usr/local/lib/python3.10/site-packages/gymnasium/spaces/box.py:127: UserWarning: WARN: Box bound precision lowered by casting to float32
2023-07-05 16:35:30 logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
2023-07-05 16:35:32 2023-07-05 21:35:32,142 WARNING services.py:1832 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=8.05gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2023-07-05 16:35:32 2023-07-05 21:35:32,275 INFO worker.py:1610 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
2023-07-05 16:35:34 (HTTPProxyActor pid=462) INFO: Started server process [462]
2023-07-05 16:35:35 (ServeController pid=435) INFO 2023-07-05 21:35:34,899 controller 435 deployment_state.py:1316 - Deploying new version of deployment default_ServePPOModel.
2023-07-05 16:35:35 (ServeController pid=435) INFO 2023-07-05 21:35:35,006 controller 435 deployment_state.py:1583 - Adding 1 replica to deployment default_ServePPOModel.
2023-07-05 16:35:37 (ServeReplica:default_ServePPOModel pid=494) DeprecationWarning: `DirectStepOptimizer` has been deprecated. This will raise an error in the future!
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) 2023-07-05 21:35:37,956 WARNING algorithm_config.py:2534 -- Setting `exploration_config={}` because you set `_enable_rl_module_api=True`. When RLModule API are enabled, exploration_config can not be set. If you want to implement custom exploration behaviour, please modify the `forward_exploration` method of the RLModule at hand. On configs that have a default exploration config, this must be done with `config.exploration_config={}`.
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) 2023-07-05 21:35:37,957 WARNING algorithm_config.py:2534 -- Setting `exploration_config={}` because you set `_enable_rl_module_api=True`. When RLModule API are enabled, exploration_config can not be set. If you want to implement custom exploration behaviour, please modify the `forward_exploration` method of the RLModule at hand. On configs that have a default exploration config, this must be done with `config.exploration_config={}`.
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) /usr/local/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py:484: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) `UnifiedLogger` will be removed in Ray 2.7.
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) return UnifiedLogger(config, logdir, loggers=None)
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) /usr/local/lib/python3.10/site-packages/ray/tune/logger/unified.py:53: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) The `JsonLogger interface is deprecated in favor of the `ray.tune.json.JsonLoggerCallback` interface and will be removed in Ray 2.7.
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) self._loggers.append(cls(self.config, self.logdir, self.trial))
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) /usr/local/lib/python3.10/site-packages/ray/tune/logger/unified.py:53: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) The `CSVLogger interface is deprecated in favor of the `ray.tune.csv.CSVLoggerCallback` interface and will be removed in Ray 2.7.
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) self._loggers.append(cls(self.config, self.logdir, self.trial))
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) /usr/local/lib/python3.10/site-packages/ray/tune/logger/unified.py:53: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) The `TBXLogger interface is deprecated in favor of the `ray.tune.tensorboardx.TBXLoggerCallback` interface and will be removed in Ray 2.7.
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) self._loggers.append(cls(self.config, self.logdir, self.trial))
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) 2023-07-05 21:35:37,985 WARNING algorithm_config.py:2534 -- Setting `exploration_config={}` because you set `_enable_rl_module_api=True`. When RLModule API are enabled, exploration_config can not be set. If you want to implement custom exploration behaviour, please modify the `forward_exploration` method of the RLModule at hand. On configs that have a default exploration config, this must be done with `config.exploration_config={}`.
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) 2023-07-05 21:35:37,993 WARNING deprecation.py:50 -- DeprecationWarning: `ValueNetworkMixin` has been deprecated. This will raise an error in the future!
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) 2023-07-05 21:35:37,993 WARNING deprecation.py:50 -- DeprecationWarning: `LearningRateSchedule` has been deprecated. This will raise an error in the future!
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) 2023-07-05 21:35:37,993 WARNING deprecation.py:50 -- DeprecationWarning: `EntropyCoeffSchedule` has been deprecated. This will raise an error in the future!
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) 2023-07-05 21:35:37,993 WARNING deprecation.py:50 -- DeprecationWarning: `KLCoeffMixin` has been deprecated. This will raise an error in the future!
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) Install gputil for GPU system monitoring.
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) /usr/local/lib/python3.10/site-packages/gymnasium/spaces/box.py:127: UserWarning: WARN: Box bound precision lowered by casting to float32
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) 2023-07-05 21:35:38,699 WARNING algorithm_config.py:2534 -- Setting `exploration_config={}` because you set `_enable_rl_module_api=True`. When RLModule API are enabled, exploration_config can not be set. If you want to implement custom exploration behaviour, please modify the `forward_exploration` method of the RLModule at hand. On configs that have a default exploration config, this must be done with `config.exploration_config={}`.
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) 2023-07-05 21:35:38,700 WARNING algorithm_config.py:656 -- Cannot create PPOConfig from given `config_dict`! Property cluster_name not supported.
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) 2023-07-05 21:35:38,702 WARNING policy.py:1065 -- `observation_space` in given policy state (Box(-inf, inf, (3,), float32)) does not match this Policy's observation space (Box(0.0, [3.e+04 2.e+01 3.e+02], (3,), float32)).
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) Restored on 172.17.0.2 from checkpoint: /src/inference_checkpoints/checkpoint_000020
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) Current state after restoring: {'_iteration': 20, '_timesteps_total': None, '_time_total': 319.60047125816345, '_episodes_total': 287}
2023-07-05 16:35:38 2023-07-05 21:35:38,837 INFO router.py:893 -- Using PowerOfTwoChoicesReplicaScheduler.
2023-07-05 16:35:38 2023-07-05 21:35:38,847 INFO router.py:370 -- Got updated replicas for deployment default_ServePPOModel: {'default_ServePPOModel#FdKEJy'}.
2023-07-05 16:35:39 (ServeController pid=435) INFO 2023-07-05 21:35:38,958 controller 435 deployment_state.py:1316 - Deploying new version of deployment default_ServePPOModel.
2023-07-05 16:35:39 2023-07-05 21:35:39,072 INFO router.py:370 -- Got updated replicas for deployment default_ServePPOModel: set().
2023-07-05 16:35:39 (ServeController pid=435) INFO 2023-07-05 21:35:39,068 controller 435 deployment_state.py:1466 - Stopping 1 replicas of deployment 'default_ServePPOModel' with outdated versions.
2023-07-05 16:35:41 (ServeController pid=435) INFO 2023-07-05 21:35:41,160 controller 435 deployment_state.py:1583 - Adding 1 replica to deployment default_ServePPOModel.
2023-07-05 16:35:43 (ServeReplica:default_ServePPOModel pid=576) DeprecationWarning: `DirectStepOptimizer` has been deprecated. This will raise an error in the future!
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) 2023-07-05 21:35:44,115 WARNING algorithm_config.py:2534 -- Setting `exploration_config={}` because you set `_enable_rl_module_api=True`. When RLModule API are enabled, exploration_config can not be set. If you want to implement custom exploration behaviour, please modify the `forward_exploration` method of the RLModule at hand. On configs that have a default exploration config, this must be done with `config.exploration_config={}`.
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) 2023-07-05 21:35:44,116 WARNING algorithm_config.py:2534 -- Setting `exploration_config={}` because you set `_enable_rl_module_api=True`. When RLModule API are enabled, exploration_config can not be set. If you want to implement custom exploration behaviour, please modify the `forward_exploration` method of the RLModule at hand. On configs that have a default exploration config, this must be done with `config.exploration_config={}`.
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) /usr/local/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py:484: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) `UnifiedLogger` will be removed in Ray 2.7.
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) return UnifiedLogger(config, logdir, loggers=None)
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) /usr/local/lib/python3.10/site-packages/ray/tune/logger/unified.py:53: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) The `JsonLogger interface is deprecated in favor of the `ray.tune.json.JsonLoggerCallback` interface and will be removed in Ray 2.7.
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) self._loggers.append(cls(self.config, self.logdir, self.trial))
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) /usr/local/lib/python3.10/site-packages/ray/tune/logger/unified.py:53: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) The `CSVLogger interface is deprecated in favor of the `ray.tune.csv.CSVLoggerCallback` interface and will be removed in Ray 2.7.
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) self._loggers.append(cls(self.config, self.logdir, self.trial))
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) /usr/local/lib/python3.10/site-packages/ray/tune/logger/unified.py:53: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) The `TBXLogger interface is deprecated in favor of the `ray.tune.tensorboardx.TBXLoggerCallback` interface and will be removed in Ray 2.7.
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) self._loggers.append(cls(self.config, self.logdir, self.trial))
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) 2023-07-05 21:35:44,141 WARNING algorithm_config.py:2534 -- Setting `exploration_config={}` because you set `_enable_rl_module_api=True`. When RLModule API are enabled, exploration_config can not be set. If you want to implement custom exploration behaviour, please modify the `forward_exploration` method of the RLModule at hand. On configs that have a default exploration config, this must be done with `config.exploration_config={}`.
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) 2023-07-05 21:35:44,152 WARNING deprecation.py:50 -- DeprecationWarning: `ValueNetworkMixin` has been deprecated. This will raise an error in the future!
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) 2023-07-05 21:35:44,152 WARNING deprecation.py:50 -- DeprecationWarning: `LearningRateSchedule` has been deprecated. This will raise an error in the future!
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) 2023-07-05 21:35:44,152 WARNING deprecation.py:50 -- DeprecationWarning: `EntropyCoeffSchedule` has been deprecated. This will raise an error in the future!
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) 2023-07-05 21:35:44,152 WARNING deprecation.py:50 -- DeprecationWarning: `KLCoeffMixin` has been deprecated. This will raise an error in the future!
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) Install gputil for GPU system monitoring.
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) /usr/local/lib/python3.10/site-packages/gymnasium/spaces/box.py:127: UserWarning: WARN: Box bound precision lowered by casting to float32
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) 2023-07-05 21:35:44,826 WARNING algorithm_config.py:2534 -- Setting `exploration_config={}` because you set `_enable_rl_module_api=True`. When RLModule API are enabled, exploration_config can not be set. If you want to implement custom exploration behaviour, please modify the `forward_exploration` method of the RLModule at hand. On configs that have a default exploration config, this must be done with `config.exploration_config={}`.
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) 2023-07-05 21:35:44,827 WARNING algorithm_config.py:656 -- Cannot create PPOConfig from given `config_dict`! Property cluster_name not supported.
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) 2023-07-05 21:35:44,829 WARNING policy.py:1065 -- `observation_space` in given policy state (Box(-inf, inf, (3,), float32)) does not match this Policy's observation space (Box(0.0, [3.e+04 2.e+01 3.e+02], (3,), float32)).
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) Restored on 172.17.0.2 from checkpoint: /src/inference_checkpoints/checkpoint_000020
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) Current state after restoring: {'_iteration': 20, '_timesteps_total': None, '_time_total': 319.60047125816345, '_episodes_total': 287}
2023-07-05 16:35:44 2023-07-05 21:35:44,904 INFO router.py:370 -- Got updated replicas for deployment default_ServePPOModel: {'default_ServePPOModel#RUgRhr'}.
2023-07-05 16:36:32 (pid=gcs_server) [2023-07-05 21:36:32,132 E 36 36] (gcs_server) gcs_job_manager.cc:227: Failed to get is_running_tasks from core worker: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
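The /dev/shm warning near the top of this log is separate from the connection failure, but it can be addressed on its own. The log reports 67108864 bytes available and asks for more than 30% of available RAM. A rough sketch of that arithmetic follows; the function name and the 16 GiB host are assumptions made for illustration, not something Ray exposes.

```python
def min_shm_bytes(total_ram_bytes: int, percent: int = 30) -> int:
    """Smallest byte count strictly greater than `percent` of RAM.

    The 30% default follows the hint in the Ray warning; integer
    arithmetic avoids floating-point rounding surprises.
    """
    return total_ram_bytes * percent // 100 + 1


if __name__ == "__main__":
    current = 67108864  # /dev/shm size reported in the log, in bytes
    print(f"current /dev/shm: {current // (1024 ** 2)} MiB")

    ram = 16 * 1024 ** 3  # hypothetical host with 16 GiB of RAM
    # docker run accepts --shm-size as a plain byte count when no unit is given.
    print(f"docker run --shm-size={min_shm_bytes(ram)} rl-agent")
```

The result is only a lower bound derived from the warning text; the exact value Ray suggests for a given machine is the one printed in its own message.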
@jayanthnair interesting. Maybe it's able to deploy to the head node but fails to connect to the worker node. Could you try adding RUN ray start --head to the Dockerfile right before the serve run command? I think this should start just a Ray head node and have Serve deploy onto that only.
@GeneDer Seems like it can't connect to the head node.
# Error message
2023-07-05 16:46:49 2023-07-05 21:46:49,345 WARNING deprecation.py:50 -- DeprecationWarning: `DirectStepOptimizer` has been deprecated. This will raise an error in the future!
2023-07-05 16:46:49 /usr/local/lib/python3.10/site-packages/gymnasium/spaces/box.py:127: UserWarning: WARN: Box bound precision lowered by casting to float32
2023-07-05 16:46:49 logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
2023-07-05 16:46:49 2023-07-05 21:46:49,643 INFO worker.py:1429 -- Connecting to existing Ray cluster at address: 172.17.0.2:6379...
2023-07-05 16:46:54 2023-07-05 21:46:54,656 ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-07-05 16:46:54 2023-07-05 21:46:54,656 WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 172.17.0.2:6379. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.
2023-07-05 16:47:01 2023-07-05 21:47:01,671 ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-07-05 16:47:01 2023-07-05 21:47:01,671 WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 172.17.0.2:6379. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.
2023-07-05 16:47:08 2023-07-05 21:47:08,687 ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-07-05 16:47:08 2023-07-05 21:47:08,687 WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 172.17.0.2:6379. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.
2023-07-05 16:47:15 2023-07-05 21:47:15,701 ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-07-05 16:47:15 2023-07-05 21:47:15,701 WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 172.17.0.2:6379. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.
2023-07-05 16:47:22 2023-07-05 21:47:22,714 ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-07-05 16:47:22 2023-07-05 21:47:22,714 WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 172.17.0.2:6379. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.
2023-07-05 16:47:29 2023-07-05 21:47:29,727 ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-07-05 16:47:29 2023-07-05 21:47:29,727 WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 172.17.0.2:6379. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.
2023-07-05 16:47:36 2023-07-05 21:47:36,739 ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-07-05 16:47:36 2023-07-05 21:47:36,739 WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 172.17.0.2:6379. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.
2023-07-05 16:47:43 2023-07-05 21:47:43,752 ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-07-05 16:47:43 2023-07-05 21:47:43,752 WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 172.17.0.2:6379. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.
2023-07-05 16:47:50 2023-07-05 21:47:50,768 ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-07-05 16:47:50 2023-07-05 21:47:50,768 WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 172.17.0.2:6379. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.
2023-07-05 16:47:57 2023-07-05 21:47:57,784 ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-07-05 16:47:57 2023-07-05 21:47:57,785 WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 172.17.0.2:6379. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.
2023-07-05 16:48:04 2023-07-05 21:48:04,799 ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-07-05 16:48:04 2023-07-05 21:48:04,799 WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 172.17.0.2:6379. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.
2023-07-05 16:48:11 2023-07-05 21:48:11,814 ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-07-05 16:48:11 2023-07-05 21:48:11,815 WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 172.17.0.2:6379. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.
2023-07-05 16:48:18 2023-07-05 21:48:18,830 ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-07-05 16:48:18 2023-07-05 21:48:18,830 WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 172.17.0.2:6379. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.
2023-07-05 16:48:25 2023-07-05 21:48:25,842 ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-07-05 16:48:25 2023-07-05 21:48:25,842 WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 172.17.0.2:6379. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.
2023-07-05 16:48:32 2023-07-05 21:48:32,857 ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-07-05 16:48:32 2023-07-05 21:48:32,857 WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 172.17.0.2:6379. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.
2023-07-05 16:48:39 2023-07-05 21:48:39,872 ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-07-05 16:48:39 2023-07-05 21:48:39,872 WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 172.17.0.2:6379. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.
2023-07-05 16:48:46 2023-07-05 21:48:46,885 ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-07-05 16:48:46 2023-07-05 21:48:46,885 WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 172.17.0.2:6379. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.
2023-07-05 16:48:53 2023-07-05 21:48:53,900 ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-07-05 16:48:53 2023-07-05 21:48:53,900 WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 172.17.0.2:6379. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.
2023-07-05 16:49:00 2023-07-05 21:49:00,918 ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-07-05 16:49:00 2023-07-05 21:49:00,919 WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 172.17.0.2:6379. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.
2023-07-05 16:49:07 2023-07-05 21:49:07,933 ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-07-05 16:49:07 2023-07-05 21:49:07,934 WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 172.17.0.2:6379. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.
2023-07-05 16:49:09 2023-07-05 21:49:09,936 INFO worker.py:1584 -- Failed to connect to the default Ray cluster address at 172.17.0.2:6379. This is most likely due to a previous Ray instance that has since crashed. To reset the default address to connect to, run `ray stop` or restart Ray with `ray start`.
2023-07-05 16:49:09 Traceback (most recent call last):
2023-07-05 16:49:09 File "/usr/local/bin/serve", line 8, in <module>
2023-07-05 16:49:09 sys.exit(cli())
2023-07-05 16:49:09 File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
2023-07-05 16:49:09 2023-07-05 21:46:47,649 INFO scripts.py:407 -- Running import path: 'serve_agent:agent'.
2023-07-05 16:49:09 return self.main(*args, **kwargs)
2023-07-05 16:49:09 File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1055, in main
2023-07-05 16:49:09 rv = self.invoke(ctx)
2023-07-05 16:49:09 File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
2023-07-05 16:49:09 return _process_result(sub_ctx.command.invoke(sub_ctx))
2023-07-05 16:49:09 File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
2023-07-05 16:49:09 return ctx.invoke(self.callback, **ctx.params)
2023-07-05 16:49:09 File "/usr/local/lib/python3.10/site-packages/click/core.py", line 760, in invoke
2023-07-05 16:49:09 return __callback(*args, **kwargs)
2023-07-05 16:49:09 File "/usr/local/lib/python3.10/site-packages/ray/serve/scripts.py", line 409, in run
2023-07-05 16:49:09 import_attr(import_path), args_dict
2023-07-05 16:49:09 File "/usr/local/lib/python3.10/site-packages/ray/_private/utils.py", line 1190, in import_attr
2023-07-05 16:49:09 module = importlib.import_module(module_name)
2023-07-05 16:49:09 File "/usr/local/lib/python3.10/importlib/__init__.py", line 126, in import_module
2023-07-05 16:49:09 return _bootstrap._gcd_import(name[level:], package, level)
2023-07-05 16:49:09 File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
2023-07-05 16:49:09 File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
2023-07-05 16:49:09 File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
2023-07-05 16:49:09 File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
2023-07-05 16:49:09 File "<frozen importlib._bootstrap_external>", line 883, in exec_module
2023-07-05 16:49:09 File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
2023-07-05 16:49:09 File "/src/./serve_agent.py", line 58, in <module>
2023-07-05 16:49:09 serve.run(agent)
2023-07-05 16:49:09 File "/usr/local/lib/python3.10/site-packages/ray/serve/api.py", line 447, in run
2023-07-05 16:49:09 client = _private_api.serve_start(
2023-07-05 16:49:09 File "/usr/local/lib/python3.10/site-packages/ray/serve/_private/api.py", line 299, in serve_start
2023-07-05 16:49:09 client = get_global_client(_health_check_controller=True)
2023-07-05 16:49:09 File "/usr/local/lib/python3.10/site-packages/ray/serve/context.py", line 59, in get_global_client
2023-07-05 16:49:09 return _connect()
2023-07-05 16:49:09 File "/usr/local/lib/python3.10/site-packages/ray/serve/context.py", line 105, in _connect
2023-07-05 16:49:09 ray.init(namespace=SERVE_NAMESPACE)
2023-07-05 16:49:09 File "/usr/local/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
2023-07-05 16:49:09 return func(*args, **kwargs)
2023-07-05 16:49:09 File "/usr/local/lib/python3.10/site-packages/ray/_private/worker.py", line 1575, in init
2023-07-05 16:49:09 _global_node = ray._private.node.Node(
2023-07-05 16:49:09 File "/usr/local/lib/python3.10/site-packages/ray/_private/node.py", line 186, in __init__
2023-07-05 16:49:09 session_name = ray._private.utils.internal_kv_get_with_retry(
2023-07-05 16:49:09 File "/usr/local/lib/python3.10/site-packages/ray/_private/utils.py", line 1412, in internal_kv_get_with_retry
2023-07-05 16:49:09 raise ConnectionError(
2023-07-05 16:49:09 ConnectionError: Could not read 'session_name' from GCS. Did GCS start successfully?
Also Dockerfile for reference:
# Dockerfile
FROM python:3.10.11
# Install libraries and dependencies
RUN apt-get update && \
apt-get install -y --no-install-recommends \
&& rm -rf /var/lib/apt/lists/*
RUN apt-get update && \
apt-get install -y --no-install-recommends
WORKDIR /src
COPY requirements.txt /src
RUN pip3 install -r requirements.txt
COPY . /src
WORKDIR /src
RUN ["ray", "start", "--head"]
CMD ["serve", "run", "serve_agent:agent"]
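A note on the Dockerfile above: RUN instructions execute while the image is being built, so a head node started by RUN ray start --head is not running when the container later starts. That may be why serve run keeps trying to reach an address where nothing is listening. Dropping the RUN line lets serve run start its own local Ray instance, as in the first log ("Started a local Ray instance"). If a separately started head node is wanted, one possible approach is to start it when the container starts. The sketch below assumes a hypothetical start.py used as the image CMD; the two commands themselves are the ones already used in this thread.

```python
# start.py (hypothetical launcher, e.g. CMD ["python", "start.py"])
import subprocess
import sys
from typing import Callable, List, Sequence

# Commands taken from the Dockerfile in this thread.
STARTUP_COMMANDS: List[List[str]] = [
    ["ray", "start", "--head"],             # start the head node at run time
    ["serve", "run", "serve_agent:agent"],  # then deploy the Serve app
]


def run_all(
    commands: Sequence[Sequence[str]],
    runner: Callable[[Sequence[str]], int] = subprocess.call,
) -> int:
    """Run commands in order; stop at the first non-zero exit code."""
    for cmd in commands:
        code = runner(cmd)
        if code != 0:
            return code
    return 0


if __name__ == "__main__":
    sys.exit(run_all(STARTUP_COMMANDS))
```

The runner parameter exists only so the ordering logic can be exercised without Ray installed; in the container the default subprocess.call is used.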
Hmm, this is also interesting: it looks like there are already Ray instances running. Maybe try RUN ray stop && ray start --head?
Still seeing something similar. I've tried both
docker run -p 8000:8000 rl-agent
and
docker run rl-agent
# Error message
$ docker run rl-agent
2023-07-05 21:59:12,880 WARNING deprecation.py:50 -- DeprecationWarning: `DirectStepOptimizer` has been deprecated. This will raise an error in the future!
/usr/local/lib/python3.10/site-packages/gymnasium/spaces/box.py:127: UserWarning: WARN: Box bound precision lowered by casting to float32
logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
2023-07-05 21:59:13,187 INFO worker.py:1429 -- Connecting to existing Ray cluster at address: 172.17.0.3:6379...
2023-07-05 21:59:18,799 ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-07-05 21:59:18,800 WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 172.17.0.3:6379. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.
2023-07-05 21:59:26,314 ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-07-05 21:59:26,314 WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 172.17.0.3:6379. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.
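For anyone hitting the same "Unable to connect to GCS" warning: a quick way to check item (2) from the log is to test whether the GCS address accepts TCP connections from inside the container at all. The following is only a minimal sketch using the Python standard library; `can_reach` is a hypothetical helper name, and the address 172.17.0.3:6379 is simply the one printed in the log above.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False
```

Running `can_reach("172.17.0.3", 6379)` inside the container would show whether anything is listening at the address `serve run` is trying to join. A `False` result would be consistent with a head node that was started in a `RUN` step at build time and is no longer running in the container, though that is an assumption and not something the logs confirm.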
@jayanthnair so I got something running
# Dockerfile
FROM python:3.10.11
# Install libraries and dependencies
RUN apt-get update && \
apt-get install -y --no-install-recommends \
&& rm -rf /var/lib/apt/lists/*
RUN apt-get update && \
apt-get install -y --no-install-recommends
WORKDIR /src
COPY requirements.txt /src
RUN pip install --upgrade pip
RUN pip install -r requirements.txt
COPY . /src
WORKDIR /src
CMD ["serve", "run", "serve_agent:agent"]
# requirements.txt
gymnasium==0.26.3
numpy==1.24.3
pandas==2.0.2
ray[data,rllib,serve] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp310-cp310-manylinux2014_aarch64.whl
torch==2.0.1
starlette==0.27.0
dm-tree==0.1.8
# azureml-mlflow
# azureml-defaults
# serve_agent.py
import ray.rllib.algorithms.ppo as ppo
from pathlib import Path
from ray import serve
from starlette.requests import Request

folder_path = "checkpoint_000001"
PATH_TO_CHECKPOINT = Path(__file__).absolute().parent / "rllib_checkpoint" / folder_path


@serve.deployment
class ServePPOModel:
    def __init__(self, checkpoint_path) -> None:
        # Re-create the originally used config.
        config = ppo.PPOConfig() \
            .framework("torch") \
            .rollouts(num_rollout_workers=0)
        # Build the Algorithm instance using the config.
        self.algorithm = config.build(env="CartPole-v0")
        # Restore the algo's state from the checkpoint.
        self.algorithm.restore(checkpoint_path)

    async def __call__(self, request: Request):
        json_input = await request.json()
        obs = json_input["observation"]
        action = self.algorithm.compute_single_action(obs)
        return {"action": int(action)}


agent = ServePPOModel.bind(PATH_TO_CHECKPOINT)
serve.run(agent)
# client.py
# Note: `gymnasium` (not `gym`) will be **the** API supported by RLlib from Ray 2.3 on.
try:
    import gymnasium as gym
    gymnasium = True
except Exception:
    import gym
    gymnasium = False

import requests

env = gym.make("CartPole-v1")

for _ in range(5):
    if gymnasium:
        obs, infos = env.reset()
    else:
        obs = env.reset()
    print(f"-> Sending observation {obs}")
    resp = requests.get(
        "http://localhost:8000/", json={"observation": obs.tolist()}
    )
    print(f"<- Received response {resp.json()}")
obs, infos = env.reset()
print(f"-> Sending observation {obs}")
resp = requests.get(
"http://localhost:8000/", json={"observation": obs.tolist()}
)
print(f"<- Received response {resp.json()}")
File structure looks like this
- /rllib_checkpoint
- /checkpoint_000001
- client.py
- Dockerfile
- requirements.txt
- serve_agent.py
I was able to see the success response after running the following:
- docker build . -t rl-agent:latest to create a Docker image
- docker run -it --rm rl-agent:latest to start the service
- docker ps to get the container ID
- docker exec -it c216b55a1b59 bash to open a shell in the container, and then run the client code client.py
Can you try those and let me know if this works?
Hi @GeneDer, when I follow these steps I can get the Serve deployment to work properly. As I understand it, the served agent can then be queried at port 8000 of the Docker container, from within the container. But my end goal is to be able to query it from outside the Docker container.
When I publish port 8000 of the container to port 8000 of my local machine and try to query it, I get a ConnectionError like the one below.
# docker run command
docker run -it --rm -d -p 8000:8000 rl-agent:latest
# error message
Jayanth.Nair@HZXS1Z2 MINGW64 ~/Desktop/drl_workflow/drl_working_group (deploytest)
$ python client.py
2023-07-06 09:18:31,966 WARNING deprecation.py:50 -- DeprecationWarning: `DirectStepOptimizer` has been deprecated. This will raise an error in the future!
{'start_level': 1, 'obs_normalization': True, 'reward_normalization': False, 'states': {'volume': [0, 30000], 'ca': [0, 200], 'rtime': [0, 500]}, 'actions': {'qa': [0, 10], 'qs': [0, 200]}, 'configvars': {'qout': [75, 250]}}
C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\gymnasium\spaces\box.py:127: UserWarning: WARN: Box bound precision lowered by casting to float32
logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
-> Sending observation [-0.2 -0.94652086 -0.7159825 ]
Traceback (most recent call last):
File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\urllib3\connectionpool.py", line 714, in urlopen
httplib_response = self._make_request(
File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\urllib3\connectionpool.py", line 466, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\urllib3\connectionpool.py", line 461, in _make_request
httplib_response = conn.getresponse()
File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\http\client.py", line 1375, in getresponse
response.begin()
File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\http\client.py", line 318, in begin
version, status, reason = self._read_status()
File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\http\client.py", line 287, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\requests\adapters.py", line 486, in send
resp = conn.urlopen(
File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\urllib3\connectionpool.py", line 798, in urlopen
retries = retries.increment(
File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\urllib3\util\retry.py", line 550, in increment
raise six.reraise(type(error), error, _stacktrace)
File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\urllib3\packages\six.py", line 769, in reraise
raise value.with_traceback(tb)
File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\urllib3\connectionpool.py", line 714, in urlopen
httplib_response = self._make_request(
File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\urllib3\connectionpool.py", line 466, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\urllib3\connectionpool.py", line 461, in _make_request
httplib_response = conn.getresponse()
File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\http\client.py", line 1375, in getresponse
response.begin()
File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\http\client.py", line 318, in begin
version, status, reason = self._read_status()
File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\http\client.py", line 287, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\jAYANTH.NAIR\Desktop\drl_workflow\drl_working_group\client.py", line 24, in <module>
resp = requests.get(
File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\requests\api.py", line 73, in get
return request("get", url, params=params, **kwargs)
File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\requests\api.py", line 59, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\requests\sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\requests\sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\requests\adapters.py", line 501, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
It seems the server is actively closing the connection?
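One way to narrow this down from the host is to separate "nothing is listening on the published port" from "the connection was accepted and then dropped without an HTTP response", which is what the `RemoteDisconnected` traceback above reports. Below is a minimal standard-library sketch; `probe` is a hypothetical helper, not part of Ray or requests, and the labels it returns are my own.

```python
import http.client


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify how host:port reacts to a plain HTTP GET on "/"."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", "/")
        conn.getresponse().read()
        return "ok"  # Some HTTP response came back (any status code).
    except ConnectionRefusedError:
        return "refused"  # Nothing accepted the TCP connection.
    except ConnectionError:
        # Includes http.client.RemoteDisconnected: the peer accepted the
        # connection and then closed it without sending a response.
        return "closed"
    except OSError:
        return "unreachable"  # Timeouts, routing problems, and similar.
    finally:
        conn.close()
```

Calling `probe("localhost", 8000)` from the host should return `"closed"` for the situation in the traceback. My assumption is that this points at the port being published while the server inside the container is not listening on an interface the forwarded traffic can reach.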
@jayanthnair I think I found the issue. It seems that setting the host to 0.0.0.0 is required for Docker to expose the service outside the container. I changed the last line of the Dockerfile to
ENTRYPOINT [ "ray", "start", "--head", "--port=6379", "--redis-shard-ports=6380,6381", "--object-manager-port=22345","--node-manager-port=22346","--dashboard-host=0.0.0.0","--block"]
And run the following docker commands:
- docker build . -t rl-agent:latest to rebuild the image
- docker run -it --rm -p 8000:8000 -p 8265:8265 rl-agent:latest to start the head node and expose ports 8000 and 8265
- docker ps to get the container ID
- docker exec -it <container_id> serve start --http-host "0.0.0.0" to start Ray Serve. You should be able to see that Serve started with host 0.0.0.0 at http://localhost:8265/#/serve/system
- docker exec -it <container_id> serve run serve_agent:agent to run the service
Now you should be able to query like you normally would. Let me know if that helps! I will continue to debug why serve run did not pass the host correctly and possibly file a separate bug ticket for it.
Related question: https://discuss.ray.io/t/what-is-best-practice-for-local-setup/6507
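For later readers, the rule behind the fix can be written down as a tiny check. This is only an illustrative sketch of general Docker port-publishing behaviour, not anything Ray provides, and `reachable_via_published_port` is a hypothetical name: traffic forwarded by `docker run -p` arrives on the container's external interface, so a server bound only to a loopback address inside the container never sees it.

```python
import ipaddress


def reachable_via_published_port(bind_host: str) -> bool:
    """Return True if a server bound to `bind_host` inside a container can
    receive traffic forwarded by `docker run -p`.

    Accepts IP address literals plus the name "localhost"; any other
    hostname raises ValueError.
    """
    if bind_host == "localhost":
        return False
    # Loopback addresses (127.0.0.0/8, ::1) only accept connections that
    # originate inside the container itself.
    return not ipaddress.ip_address(bind_host).is_loopback
```

With this, `reachable_via_published_port("127.0.0.1")` is `False` and `reachable_via_published_port("0.0.0.0")` is `True`, which matches what was observed in this thread: queries from inside the container worked, while queries from the host only worked after Serve was started with `--http-host "0.0.0.0"`.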
Thanks a lot @GeneDer this has solved the issue! I will file a separate bug ticket as you suggested as well.
I have also encountered the issue with your error report. Have you resolved it?
2023-07-05 12:54:10 (ServeReplica:default_ServePPOModel pid=494) 2023-07-05 17:54:10,180 WARNING algorithm_config.py:2534 -- Setting `exploration_config={}` because you set `_enable_rl_module_api=True`. When RLModule API are enabled, exploration_config can not be set. If you want to implement custom exploration behaviour, please modify the `forward_exploration` method of the RLModule at hand. On configs that have a default exploration config, this must be done with `config.exploration_config={}`.
@lyzyn I'm not sure if I have context for your issue. Can you file a separate issue and fill in the details? Also, the way you described it, it seems to be a rllib issue instead of serve?
What happened + What you expected to happen
Following up after my issue - 37042 was resolved, I was able to run serve locally on my terminal and get responses. However, when I export the agent in a docker container (essentially exporting the same repo), I get the following message,
The container hangs at the last message and I do not get the deployed successfully message.
Additionally, this is the dockerfile I use to create the container:
Versions / Dependencies
Reproduction script
Issue Severity
High: It blocks me from completing my task.