ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[<Ray component: Serve>] Running serve inside deployment docker container gives GrpcUnavailable error #37107

Closed jayanthnair closed 1 year ago

jayanthnair commented 1 year ago

What happened + What you expected to happen

Following up on issue #37042, which has been resolved: I can now run Serve locally from my terminal and get responses. However, when I package the agent in a Docker container (built from essentially the same repo), I get the following output:

2023-07-05 12:54:01 2023-07-05 17:54:01,301     WARNING deprecation.py:50 -- DeprecationWarning: `DirectStepOptimizer` has been deprecated. This will raise an error in the future!
2023-07-05 12:54:01 /usr/local/lib/python3.10/site-packages/gymnasium/spaces/box.py:127: UserWarning: WARN: Box bound precision lowered by casting to float32
2023-07-05 12:54:01   logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
2023-07-05 12:54:03 2023-07-05 17:54:03,740     WARNING services.py:1832 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=8.04gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2023-07-05 12:54:03 2023-07-05 17:54:03,870     INFO worker.py:1610 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
2023-07-05 12:54:06 (HTTPProxyActor pid=462) INFO:     Started server process [462]
2023-07-05 12:54:06 (ServeController pid=436) INFO 2023-07-05 17:54:06,850 controller 436 deployment_state.py:1316 - Deploying new version of deployment default_ServePPOModel.
2023-07-05 12:54:07 (ServeController pid=436) INFO 2023-07-05 17:54:06,955 controller 436 deployment_state.py:1583 - Adding 1 replica to deployment default_ServePPOModel.
2023-07-05 12:54:09 (ServeReplica:default_ServePPOModel pid=494) DeprecationWarning: `DirectStepOptimizer` has been deprecated. This will raise an error in the future!
2023-07-05 12:54:10 (ServeReplica:default_ServePPOModel pid=494) 2023-07-05 17:54:10,180        WARNING algorithm_config.py:2534 -- Setting `exploration_config={}` because you set `_enable_rl_module_api=True`. When RLModule API are enabled, exploration_config can not be set. If you want to implement custom exploration behaviour, please modify the `forward_exploration` method of the RLModule at hand. On configs that have a default exploration config, this must be done with `config.exploration_config={}`.
2023-07-05 12:54:10 (ServeReplica:default_ServePPOModel pid=494) 2023-07-05 17:54:10,181        WARNING algorithm_config.py:2534 -- Setting `exploration_config={}` because you set `_enable_rl_module_api=True`. When RLModule API are enabled, exploration_config can not be set. If you want to implement custom exploration behaviour, please modify the `forward_exploration` method of the RLModule at hand. On configs that have a default exploration config, this must be done with `config.exploration_config={}`.
2023-07-05 12:54:10 (ServeReplica:default_ServePPOModel pid=494) /usr/local/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py:484: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
2023-07-05 12:54:10 (ServeReplica:default_ServePPOModel pid=494) `UnifiedLogger` will be removed in Ray 2.7.
2023-07-05 12:54:10 (ServeReplica:default_ServePPOModel pid=494)   return UnifiedLogger(config, logdir, loggers=None)
2023-07-05 12:54:10 (ServeReplica:default_ServePPOModel pid=494) /usr/local/lib/python3.10/site-packages/ray/tune/logger/unified.py:53: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
2023-07-05 12:54:10 (ServeReplica:default_ServePPOModel pid=494) The `JsonLogger interface is deprecated in favor of the `ray.tune.json.JsonLoggerCallback` interface and will be removed in Ray 2.7.
2023-07-05 12:54:10 (ServeReplica:default_ServePPOModel pid=494)   self._loggers.append(cls(self.config, self.logdir, self.trial))
2023-07-05 12:54:10 (ServeReplica:default_ServePPOModel pid=494) /usr/local/lib/python3.10/site-packages/ray/tune/logger/unified.py:53: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
2023-07-05 12:54:10 (ServeReplica:default_ServePPOModel pid=494) The `CSVLogger interface is deprecated in favor of the `ray.tune.csv.CSVLoggerCallback` interface and will be removed in Ray 2.7.
2023-07-05 12:54:10 (ServeReplica:default_ServePPOModel pid=494)   self._loggers.append(cls(self.config, self.logdir, self.trial))
2023-07-05 12:54:10 (ServeReplica:default_ServePPOModel pid=494) /usr/local/lib/python3.10/site-packages/ray/tune/logger/unified.py:53: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
2023-07-05 12:54:10 (ServeReplica:default_ServePPOModel pid=494) The `TBXLogger interface is deprecated in favor of the `ray.tune.tensorboardx.TBXLoggerCallback` interface and will be removed in Ray 2.7.
2023-07-05 12:54:10 (ServeReplica:default_ServePPOModel pid=494)   self._loggers.append(cls(self.config, self.logdir, self.trial))
2023-07-05 12:54:10 (ServeReplica:default_ServePPOModel pid=494) 2023-07-05 17:54:10,225        WARNING algorithm_config.py:2534 -- Setting `exploration_config={}` because you set `_enable_rl_module_api=True`. When RLModule API are enabled, exploration_config can not be set. If you want to implement custom exploration behaviour, please modify the `forward_exploration` method of the RLModule at hand. On configs that have a default exploration config, this must be done with `config.exploration_config={}`.
2023-07-05 12:54:10 (ServeReplica:default_ServePPOModel pid=494) 2023-07-05 17:54:10,247        WARNING deprecation.py:50 -- DeprecationWarning: `ValueNetworkMixin` has been deprecated. This will raise an error in the future!
2023-07-05 12:54:10 (ServeReplica:default_ServePPOModel pid=494) 2023-07-05 17:54:10,247        WARNING deprecation.py:50 -- DeprecationWarning: `LearningRateSchedule` has been deprecated. This will raise an error in the future!
2023-07-05 12:54:10 (ServeReplica:default_ServePPOModel pid=494) 2023-07-05 17:54:10,247        WARNING deprecation.py:50 -- DeprecationWarning: `EntropyCoeffSchedule` has been deprecated. This will raise an error in the future!
2023-07-05 12:54:10 (ServeReplica:default_ServePPOModel pid=494) 2023-07-05 17:54:10,247        WARNING deprecation.py:50 -- DeprecationWarning: `KLCoeffMixin` has been deprecated. This will raise an error in the future!
2023-07-05 12:54:10 (ServeReplica:default_ServePPOModel pid=494) Install gputil for GPU system monitoring.
2023-07-05 12:54:11 (ServeReplica:default_ServePPOModel pid=494) /usr/local/lib/python3.10/site-packages/gymnasium/spaces/box.py:127: UserWarning: WARN: Box bound precision lowered by casting to float32
2023-07-05 12:54:11 (ServeReplica:default_ServePPOModel pid=494)   logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
2023-07-05 12:54:11 (ServeReplica:default_ServePPOModel pid=494) 2023-07-05 17:54:11,506        WARNING algorithm_config.py:2534 -- Setting `exploration_config={}` because you set `_enable_rl_module_api=True`. When RLModule API are enabled, exploration_config can not be set. If you want to implement custom exploration behaviour, please modify the `forward_exploration` method of the RLModule at hand. On configs that have a default exploration config, this must be done with `config.exploration_config={}`.
2023-07-05 12:54:11 (ServeReplica:default_ServePPOModel pid=494) 2023-07-05 17:54:11,507        WARNING algorithm_config.py:656 -- Cannot create PPOConfig from given `config_dict`! Property cluster_name not supported.
2023-07-05 12:54:11 (ServeReplica:default_ServePPOModel pid=494) 2023-07-05 17:54:11,509        WARNING policy.py:1065 -- `observation_space` in given policy state (Box(-inf, inf, (3,), float32)) does not match this Policy's observation space (Box(0.0, [3.e+04 2.e+01 3.e+02], (3,), float32)).
2023-07-05 12:54:11 (ServeReplica:default_ServePPOModel pid=494) Restored on 172.17.0.2 from checkpoint: /src/inference_checkpoints/checkpoint_000020
2023-07-05 12:54:11 (ServeReplica:default_ServePPOModel pid=494) Current state after restoring: {'_iteration': 20, '_timesteps_total': None, '_time_total': 319.60047125816345, '_episodes_total': 287}
2023-07-05 12:54:11 2023-07-05 17:54:11,780     INFO router.py:893 -- Using PowerOfTwoChoicesReplicaScheduler.
2023-07-05 12:54:11 2023-07-05 17:54:11,788     INFO router.py:370 -- Got updated replicas for deployment default_ServePPOModel: {'default_ServePPOModel#yYTEqz'}.
2023-07-05 12:54:11 The new client HTTP config differs from the existing one in the following fields: ['host']. The new HTTP config is ignored.
2023-07-05 12:54:11 The new client HTTP config differs from the existing one in the following fields: ['host']. The new HTTP config is ignored.
2023-07-05 12:54:11 (ServeController pid=436) INFO 2023-07-05 17:54:11,822 controller 436 deployment_state.py:1316 - Deploying new version of deployment default_ServePPOModel.
2023-07-05 12:54:11 2023-07-05 17:54:11,929     INFO router.py:370 -- Got updated replicas for deployment default_ServePPOModel: set().
2023-07-05 12:54:11 (ServeController pid=436) INFO 2023-07-05 17:54:11,927 controller 436 deployment_state.py:1466 - Stopping 1 replicas of deployment 'default_ServePPOModel' with outdated versions.
2023-07-05 12:54:14 (ServeController pid=436) INFO 2023-07-05 17:54:13,988 controller 436 deployment_state.py:1583 - Adding 1 replica to deployment default_ServePPOModel.
2023-07-05 12:54:16 (ServeReplica:default_ServePPOModel pid=576) DeprecationWarning: `DirectStepOptimizer` has been deprecated. This will raise an error in the future!
2023-07-05 12:54:17 (ServeReplica:default_ServePPOModel pid=576) 2023-07-05 17:54:17,159        WARNING algorithm_config.py:2534 -- Setting `exploration_config={}` because you set `_enable_rl_module_api=True`. When RLModule API are enabled, exploration_config can not be set. If you want to implement custom exploration behaviour, please modify the `forward_exploration` method of the RLModule at hand. On configs that have a default exploration config, this must be done with `config.exploration_config={}`.
2023-07-05 12:54:17 (ServeReplica:default_ServePPOModel pid=576) 2023-07-05 17:54:17,160        WARNING algorithm_config.py:2534 -- Setting `exploration_config={}` because you set `_enable_rl_module_api=True`. When RLModule API are enabled, exploration_config can not be set. If you want to implement custom exploration behaviour, please modify the `forward_exploration` method of the RLModule at hand. On configs that have a default exploration config, this must be done with `config.exploration_config={}`.
2023-07-05 12:54:17 (ServeReplica:default_ServePPOModel pid=576) /usr/local/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py:484: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
2023-07-05 12:54:17 (ServeReplica:default_ServePPOModel pid=576) `UnifiedLogger` will be removed in Ray 2.7.
2023-07-05 12:54:17 (ServeReplica:default_ServePPOModel pid=576)   return UnifiedLogger(config, logdir, loggers=None)
2023-07-05 12:54:17 (ServeReplica:default_ServePPOModel pid=576) /usr/local/lib/python3.10/site-packages/ray/tune/logger/unified.py:53: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
2023-07-05 12:54:17 (ServeReplica:default_ServePPOModel pid=576) The `JsonLogger interface is deprecated in favor of the `ray.tune.json.JsonLoggerCallback` interface and will be removed in Ray 2.7.
2023-07-05 12:54:17 (ServeReplica:default_ServePPOModel pid=576)   self._loggers.append(cls(self.config, self.logdir, self.trial))
2023-07-05 12:54:17 (ServeReplica:default_ServePPOModel pid=576) /usr/local/lib/python3.10/site-packages/ray/tune/logger/unified.py:53: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
2023-07-05 12:54:17 (ServeReplica:default_ServePPOModel pid=576) The `CSVLogger interface is deprecated in favor of the `ray.tune.csv.CSVLoggerCallback` interface and will be removed in Ray 2.7.
2023-07-05 12:54:17 (ServeReplica:default_ServePPOModel pid=576)   self._loggers.append(cls(self.config, self.logdir, self.trial))
2023-07-05 12:54:17 (ServeReplica:default_ServePPOModel pid=576) /usr/local/lib/python3.10/site-packages/ray/tune/logger/unified.py:53: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
2023-07-05 12:54:17 (ServeReplica:default_ServePPOModel pid=576) The `TBXLogger interface is deprecated in favor of the `ray.tune.tensorboardx.TBXLoggerCallback` interface and will be removed in Ray 2.7.
2023-07-05 12:54:17 (ServeReplica:default_ServePPOModel pid=576)   self._loggers.append(cls(self.config, self.logdir, self.trial))
2023-07-05 12:54:17 (ServeReplica:default_ServePPOModel pid=576) 2023-07-05 17:54:17,185        WARNING algorithm_config.py:2534 -- Setting `exploration_config={}` because you set `_enable_rl_module_api=True`. When RLModule API are enabled, exploration_config can not be set. If you want to implement custom exploration behaviour, please modify the `forward_exploration` method of the RLModule at hand. On configs that have a default exploration config, this must be done with `config.exploration_config={}`.
2023-07-05 12:54:17 (ServeReplica:default_ServePPOModel pid=576) 2023-07-05 17:54:17,193        WARNING deprecation.py:50 -- DeprecationWarning: `ValueNetworkMixin` has been deprecated. This will raise an error in the future!
2023-07-05 12:54:17 (ServeReplica:default_ServePPOModel pid=576) 2023-07-05 17:54:17,193        WARNING deprecation.py:50 -- DeprecationWarning: `LearningRateSchedule` has been deprecated. This will raise an error in the future!
2023-07-05 12:54:17 (ServeReplica:default_ServePPOModel pid=576) 2023-07-05 17:54:17,193        WARNING deprecation.py:50 -- DeprecationWarning: `EntropyCoeffSchedule` has been deprecated. This will raise an error in the future!
2023-07-05 12:54:17 (ServeReplica:default_ServePPOModel pid=576) 2023-07-05 17:54:17,193        WARNING deprecation.py:50 -- DeprecationWarning: `KLCoeffMixin` has been deprecated. This will raise an error in the future!
2023-07-05 12:54:17 (ServeReplica:default_ServePPOModel pid=576) Install gputil for GPU system monitoring.
2023-07-05 12:54:17 (ServeReplica:default_ServePPOModel pid=576) /usr/local/lib/python3.10/site-packages/gymnasium/spaces/box.py:127: UserWarning: WARN: Box bound precision lowered by casting to float32
2023-07-05 12:54:17 (ServeReplica:default_ServePPOModel pid=576)   logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
2023-07-05 12:54:17 (ServeReplica:default_ServePPOModel pid=576) 2023-07-05 17:54:17,932        WARNING algorithm_config.py:2534 -- Setting `exploration_config={}` because you set `_enable_rl_module_api=True`. When RLModule API are enabled, exploration_config can not be set. If you want to implement custom exploration behaviour, please modify the `forward_exploration` method of the RLModule at hand. On configs that have a default exploration config, this must be done with `config.exploration_config={}`.
2023-07-05 12:54:17 (ServeReplica:default_ServePPOModel pid=576) 2023-07-05 17:54:17,933        WARNING algorithm_config.py:656 -- Cannot create PPOConfig from given `config_dict`! Property cluster_name not supported.
2023-07-05 12:54:17 (ServeReplica:default_ServePPOModel pid=576) 2023-07-05 17:54:17,936        WARNING policy.py:1065 -- `observation_space` in given policy state (Box(-inf, inf, (3,), float32)) does not match this Policy's observation space (Box(0.0, [3.e+04 2.e+01 3.e+02], (3,), float32)).
2023-07-05 12:54:17 (ServeReplica:default_ServePPOModel pid=576) Restored on 172.17.0.2 from checkpoint: /src/inference_checkpoints/checkpoint_000020
2023-07-05 12:54:17 (ServeReplica:default_ServePPOModel pid=576) Current state after restoring: {'_iteration': 20, '_timesteps_total': None, '_time_total': 319.60047125816345, '_episodes_total': 287}
2023-07-05 12:54:18 2023-07-05 17:54:18,032     INFO router.py:370 -- Got updated replicas for deployment default_ServePPOModel: {'default_ServePPOModel#uOxDGA'}.
2023-07-05 12:55:03 (pid=gcs_server) [2023-07-05 17:55:03,761 E 36 36] (gcs_server) gcs_job_manager.cc:227: Failed to get is_running_tasks from core worker: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: 

The container hangs at the last message, and I never get the "Deployed Serve App successfully" message.
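A side note on the `/dev/shm` warning in the log above: 67108864 bytes is 64 MiB, which is Docker's default `/dev/shm` size, and the warning asks for more than 30% of available RAM. This is probably unrelated to the hang. A minimal sketch of that check follows; the function names and the 8 GiB RAM figure are illustrative assumptions, not values from this report.

```python
import shutil


def shm_is_sufficient(shm_bytes: int, ram_bytes: int, fraction: float = 0.30) -> bool:
    """Return True if shared memory exceeds `fraction` of RAM (the threshold named in the warning)."""
    return shm_bytes > fraction * ram_bytes


def available_shm_bytes(path: str = "/dev/shm") -> int:
    """Free space on the filesystem mounted at `path` (meaningful on Linux containers)."""
    return shutil.disk_usage(path).free


# Value reported in the log above: 64 MiB.
LOGGED_SHM_BYTES = 67108864
# Hypothetical RAM size, for illustration only.
EXAMPLE_RAM_BYTES = 8 * 1024**3

logged_ok = shm_is_sufficient(LOGGED_SHM_BYTES, EXAMPLE_RAM_BYTES)
print(f"64 MiB of /dev/shm sufficient for 8 GiB RAM: {logged_ok}")
```

Inside the container, `available_shm_bytes()` can be compared against the machine's RAM the same way before and after passing `--shm-size` to `docker run`.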

Additionally, this is the Dockerfile I use to build the container:

# Dockerfile

FROM python:3.10.11
# Install libraries and dependencies
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    && rm -rf /var/lib/apt/lists/*

RUN apt-get update && \
    apt-get install -y --no-install-recommends

WORKDIR /src

COPY requirements.txt /src
RUN pip3 install -r requirements.txt

COPY . /src

WORKDIR /src

CMD ["serve", "run", "-h", "0.0.0.0", "serve_agent:agent"]

# requirements.txt

gymnasium==0.26.3
numpy==1.24.3
pandas==2.0.2
ray[data,rllib,serve] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp310-cp310-manylinux2014_x86_64.whl
torch==2.0.1
starlette==0.27.0
dm-tree==0.1.8
azureml-mlflow
azureml-defaults

Versions / Dependencies


adal==1.2.7
aiohttp==3.8.4
aiohttp-cors==0.7.0
aiorwlock==1.3.0
aiosignal==1.3.1
anyio==3.7.0
argcomplete==2.1.2
async-timeout==4.0.2
attrs==23.1.0
azure-common==1.1.28
azure-core==1.27.1
azure-graphrbac==0.61.1
azure-identity==1.13.0
azure-mgmt-authorization==3.0.0
azure-mgmt-containerregistry==10.1.0
azure-mgmt-core==1.4.0
azure-mgmt-keyvault==10.2.2
azure-mgmt-resource==21.2.1
azure-mgmt-storage==20.1.0
azure-storage-blob==12.13.0
azureml-core==1.48.0
azureml-dataprep==4.8.6
azureml-dataprep-native==38.0.0
azureml-dataprep-rslex==2.15.2
azureml-dataset-runtime==1.48.0
azureml-defaults==1.48.0
azureml-inference-server-http==0.7.7
azureml-mlflow==1.52.0
backports.tempfile==1.0
backports.weakref==1.0.post1
bcrypt==4.0.1
blessed==1.20.0
blinker==1.6.2
cachetools==5.3.1
certifi==2023.5.7
cffi==1.15.1
charset-normalizer==3.1.0
click==8.1.3
cloudpickle==2.2.1
cmake==3.26.4
colorful==0.5.5
contextlib2==21.6.0
cryptography==38.0.4
databricks-cli==0.17.7
distlib==0.3.6
distro==1.8.0
dm-tree==0.1.8
docker==6.1.3
dotnetcore2==3.1.23
entrypoints==0.4
exceptiongroup==1.1.2
fastapi==0.99.1
filelock==3.12.2
Flask==2.3.2
Flask-Cors==3.0.10
frozenlist==1.3.3
fsspec==2023.6.0
fusepy==3.0.1
gitdb==4.0.10
GitPython==3.1.31
google-api-core==2.11.1
google-auth==2.21.0
googleapis-common-protos==1.59.1
gpustat==1.1
grpcio==1.56.0
gunicorn==20.1.0
Gymnasium==0.26.3
gymnasium-notices==0.0.1
h11==0.14.0
humanfriendly==10.0
idna==3.4
imageio==2.31.1
importlib-metadata==6.7.0
inference-schema==1.5.1
isodate==0.6.1
itsdangerous==2.1.2
jeepney==0.8.0
Jinja2==3.1.2
jmespath==1.0.1
jsonpickle==2.2.0
jsonschema==4.17.3
knack==0.10.1
lazy_loader==0.3
lit==16.0.6
lz4==4.3.2
markdown-it-py==3.0.0
MarkupSafe==2.1.3
mdurl==0.1.2
mlflow-skinny==2.4.1
mpmath==1.3.0
msal==1.22.0
msal-extensions==1.0.0
msgpack==1.0.5
msrest==0.7.1
msrestazure==0.6.4
multidict==6.0.4
ndg-httpsclient==0.5.1
networkx==3.1
numpy==1.24.3
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.2.10.91
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusparse-cu11==11.7.4.91
nvidia-ml-py==11.525.131
nvidia-nccl-cu11==2.14.3
nvidia-nvtx-cu11==11.7.91
oauthlib==3.2.2
opencensus==0.11.2
opencensus-context==0.1.3
opencensus-ext-azure==1.1.9
packaging==21.3
pandas==2.0.2
paramiko==2.12.0
pathspec==0.11.1
Pillow==10.0.0
pkginfo==1.9.6
platformdirs==3.8.0
portalocker==2.7.0
prometheus-client==0.17.0
protobuf==4.23.3
psutil==5.9.5
py-spy==0.3.14
pyarrow==9.0.0
pyasn1==0.5.0
pyasn1-modules==0.3.0
pycparser==2.21
pydantic==1.10.11
Pygments==2.15.1
PyJWT==2.7.0
PyNaCl==1.5.0
pyOpenSSL==22.1.0
pyparsing==3.1.0
pyrsistent==0.19.3
PySocks==1.7.1
python-dateutil==2.8.2
pytz==2023.3
PyWavelets==1.4.1
PyYAML==6.0
ray @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp310-cp310-manylinux2014_x86_64.whl
requests==2.31.0
requests-oauthlib==1.3.1
rich==13.4.2
rsa==4.9
scikit-image==0.21.0
scipy==1.11.1
SecretStorage==3.3.3
six==1.16.0
smart-open==6.3.0
smmap==5.0.0
sniffio==1.3.0
sqlparse==0.4.4
starlette==0.27.0
sympy==1.12
tabulate==0.9.0
tensorboardX==2.6.1
tifffile==2023.7.4
torch==2.0.1
triton==2.0.0
typer==0.9.0
typing_extensions==4.7.1
tzdata==2023.3
urllib3==1.26.16
uvicorn==0.22.0
virtualenv==20.21.0
wcwidth==0.2.6
websocket-client==1.6.1
Werkzeug==2.3.6
wrapt==1.12.1
yarl==1.9.2
zipp==3.15.0

Reproduction script

# serve_agent.py

from starlette.requests import Request
import ray.rllib.algorithms.ppo as ppo
from ray import serve
import gymnasium.spaces as spaces
import numpy as np
from pathlib import Path
import requests
# from ray.tune.registry import register_env
# from mixing_environment.mixing_env.env_creator import mixing_env_creator
# register_env("mixing_environment", mixing_env_creator)
# Update this to match location of checkpoint
folder_path = "checkpoint_000020"
PATH_TO_CHECKPOINT = Path(__file__).absolute().parent / "inference_checkpoints" / folder_path

# Update this to reflect observation and action space in environment
observation_space = spaces.Box(
    low=np.array([0, 0, 0]),
    high=np.array([30000, 20, 300]),
    shape=(3,),
    dtype=np.float32,
)
action_space = spaces.Box(
    low=np.array([0, 0]),
    high=np.array([10, 200]),
    shape=(2,),
    dtype=np.float32,
)


@serve.deployment
class ServePPOModel:
    """
    Class which defines how the model is served

    args:
        - checkpoint_path (str): path to the checkpoint

    """
    def __init__(self, checkpoint_path) -> None:
        # Re-create the originally used config. - NOTE this assumes PPO, update for a different algorithm as needed
        config = ppo.PPOConfig()\
            .framework("torch")\
            .rollouts(num_rollout_workers=0)

        # Build the Algorithm instance using the config. env=None since it is not needed for inference
        self.algorithm = config.environment(env=None,observation_space=observation_space,action_space=action_space).build()
        # self.algorithm = config.build(env="mixing_environment")
        # Restore the algorithm state from the checkpoint
        self.algorithm.restore(checkpoint_path)
        print('restored!')

    async def __call__(self, request: Request):
        json_input = await request.json()
        # The request body is a JSON object; its "observation" key holds the list of state values
        obs = json_input["observation"]
        action = self.algorithm.compute_single_action(obs)
        return {"action": action}

agent = ServePPOModel.bind(PATH_TO_CHECKPOINT)
serve.run(agent)
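One thing worth checking, offered as an assumption rather than a confirmed diagnosis: the log shows `Deploying new version of deployment default_ServePPOModel` twice, a replica stopped as outdated, and "The new HTTP config is ignored". `serve run serve_agent:agent` imports `serve_agent`, so the module-level `serve.run(agent)` above would execute during that import, before the CLI deploys the same application itself. The stdlib-only sketch below illustrates just the import behaviour; `run`, `agent`, and `fake_serve_agent` are hypothetical stand-ins and do not touch Ray.

```python
import importlib.util
import tempfile
from pathlib import Path

# Source of a throwaway module; `run` stands in for serve.run and only records calls.
MODULE_SOURCE = '''
calls = []

def run(app):
    calls.append(app)

agent = "agent-app"  # stand-in for ServePPOModel.bind(...)

{tail}
'''


def import_and_count(tail: str) -> int:
    """Import a temporary module ending with `tail` and return how often run() was called."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "fake_serve_agent.py"
        path.write_text(MODULE_SOURCE.format(tail=tail))
        spec = importlib.util.spec_from_file_location("fake_serve_agent", path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # same effect as a plain import
        return len(module.calls)


# Module-level call: runs as a side effect of the import.
unguarded = import_and_count("run(agent)")
# Guarded call: skipped on import, runs only with `python fake_serve_agent.py`.
guarded = import_and_count('if __name__ == "__main__":\n    run(agent)')

print(f"unguarded calls on import: {unguarded}, guarded calls on import: {guarded}")
```

If this reading is right, removing the final `serve.run(agent)` line or moving it under `if __name__ == "__main__":` would leave `serve run serve_agent:agent` as the only deploy.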

Issue Severity

High: It blocks me from completing my task.

GeneDer commented 1 year ago

@jayanthnair this seems to be a networking issue. I wasn't able to build a docker image using the file you provided. But perhaps you can try to just drop "-h", "0.0.0.0" when starting Ray Serve?

jayanthnair commented 1 year ago

@GeneDer I tried that. Still getting the same issue. Curiously, the moment I stop the container, it says Deployed Serve App successfully.

2023-07-05 16:35:30 2023-07-05 21:35:30,069     WARNING deprecation.py:50 -- DeprecationWarning: `DirectStepOptimizer` has been deprecated. This will raise an error in the future!
2023-07-05 16:35:30 /usr/local/lib/python3.10/site-packages/gymnasium/spaces/box.py:127: UserWarning: WARN: Box bound precision lowered by casting to float32
2023-07-05 16:35:30   logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
2023-07-05 16:35:32 2023-07-05 21:35:32,142     WARNING services.py:1832 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=8.05gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2023-07-05 16:35:32 2023-07-05 21:35:32,275     INFO worker.py:1610 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
2023-07-05 16:35:34 (HTTPProxyActor pid=462) INFO:     Started server process [462]
2023-07-05 16:35:35 (ServeController pid=435) INFO 2023-07-05 21:35:34,899 controller 435 deployment_state.py:1316 - Deploying new version of deployment default_ServePPOModel.
2023-07-05 16:35:35 (ServeController pid=435) INFO 2023-07-05 21:35:35,006 controller 435 deployment_state.py:1583 - Adding 1 replica to deployment default_ServePPOModel.
2023-07-05 16:35:37 (ServeReplica:default_ServePPOModel pid=494) DeprecationWarning: `DirectStepOptimizer` has been deprecated. This will raise an error in the future!
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) 2023-07-05 21:35:37,956        WARNING algorithm_config.py:2534 -- Setting `exploration_config={}` because you set `_enable_rl_module_api=True`. When RLModule API are enabled, exploration_config can not be set. If you want to implement custom exploration behaviour, please modify the `forward_exploration` method of the RLModule at hand. On configs that have a default exploration config, this must be done with `config.exploration_config={}`.
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) 2023-07-05 21:35:37,957        WARNING algorithm_config.py:2534 -- Setting `exploration_config={}` because you set `_enable_rl_module_api=True`. When RLModule API are enabled, exploration_config can not be set. If you want to implement custom exploration behaviour, please modify the `forward_exploration` method of the RLModule at hand. On configs that have a default exploration config, this must be done with `config.exploration_config={}`.
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) /usr/local/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py:484: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) `UnifiedLogger` will be removed in Ray 2.7.
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494)   return UnifiedLogger(config, logdir, loggers=None)
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) /usr/local/lib/python3.10/site-packages/ray/tune/logger/unified.py:53: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) The `JsonLogger interface is deprecated in favor of the `ray.tune.json.JsonLoggerCallback` interface and will be removed in Ray 2.7.
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494)   self._loggers.append(cls(self.config, self.logdir, self.trial))
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) /usr/local/lib/python3.10/site-packages/ray/tune/logger/unified.py:53: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) The `CSVLogger interface is deprecated in favor of the `ray.tune.csv.CSVLoggerCallback` interface and will be removed in Ray 2.7.
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494)   self._loggers.append(cls(self.config, self.logdir, self.trial))
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) /usr/local/lib/python3.10/site-packages/ray/tune/logger/unified.py:53: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) The `TBXLogger interface is deprecated in favor of the `ray.tune.tensorboardx.TBXLoggerCallback` interface and will be removed in Ray 2.7.
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494)   self._loggers.append(cls(self.config, self.logdir, self.trial))
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) 2023-07-05 21:35:37,985        WARNING algorithm_config.py:2534 -- Setting `exploration_config={}` because you set `_enable_rl_module_api=True`. When RLModule API are enabled, exploration_config can not be set. If you want to implement custom exploration behaviour, please modify the `forward_exploration` method of the RLModule at hand. On configs that have a default exploration config, this must be done with `config.exploration_config={}`.
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) 2023-07-05 21:35:37,993        WARNING deprecation.py:50 -- DeprecationWarning: `ValueNetworkMixin` has been deprecated. This will raise an error in the future!
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) 2023-07-05 21:35:37,993        WARNING deprecation.py:50 -- DeprecationWarning: `LearningRateSchedule` has been deprecated. This will raise an error in the future!
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) 2023-07-05 21:35:37,993        WARNING deprecation.py:50 -- DeprecationWarning: `EntropyCoeffSchedule` has been deprecated. This will raise an error in the future!
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) 2023-07-05 21:35:37,993        WARNING deprecation.py:50 -- DeprecationWarning: `KLCoeffMixin` has been deprecated. This will raise an error in the future!
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) Install gputil for GPU system monitoring.
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) /usr/local/lib/python3.10/site-packages/gymnasium/spaces/box.py:127: UserWarning: WARN: Box bound precision lowered by casting to float32
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494)   logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) 2023-07-05 21:35:38,699        WARNING algorithm_config.py:2534 -- Setting `exploration_config={}` because you set `_enable_rl_module_api=True`. When RLModule API are enabled, exploration_config can not be set. If you want to implement custom exploration behaviour, please modify the `forward_exploration` method of the RLModule at hand. On configs that have a default exploration config, this must be done with `config.exploration_config={}`.
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) 2023-07-05 21:35:38,700        WARNING algorithm_config.py:656 -- Cannot create PPOConfig from given `config_dict`! Property cluster_name not supported.
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) 2023-07-05 21:35:38,702        WARNING policy.py:1065 -- `observation_space` in given policy state (Box(-inf, inf, (3,), float32)) does not match this Policy's observation space (Box(0.0, [3.e+04 2.e+01 3.e+02], (3,), float32)).
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) Restored on 172.17.0.2 from checkpoint: /src/inference_checkpoints/checkpoint_000020
2023-07-05 16:35:38 (ServeReplica:default_ServePPOModel pid=494) Current state after restoring: {'_iteration': 20, '_timesteps_total': None, '_time_total': 319.60047125816345, '_episodes_total': 287}
2023-07-05 16:35:38 2023-07-05 21:35:38,837     INFO router.py:893 -- Using PowerOfTwoChoicesReplicaScheduler.
2023-07-05 16:35:38 2023-07-05 21:35:38,847     INFO router.py:370 -- Got updated replicas for deployment default_ServePPOModel: {'default_ServePPOModel#FdKEJy'}.
2023-07-05 16:35:39 (ServeController pid=435) INFO 2023-07-05 21:35:38,958 controller 435 deployment_state.py:1316 - Deploying new version of deployment default_ServePPOModel.
2023-07-05 16:35:39 2023-07-05 21:35:39,072     INFO router.py:370 -- Got updated replicas for deployment default_ServePPOModel: set().
2023-07-05 16:35:39 (ServeController pid=435) INFO 2023-07-05 21:35:39,068 controller 435 deployment_state.py:1466 - Stopping 1 replicas of deployment 'default_ServePPOModel' with outdated versions.
2023-07-05 16:35:41 (ServeController pid=435) INFO 2023-07-05 21:35:41,160 controller 435 deployment_state.py:1583 - Adding 1 replica to deployment default_ServePPOModel.
2023-07-05 16:35:43 (ServeReplica:default_ServePPOModel pid=576) DeprecationWarning: `DirectStepOptimizer` has been deprecated. This will raise an error in the future!
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) 2023-07-05 21:35:44,115        WARNING algorithm_config.py:2534 -- Setting `exploration_config={}` because you set `_enable_rl_module_api=True`. When RLModule API are enabled, exploration_config can not be set. If you want to implement custom exploration behaviour, please modify the `forward_exploration` method of the RLModule at hand. On configs that have a default exploration config, this must be done with `config.exploration_config={}`.
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) 2023-07-05 21:35:44,116        WARNING algorithm_config.py:2534 -- Setting `exploration_config={}` because you set `_enable_rl_module_api=True`. When RLModule API are enabled, exploration_config can not be set. If you want to implement custom exploration behaviour, please modify the `forward_exploration` method of the RLModule at hand. On configs that have a default exploration config, this must be done with `config.exploration_config={}`.
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) /usr/local/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py:484: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) `UnifiedLogger` will be removed in Ray 2.7.
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576)   return UnifiedLogger(config, logdir, loggers=None)
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) /usr/local/lib/python3.10/site-packages/ray/tune/logger/unified.py:53: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) The `JsonLogger interface is deprecated in favor of the `ray.tune.json.JsonLoggerCallback` interface and will be removed in Ray 2.7.
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576)   self._loggers.append(cls(self.config, self.logdir, self.trial))
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) /usr/local/lib/python3.10/site-packages/ray/tune/logger/unified.py:53: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) The `CSVLogger interface is deprecated in favor of the `ray.tune.csv.CSVLoggerCallback` interface and will be removed in Ray 2.7.
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576)   self._loggers.append(cls(self.config, self.logdir, self.trial))
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) /usr/local/lib/python3.10/site-packages/ray/tune/logger/unified.py:53: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) The `TBXLogger interface is deprecated in favor of the `ray.tune.tensorboardx.TBXLoggerCallback` interface and will be removed in Ray 2.7.
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576)   self._loggers.append(cls(self.config, self.logdir, self.trial))
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) 2023-07-05 21:35:44,141        WARNING algorithm_config.py:2534 -- Setting `exploration_config={}` because you set `_enable_rl_module_api=True`. When RLModule API are enabled, exploration_config can not be set. If you want to implement custom exploration behaviour, please modify the `forward_exploration` method of the RLModule at hand. On configs that have a default exploration config, this must be done with `config.exploration_config={}`.
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) 2023-07-05 21:35:44,152        WARNING deprecation.py:50 -- DeprecationWarning: `ValueNetworkMixin` has been deprecated. This will raise an error in the future!
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) 2023-07-05 21:35:44,152        WARNING deprecation.py:50 -- DeprecationWarning: `LearningRateSchedule` has been deprecated. This will raise an error in the future!
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) 2023-07-05 21:35:44,152        WARNING deprecation.py:50 -- DeprecationWarning: `EntropyCoeffSchedule` has been deprecated. This will raise an error in the future!
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) 2023-07-05 21:35:44,152        WARNING deprecation.py:50 -- DeprecationWarning: `KLCoeffMixin` has been deprecated. This will raise an error in the future!
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) Install gputil for GPU system monitoring.
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) /usr/local/lib/python3.10/site-packages/gymnasium/spaces/box.py:127: UserWarning: WARN: Box bound precision lowered by casting to float32
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576)   logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) 2023-07-05 21:35:44,826        WARNING algorithm_config.py:2534 -- Setting `exploration_config={}` because you set `_enable_rl_module_api=True`. When RLModule API are enabled, exploration_config can not be set. If you want to implement custom exploration behaviour, please modify the `forward_exploration` method of the RLModule at hand. On configs that have a default exploration config, this must be done with `config.exploration_config={}`.
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) 2023-07-05 21:35:44,827        WARNING algorithm_config.py:656 -- Cannot create PPOConfig from given `config_dict`! Property cluster_name not supported.
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) 2023-07-05 21:35:44,829        WARNING policy.py:1065 -- `observation_space` in given policy state (Box(-inf, inf, (3,), float32)) does not match this Policy's observation space (Box(0.0, [3.e+04 2.e+01 3.e+02], (3,), float32)).
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) Restored on 172.17.0.2 from checkpoint: /src/inference_checkpoints/checkpoint_000020
2023-07-05 16:35:44 (ServeReplica:default_ServePPOModel pid=576) Current state after restoring: {'_iteration': 20, '_timesteps_total': None, '_time_total': 319.60047125816345, '_episodes_total': 287}
2023-07-05 16:35:44 2023-07-05 21:35:44,904     INFO router.py:370 -- Got updated replicas for deployment default_ServePPOModel: {'default_ServePPOModel#RUgRhr'}.
2023-07-05 16:36:32 (pid=gcs_server) [2023-07-05 21:36:32,132 E 36 36] (gcs_server) gcs_job_manager.cc:227: Failed to get is_running_tasks from core worker: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: 

GeneDer commented 1 year ago

@jayanthnair Interesting. Maybe it's able to deploy to the head node but fails to connect to the worker node. Can you try adding `RUN ray start --head` to the Dockerfile right before the Serve run command? I think this should start just a Ray head node and have Serve deploy onto that only.

jayanthnair commented 1 year ago

@GeneDer Seems like it can't connect to the head node.

# Error message
2023-07-05 16:46:49 2023-07-05 21:46:49,345     WARNING deprecation.py:50 -- DeprecationWarning: `DirectStepOptimizer` has been deprecated. This will raise an error in the future!
2023-07-05 16:46:49 /usr/local/lib/python3.10/site-packages/gymnasium/spaces/box.py:127: UserWarning: WARN: Box bound precision lowered by casting to float32
2023-07-05 16:46:49   logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
2023-07-05 16:46:49 2023-07-05 21:46:49,643     INFO worker.py:1429 -- Connecting to existing Ray cluster at address: 172.17.0.2:6379...
2023-07-05 16:46:54 2023-07-05 21:46:54,656     ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-07-05 16:46:54 2023-07-05 21:46:54,656     WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 172.17.0.2:6379. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.
2023-07-05 16:47:01 2023-07-05 21:47:01,671     ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-07-05 16:47:01 2023-07-05 21:47:01,671     WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 172.17.0.2:6379. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.
2023-07-05 16:47:08 2023-07-05 21:47:08,687     ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-07-05 16:47:08 2023-07-05 21:47:08,687     WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 172.17.0.2:6379. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.
2023-07-05 16:47:15 2023-07-05 21:47:15,701     ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-07-05 16:47:15 2023-07-05 21:47:15,701     WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 172.17.0.2:6379. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.
2023-07-05 16:47:22 2023-07-05 21:47:22,714     ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-07-05 16:47:22 2023-07-05 21:47:22,714     WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 172.17.0.2:6379. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.
2023-07-05 16:47:29 2023-07-05 21:47:29,727     ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-07-05 16:47:29 2023-07-05 21:47:29,727     WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 172.17.0.2:6379. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.
2023-07-05 16:47:36 2023-07-05 21:47:36,739     ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-07-05 16:47:36 2023-07-05 21:47:36,739     WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 172.17.0.2:6379. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.
2023-07-05 16:47:43 2023-07-05 21:47:43,752     ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-07-05 16:47:43 2023-07-05 21:47:43,752     WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 172.17.0.2:6379. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.
2023-07-05 16:47:50 2023-07-05 21:47:50,768     ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-07-05 16:47:50 2023-07-05 21:47:50,768     WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 172.17.0.2:6379. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.
2023-07-05 16:47:57 2023-07-05 21:47:57,784     ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-07-05 16:47:57 2023-07-05 21:47:57,785     WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 172.17.0.2:6379. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.
2023-07-05 16:48:04 2023-07-05 21:48:04,799     ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-07-05 16:48:04 2023-07-05 21:48:04,799     WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 172.17.0.2:6379. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.
2023-07-05 16:48:11 2023-07-05 21:48:11,814     ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-07-05 16:48:11 2023-07-05 21:48:11,815     WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 172.17.0.2:6379. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.
2023-07-05 16:48:18 2023-07-05 21:48:18,830     ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-07-05 16:48:18 2023-07-05 21:48:18,830     WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 172.17.0.2:6379. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.
2023-07-05 16:48:25 2023-07-05 21:48:25,842     ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-07-05 16:48:25 2023-07-05 21:48:25,842     WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 172.17.0.2:6379. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.
2023-07-05 16:48:32 2023-07-05 21:48:32,857     ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-07-05 16:48:32 2023-07-05 21:48:32,857     WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 172.17.0.2:6379. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.
2023-07-05 16:48:39 2023-07-05 21:48:39,872     ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-07-05 16:48:39 2023-07-05 21:48:39,872     WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 172.17.0.2:6379. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.
2023-07-05 16:48:46 2023-07-05 21:48:46,885     ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-07-05 16:48:46 2023-07-05 21:48:46,885     WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 172.17.0.2:6379. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.
2023-07-05 16:48:53 2023-07-05 21:48:53,900     ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-07-05 16:48:53 2023-07-05 21:48:53,900     WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 172.17.0.2:6379. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.
2023-07-05 16:49:00 2023-07-05 21:49:00,918     ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-07-05 16:49:00 2023-07-05 21:49:00,919     WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 172.17.0.2:6379. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.
2023-07-05 16:49:07 2023-07-05 21:49:07,933     ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-07-05 16:49:07 2023-07-05 21:49:07,934     WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 172.17.0.2:6379. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.
2023-07-05 16:49:09 2023-07-05 21:49:09,936     INFO worker.py:1584 -- Failed to connect to the default Ray cluster address at 172.17.0.2:6379. This is most likely due to a previous Ray instance that has since crashed. To reset the default address to connect to, run `ray stop` or restart Ray with `ray start`.
2023-07-05 16:49:09 Traceback (most recent call last):
2023-07-05 16:49:09   File "/usr/local/bin/serve", line 8, in <module>
2023-07-05 16:49:09     sys.exit(cli())
2023-07-05 16:49:09   File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
2023-07-05 16:49:09 2023-07-05 21:46:47,649     INFO scripts.py:407 -- Running import path: 'serve_agent:agent'.
2023-07-05 16:49:09     return self.main(*args, **kwargs)
2023-07-05 16:49:09   File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1055, in main
2023-07-05 16:49:09     rv = self.invoke(ctx)
2023-07-05 16:49:09   File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
2023-07-05 16:49:09     return _process_result(sub_ctx.command.invoke(sub_ctx))
2023-07-05 16:49:09   File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
2023-07-05 16:49:09     return ctx.invoke(self.callback, **ctx.params)
2023-07-05 16:49:09   File "/usr/local/lib/python3.10/site-packages/click/core.py", line 760, in invoke
2023-07-05 16:49:09     return __callback(*args, **kwargs)
2023-07-05 16:49:09   File "/usr/local/lib/python3.10/site-packages/ray/serve/scripts.py", line 409, in run
2023-07-05 16:49:09     import_attr(import_path), args_dict
2023-07-05 16:49:09   File "/usr/local/lib/python3.10/site-packages/ray/_private/utils.py", line 1190, in import_attr
2023-07-05 16:49:09     module = importlib.import_module(module_name)
2023-07-05 16:49:09   File "/usr/local/lib/python3.10/importlib/__init__.py", line 126, in import_module
2023-07-05 16:49:09     return _bootstrap._gcd_import(name[level:], package, level)
2023-07-05 16:49:09   File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
2023-07-05 16:49:09   File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
2023-07-05 16:49:09   File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
2023-07-05 16:49:09   File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
2023-07-05 16:49:09   File "<frozen importlib._bootstrap_external>", line 883, in exec_module
2023-07-05 16:49:09   File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
2023-07-05 16:49:09   File "/src/./serve_agent.py", line 58, in <module>
2023-07-05 16:49:09     serve.run(agent)
2023-07-05 16:49:09   File "/usr/local/lib/python3.10/site-packages/ray/serve/api.py", line 447, in run
2023-07-05 16:49:09     client = _private_api.serve_start(
2023-07-05 16:49:09   File "/usr/local/lib/python3.10/site-packages/ray/serve/_private/api.py", line 299, in serve_start
2023-07-05 16:49:09     client = get_global_client(_health_check_controller=True)
2023-07-05 16:49:09   File "/usr/local/lib/python3.10/site-packages/ray/serve/context.py", line 59, in get_global_client
2023-07-05 16:49:09     return _connect()
2023-07-05 16:49:09   File "/usr/local/lib/python3.10/site-packages/ray/serve/context.py", line 105, in _connect
2023-07-05 16:49:09     ray.init(namespace=SERVE_NAMESPACE)
2023-07-05 16:49:09   File "/usr/local/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
2023-07-05 16:49:09     return func(*args, **kwargs)
2023-07-05 16:49:09   File "/usr/local/lib/python3.10/site-packages/ray/_private/worker.py", line 1575, in init
2023-07-05 16:49:09     _global_node = ray._private.node.Node(
2023-07-05 16:49:09   File "/usr/local/lib/python3.10/site-packages/ray/_private/node.py", line 186, in __init__
2023-07-05 16:49:09     session_name = ray._private.utils.internal_kv_get_with_retry(
2023-07-05 16:49:09   File "/usr/local/lib/python3.10/site-packages/ray/_private/utils.py", line 1412, in internal_kv_get_with_retry
2023-07-05 16:49:09     raise ConnectionError(
2023-07-05 16:49:09 ConnectionError: Could not read 'session_name' from GCS. Did GCS start successfully?

Also Dockerfile for reference:

# Dockerfile

FROM python:3.10.11
# Install libraries and dependencies
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    && rm -rf /var/lib/apt/lists/*

RUN apt-get update && \
    apt-get install -y --no-install-recommends

WORKDIR /src

COPY requirements.txt /src
RUN pip3 install -r requirements.txt

COPY . /src

WORKDIR /src
RUN ["ray", "start", "--head"]
CMD ["serve", "run", "serve_agent:agent"]
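
A possible explanation for why the `RUN ray start --head` line does not help: Docker executes `RUN` instructions while the image is being built, so the head process no longer exists when the container starts, while any files Ray wrote during the build stay in the image. That would be consistent with the log above, where Ray tries an "existing" cluster at 172.17.0.2:6379 and never reaches GCS. The sketch below is a small standard-library check, meant to be run inside the container, for whether anything is listening at that address. The helper name `can_reach` is made up for illustration and is not part of Ray.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Address taken from the log above; adjust it to whatever Ray reports.
    gcs_host, gcs_port = "172.17.0.2", 6379
    if can_reach(gcs_host, gcs_port):
        print(f"Something is listening at {gcs_host}:{gcs_port}")
    else:
        print(f"Nothing reachable at {gcs_host}:{gcs_port}; the recorded address may be stale")
```

If nothing is reachable, one thing to try is dropping the `RUN ray start --head` line so that `serve run` starts its own local Ray instance when the container starts.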

GeneDer commented 1 year ago

Hmm, this is also interesting: it looks like there are already Ray instances running. Maybe try `RUN ray stop && ray start --head`?

jayanthnair commented 1 year ago

Still seeing something similar. I've tried both

docker run -p 8000:8000 rl-agent

and

docker run rl-agent

Error message

$ docker run rl-agent
2023-07-05 21:59:12,880 WARNING deprecation.py:50 -- DeprecationWarning: `DirectStepOptimizer` has been deprecated. This will raise an error in the future!
/usr/local/lib/python3.10/site-packages/gymnasium/spaces/box.py:127: UserWarning: WARN: Box bound precision lowered by casting to float32
  logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
2023-07-05 21:59:13,187 INFO worker.py:1429 -- Connecting to existing Ray cluster at address: 172.17.0.3:6379...
2023-07-05 21:59:18,799 ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-07-05 21:59:18,800 WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 172.17.0.3:6379. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.
2023-07-05 21:59:26,314 ERROR utils.py:1395 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-07-05 21:59:26,314 WARNING utils.py:1401 -- Unable to connect to GCS (ray head) at 172.17.0.3:6379. Check that (1) Ray with matching version started successfully at the specified address, (2) this node can reach the specified address, and (3) there is no firewall setting preventing access.

GeneDer commented 1 year ago

@jayanthnair so I got something running

# Dockerfile
FROM python:3.10.11
# Install libraries and dependencies
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    && rm -rf /var/lib/apt/lists/*

RUN apt-get update && \
    apt-get install -y --no-install-recommends

WORKDIR /src

COPY requirements.txt /src
RUN pip install --upgrade pip
RUN pip install -r requirements.txt

COPY . /src

WORKDIR /src

CMD ["serve", "run", "serve_agent:agent"]

# requirements.txt
gymnasium==0.26.3
numpy==1.24.3
pandas==2.0.2
ray[data,rllib,serve] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp310-cp310-manylinux2014_aarch64.whl
torch==2.0.1
starlette==0.27.0
dm-tree==0.1.8
# azureml-mlflow
# azureml-defaults

# serve_agent.py
import ray.rllib.algorithms.ppo as ppo
from pathlib import Path
from ray import serve
from starlette.requests import Request

folder_path = "checkpoint_000001"
PATH_TO_CHECKPOINT = Path(__file__).absolute().parent / "rllib_checkpoint" / folder_path

@serve.deployment
class ServePPOModel:
    def __init__(self, checkpoint_path) -> None:
        # Re-create the originally used config.
        config = ppo.PPOConfig() \
            .framework("torch") \
            .rollouts(num_rollout_workers=0)
        # Build the Algorithm instance using the config.
        self.algorithm = config.build(env="CartPole-v0")
        # Restore the algo's state from the checkpoint.
        self.algorithm.restore(checkpoint_path)

    async def __call__(self, request: Request):
        json_input = await request.json()
        obs = json_input["observation"]

        action = self.algorithm.compute_single_action(obs)
        return {"action": int(action)}

agent = ServePPOModel.bind(PATH_TO_CHECKPOINT)
serve.run(agent)

# client.py
# Note: `gymnasium` (not `gym`) will be **the** API supported by RLlib from Ray 2.3 on.
try:
    import gymnasium as gym
    gymnasium = True
except Exception:
    import gym
    gymnasium = False

import requests

env = gym.make("CartPole-v1")

for _ in range(5):
    if gymnasium:
        obs, infos = env.reset()
    else:
        obs = env.reset()
    print(f"-> Sending observation {obs}")
    resp = requests.get(
        "http://localhost:8000/", json={"observation": obs.tolist()}
    )
    print(f"<- Received response {resp.json()}")

obs, infos = env.reset()
print(f"-> Sending observation {obs}")
resp = requests.get(
    "http://localhost:8000/", json={"observation": obs.tolist()}
)
print(f"<- Received response {resp.json()}")

File structure looks like this

- /rllib_checkpoint
    - /checkpoint_000001
- client.py
- Dockerfile
- requirements.txt
- serve_agent.py

I was able to see the success response when I ran the following

Can you try those and let me know if this works?

jayanthnair commented 1 year ago

Hi @GeneDer, when I follow these steps the Serve deployment works properly. As I understand it, the served agent can then be queried for responses on port 8000 from within the Docker container. But my end goal is to query it from outside the Docker container.

When I publish port 8000 of the container to port 8000 of my local machine and try to query it, I get a ConnectionError like the one below.

# docker run command
docker run -it --rm -d -p 8000:8000 rl-agent:latest
# error message

Jayanth.Nair@HZXS1Z2 MINGW64 ~/Desktop/drl_workflow/drl_working_group (deploytest)
$ python client.py
2023-07-06 09:18:31,966 WARNING deprecation.py:50 -- DeprecationWarning: `DirectStepOptimizer` has been deprecated. This will raise an error in the future!
{'start_level': 1, 'obs_normalization': True, 'reward_normalization': False, 'states': {'volume': [0, 30000], 'ca': [0, 200], 'rtime': [0, 500]}, 'actions': {'qa': [0, 10], 'qs': [0, 200]}, 'configvars': {'qout': [75, 250]}}
C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\gymnasium\spaces\box.py:127: UserWarning: WARN: Box bound precision lowered by casting to float32
  logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
-> Sending observation [-0.2        -0.94652086 -0.7159825 ]
Traceback (most recent call last):
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\urllib3\connectionpool.py", line 714, in urlopen
    httplib_response = self._make_request(
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\urllib3\connectionpool.py", line 466, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\urllib3\connectionpool.py", line 461, in _make_request
    httplib_response = conn.getresponse()
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\http\client.py", line 1375, in getresponse
    response.begin()
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\http\client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\http\client.py", line 287, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\requests\adapters.py", line 486, in send
    resp = conn.urlopen(
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\urllib3\connectionpool.py", line 798, in urlopen
    retries = retries.increment(
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\urllib3\util\retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\urllib3\packages\six.py", line 769, in reraise
    raise value.with_traceback(tb)
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\urllib3\connectionpool.py", line 714, in urlopen
    httplib_response = self._make_request(
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\urllib3\connectionpool.py", line 466, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\urllib3\connectionpool.py", line 461, in _make_request
    httplib_response = conn.getresponse()
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\http\client.py", line 1375, in getresponse
    response.begin()
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\http\client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\http\client.py", line 287, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\jAYANTH.NAIR\Desktop\drl_workflow\drl_working_group\client.py", line 24, in <module>
    resp = requests.get(
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\requests\api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\requests\api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\requests\sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\requests\sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\requests\adapters.py", line 501, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

It seems the server is actively closing the connection?

GeneDer commented 1 year ago

@jayanthnair I think I found the issue. It seems that setting the host to 0.0.0.0 is required for Docker to expose the service publicly. I changed the last line of the Dockerfile to:

ENTRYPOINT [ "ray", "start", "--head", "--port=6379", "--redis-shard-ports=6380,6381", "--object-manager-port=22345", "--node-manager-port=22346", "--dashboard-host=0.0.0.0", "--block" ]

And run the following docker commands:

Let me know if that helps! I will continue to debug why serve run did not pass the host correctly, and will possibly file a separate bug ticket for it.

Related question: https://discuss.ray.io/t/what-is-best-practice-for-local-setup/6507
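
For anyone who wants to sanity-check a host setting before rebuilding the image, here is a minimal sketch of the distinction described above. It uses only the Python standard library, and `describe_bind_host` is a hypothetical helper name, not part of Ray or Docker. The assumption is that a server bound to a loopback address inside a container only accepts connections originating inside that container, while the unspecified address (0.0.0.0 or ::) listens on all interfaces.

```python
import ipaddress


def describe_bind_host(host: str) -> str:
    """Classify a bind address (hypothetical helper, for illustration only).

    Returns one of:
      "loopback"           - only reachable from the same network namespace
      "all-interfaces"     - listens on every interface (0.0.0.0 or ::)
      "specific-interface" - listens on one concrete address
    """
    # "localhost" is a name rather than an IP literal; treat it as loopback.
    if host == "localhost":
        return "loopback"

    addr = ipaddress.ip_address(host)
    if addr.is_unspecified:
        return "all-interfaces"
    if addr.is_loopback:
        return "loopback"
    return "specific-interface"


if __name__ == "__main__":
    for candidate in ("127.0.0.1", "0.0.0.0", "localhost"):
        print(candidate, "->", describe_bind_host(candidate))
```

If the value passed as the host is classified as "loopback", requests sent from outside the container are unlikely to reach the server even when the port is published.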

jayanthnair commented 1 year ago

Thanks a lot @GeneDer, this has solved the issue! I will also file a separate bug ticket as you suggested.

lyzyn commented 1 year ago

I have also encountered the issue in your error report. Have you resolved it?

2023-07-05 12:54:10 (ServeReplica:default_ServePPOModel pid=494) 2023-07-05 17:54:10,180 WARNING algorithm_config.py:2534 -- Setting exploration_config={} because you set _enable_rl_module_api=True. When RLModule API are enabled, exploration_config can not be set. If you want to implement custom exploration behaviour, please modify the forward_exploration method of the RLModule at hand. On configs that have a default exploration config, this must be done with config.exploration_config={}.

GeneDer commented 1 year ago

@lyzyn I'm not sure I have enough context for your issue. Can you file a separate issue and fill in the details? Also, from the way you described it, it seems to be an RLlib issue rather than a Serve issue.