ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[<Ray component: Serve>] Ray Serve not passing the host correctly #37151

Closed jayanthnair closed 1 year ago

jayanthnair commented 1 year ago

What happened + What you expected to happen

Following up on #37107: I am deploying a trained DRL agent as a Docker container and querying it for responses. I create the container with the Dockerfile below:

# Dockerfile
FROM python:3.10.11
# Install libraries and dependencies
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    && rm -rf /var/lib/apt/lists/*

RUN apt-get update && \
    apt-get install -y --no-install-recommends

WORKDIR /src

COPY requirements.txt /src
RUN pip install --upgrade pip
RUN pip install -r requirements.txt

COPY . /src

WORKDIR /src

CMD ["serve", "run", "serve_agent:agent"]

Build and run it like below:

docker build . -t rl-agent:latest
docker run -it --rm -d -p 8000:8000 rl-agent:latest

When I query it using cURL commands or requests from my local machine, I get a ConnectionError:

# error message

Jayanth.Nair@HZXS1Z2 MINGW64 ~/Desktop/drl_workflow/drl_working_group (deploytest)
$ python client.py
2023-07-06 09:18:31,966 WARNING deprecation.py:50 -- DeprecationWarning: `DirectStepOptimizer` has been deprecated. This will raise an error in the future!
{'start_level': 1, 'obs_normalization': True, 'reward_normalization': False, 'states': {'volume': [0, 30000], 'ca': [0, 200], 'rtime': [0, 500]}, 'actions': {'qa': [0, 10], 'qs': [0, 200]}, 'configvars': {'qout': [75, 250]}}
C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\gymnasium\spaces\box.py:127: UserWarning: WARN: Box bound precision lowered by casting to float32
  logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
-> Sending observation [-0.2        -0.94652086 -0.7159825 ]
Traceback (most recent call last):
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\urllib3\connectionpool.py", line 714, in urlopen
    httplib_response = self._make_request(
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\urllib3\connectionpool.py", line 466, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\urllib3\connectionpool.py", line 461, in _make_request
    httplib_response = conn.getresponse()
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\http\client.py", line 1375, in getresponse
    response.begin()
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\http\client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\http\client.py", line 287, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\requests\adapters.py", line 486, in send
    resp = conn.urlopen(
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\urllib3\connectionpool.py", line 798, in urlopen
    retries = retries.increment(
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\urllib3\util\retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\urllib3\packages\six.py", line 769, in reraise
    raise value.with_traceback(tb)
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\urllib3\connectionpool.py", line 714, in urlopen
    httplib_response = self._make_request(
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\urllib3\connectionpool.py", line 466, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\urllib3\connectionpool.py", line 461, in _make_request
    httplib_response = conn.getresponse()
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\http\client.py", line 1375, in getresponse
    response.begin()
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\http\client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\http\client.py", line 287, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\jAYANTH.NAIR\Desktop\drl_workflow\drl_working_group\client.py", line 24, in <module>
    resp = requests.get(
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\requests\api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\requests\api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\requests\sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\requests\sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\raynightly\lib\site-packages\requests\adapters.py", line 501, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

Following @GeneDer's recommendation, linked [here](https://github.com/ray-project/ray/issues/37107#issuecomment-1623994721), was the only way I was able to query the container for responses. The expected behavior is for the original Dockerfile and build process to be sufficient to query the container.

Versions / Dependencies

adal==1.2.7
aiohttp==3.8.4
aiohttp-cors==0.7.0
aiorwlock==1.3.0
aiosignal==1.3.1
anyio==3.7.0
argcomplete==2.1.2
async-timeout==4.0.2
attrs==23.1.0
azure-common==1.1.28
azure-core==1.27.1
azure-graphrbac==0.61.1
azure-identity==1.13.0
azure-mgmt-authorization==3.0.0
azure-mgmt-containerregistry==10.1.0
azure-mgmt-core==1.4.0
azure-mgmt-keyvault==10.2.2
azure-mgmt-resource==21.2.1
azure-mgmt-storage==20.1.0
azure-storage-blob==12.13.0
azureml-core==1.48.0
azureml-dataprep==4.8.6
azureml-dataprep-native==38.0.0
azureml-dataprep-rslex==2.15.2
azureml-dataset-runtime==1.48.0
azureml-defaults==1.48.0
azureml-inference-server-http==0.7.7
azureml-mlflow==1.52.0
backports.tempfile==1.0
backports.weakref==1.0.post1
bcrypt==4.0.1
blessed==1.20.0
blinker==1.6.2
cachetools==5.3.1
certifi==2023.5.7
cffi==1.15.1
charset-normalizer==3.1.0
click==8.1.3
cloudpickle==2.2.1
cmake==3.26.4
colorful==0.5.5
contextlib2==21.6.0
cryptography==38.0.4
databricks-cli==0.17.7
distlib==0.3.6
distro==1.8.0
dm-tree==0.1.8
docker==6.1.3
dotnetcore2==3.1.23
entrypoints==0.4
exceptiongroup==1.1.2
fastapi==0.99.1
filelock==3.12.2
Flask==2.3.2
Flask-Cors==3.0.10
frozenlist==1.3.3
fsspec==2023.6.0
fusepy==3.0.1
gitdb==4.0.10
GitPython==3.1.31
google-api-core==2.11.1
google-auth==2.21.0
googleapis-common-protos==1.59.1
gpustat==1.1
grpcio==1.56.0
gunicorn==20.1.0
Gymnasium==0.26.3
gymnasium-notices==0.0.1
h11==0.14.0
humanfriendly==10.0
idna==3.4
imageio==2.31.1
importlib-metadata==6.7.0
inference-schema==1.5.1
isodate==0.6.1
itsdangerous==2.1.2
jeepney==0.8.0
Jinja2==3.1.2
jmespath==1.0.1
jsonpickle==2.2.0
jsonschema==4.17.3
knack==0.10.1
lazy_loader==0.3
lit==16.0.6
lz4==4.3.2
markdown-it-py==3.0.0
MarkupSafe==2.1.3
mdurl==0.1.2
mlflow-skinny==2.4.1
mpmath==1.3.0
msal==1.22.0
msal-extensions==1.0.0
msgpack==1.0.5
msrest==0.7.1
msrestazure==0.6.4
multidict==6.0.4
ndg-httpsclient==0.5.1
networkx==3.1
numpy==1.24.3
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.2.10.91
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusparse-cu11==11.7.4.91
nvidia-ml-py==11.525.131
nvidia-nccl-cu11==2.14.3
nvidia-nvtx-cu11==11.7.91
oauthlib==3.2.2
opencensus==0.11.2
opencensus-context==0.1.3
opencensus-ext-azure==1.1.9
packaging==21.3
pandas==2.0.2
paramiko==2.12.0
pathspec==0.11.1
Pillow==10.0.0
pkginfo==1.9.6
platformdirs==3.8.0
portalocker==2.7.0
prometheus-client==0.17.0
protobuf==4.23.3
psutil==5.9.5
py-spy==0.3.14
pyarrow==9.0.0
pyasn1==0.5.0
pyasn1-modules==0.3.0
pycparser==2.21
pydantic==1.10.11
Pygments==2.15.1
PyJWT==2.7.0
PyNaCl==1.5.0
pyOpenSSL==22.1.0
pyparsing==3.1.0
pyrsistent==0.19.3
PySocks==1.7.1
python-dateutil==2.8.2
pytz==2023.3
PyWavelets==1.4.1
PyYAML==6.0
ray @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp310-cp310-manylinux2014_x86_64.whl
requests==2.31.0
requests-oauthlib==1.3.1
rich==13.4.2
rsa==4.9
scikit-image==0.21.0
scipy==1.11.1
SecretStorage==3.3.3
six==1.16.0
smart-open==6.3.0
smmap==5.0.0
sniffio==1.3.0
sqlparse==0.4.4
starlette==0.27.0
sympy==1.12
tabulate==0.9.0
tensorboardX==2.6.1
tifffile==2023.7.4
torch==2.0.1
triton==2.0.0
typer==0.9.0
typing_extensions==4.7.1
tzdata==2023.3
urllib3==1.26.16
uvicorn==0.22.0
virtualenv==20.21.0
wcwidth==0.2.6
websocket-client==1.6.1
Werkzeug==2.3.6
wrapt==1.12.1
yarl==1.9.2
zipp==3.15.0

Reproduction script

#serve_agent.py

from starlette.requests import Request
import ray.rllib.algorithms.ppo as ppo
from ray import serve
import gymnasium.spaces as spaces
import numpy as np
from pathlib import Path

folder_path = "checkpoint_000020"
PATH_TO_CHECKPOINT = Path(__file__).absolute().parent / "inference_checkpoints" / folder_path

# Update this to reflect observation and action space in environment
observation_space = spaces.Box(
            low=np.array([0, 0, 0]),
            high=np.array([30000, 20, 300]),
            shape=(3,),
            dtype=np.float32,
        )
action_space = spaces.Box(
            low=np.array([0, 0]),
            high=np.array([10, 200]),
            shape=(2,),
            dtype=np.float32,
        )
@serve.deployment
class ServePPOModel:
    """
    Class which defines how the model is served

    args:
        - checkpoint_path (str): path to the checkpoint

    """
    def __init__(self, checkpoint_path) -> None:
        # Re-create the originally used config. - NOTE this assumes PPO, update for a different algorithm as needed
        config = ppo.PPOConfig()\
            .framework("torch")\
            .rollouts(num_rollout_workers=0)

        # Build the Algorithm instance using the config. env=None since it is not needed for inference
        self.algorithm = config.environment(env=None,observation_space=observation_space,action_space=action_space).build()
        # self.algorithm = config.build(env="mixing_environment")
        # Restore the algorithm state from the checkpoint
        self.algorithm.restore(checkpoint_path)
        print('restored!')

    async def __call__(self, request: Request):
        json_input = await request.json()
        obs = json_input["observation"] #observation is the key, the list of states are the value in the dictionary we send as data
        action = self.algorithm.compute_single_action(obs)
        return {"action": action}

agent = ServePPOModel.bind(PATH_TO_CHECKPOINT)
serve.run(agent)
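
For reference, a minimal client sketch for querying this deployment, using only the standard library. The original client.py is not shown in this issue, so everything here is an assumption: the helper names build_request and query_agent are hypothetical, the URL assumes the default port published by the docker run command above, and the request is sent as a POST with a JSON body containing the "observation" key that __call__ reads.

```python
import json
import urllib.request


def build_request(observation, url="http://localhost:8000/"):
    """Build a POST request carrying the observation as JSON.

    The "observation" key matches what ServePPOModel.__call__ reads.
    The URL is an assumption (default Serve port, root route).
    """
    body = json.dumps({"observation": list(observation)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def query_agent(observation, url="http://localhost:8000/", timeout=10.0):
    """Send the observation and return the decoded JSON response."""
    request = build_request(observation, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling query_agent([-0.2, -0.94652086, -0.7159825]) against a container that is only listening on 127.0.0.1 internally would be expected to fail with a connection error similar to the one above.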

Issue Severity

Medium: It is a significant difficulty but I can work around it.

GeneDer commented 1 year ago

@jayanthnair Finally got to the root of this. Everything was working as intended; this is not a bug. The serve run command passes the host correctly if no Ray Serve instance has already been started. The issue is that serve_agent.py itself calls serve.run() without a host argument, so Ray Serve starts on the default host, 127.0.0.1. Since Ray Serve has already been started by the script, the serve run command won't restart it on host 0.0.0.0, and it remains private.

Can you try it again with the following two changes:

  • change the last line of the Dockerfile to ENTRYPOINT ["serve", "run", "--host", "0.0.0.0", "serve_agent:agent"] so the serve run command can start Ray Serve publicly
  • remove the last line of serve_agent.py that runs serve.run() so Serve won't be started on localhost only

Let me know if this helps solve the issue 🙂
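
The difference between the two bind addresses can be illustrated without Ray. The sketch below is not Ray Serve code; it uses plain sockets, and the helper name bind_listener is hypothetical. A listener bound to 127.0.0.1 only accepts connections arriving on the loopback interface, which inside a container generally means traffic forwarded from a published port cannot reach it. A listener bound to 0.0.0.0 accepts connections on every IPv4 interface.

```python
import socket


def bind_listener(host: str) -> socket.socket:
    """Open a TCP listener on `host`, letting the OS pick a free port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))  # port 0 asks the OS for any free port
    sock.listen(1)
    return sock


# Loopback only: reachable solely from inside the same network namespace.
private = bind_listener("127.0.0.1")
# All IPv4 interfaces: reachable through a published container port.
public = bind_listener("0.0.0.0")

print("private listener:", private.getsockname())
print("public listener: ", public.getsockname())

private.close()
public.close()
```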

jayanthnair commented 1 year ago

Solved! Thanks again @GeneDer

ywq2023 commented 5 months ago

> @jayanthnair Finally got to the root of this. So everything was working as intended. This is not a bug. The host will be passed by the serve run command correct if there are no previously started service. The issue is that in the serve_agent.py file, it already called serve.run() without a host argument. So the default behavior is to start Ray Serve on 127.0.0.1. Since Ray Serve is already started by the script, the serve run command won't restart it on host 0.0.0.0 and it remains private.
>
> Can you try it again with the following two things:
>
>   • change the last line of Dockerfile to ENTRYPOINT ["serve", "run", "--host", "0.0.0.0", "serve_agent:agent"] so serve run command can start Ray Serve publicly
>   • remove the last line of serve_agent.py that runs serve.run() so serve won't be started on local only
>
> Let me know if this helps solving the issue 🙂

I have the same problem, but I get "Error: No such option: --host" when I run serve run --host "0.0.0.0" demo:app.

jbohnslav commented 3 months ago

Yes, I think this is a regression. Did we take out the option to pass --host?

grandkarabas commented 1 month ago

>> @jayanthnair Finally got to the root of this. So everything was working as intended. This is not a bug. The host will be passed by the serve run command correct if there are no previously started service. The issue is that in the serve_agent.py file, it already called serve.run() without a host argument. So the default behavior is to start Ray Serve on 127.0.0.1. Since Ray Serve is already started by the script, the serve run command won't restart it on host 0.0.0.0 and it remains private. Can you try it again with the following two things:
>>
>>   • change the last line of Dockerfile to ENTRYPOINT ["serve", "run", "--host", "0.0.0.0", "serve_agent:agent"] so serve run command can start Ray Serve publicly
>>   • remove the last line of serve_agent.py that runs serve.run() so serve won't be started on local only
>>
>> Let me know if this helps solving the issue 🙂
>
> I have the same problem, but I got Error: No such option: --host after i execute serve run --host "0.0.0.0" demo:app.

Please use the following call before the serve.run() call:

serve.start(http_options={"host": "0.0.0.0", "port": 5000})
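
A slightly fuller sketch of that workaround follows. It assumes Ray is installed and that your Ray version still accepts http_options in serve.start(); the helper names public_http_options and run_publicly are hypothetical, and the default port of 8000 is chosen here only to match the docker run command earlier in this thread. Ray is imported inside the function so the module itself can be loaded without it.

```python
def public_http_options(port: int = 8000) -> dict:
    """HTTP options that make the Serve proxy listen on all interfaces."""
    return {"host": "0.0.0.0", "port": port}


def run_publicly(app, port: int = 8000):
    """Start Serve with a public host, then deploy `app`.

    Assumption: `app` is a bound deployment, for example the
    `agent` object defined in serve_agent.py.
    """
    from ray import serve  # imported lazily; requires Ray to be installed

    # The HTTP options must be set before the first serve.run() call,
    # because an already running Serve instance keeps its original host.
    serve.start(http_options=public_http_options(port))
    return serve.run(app)
```

With this in place, the script itself decides the bind address, so it does not depend on a --host flag being available in the serve run CLI.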