[<Ray component: Serve>] ModuleNotFoundError : No module named utils

jayanthnair commented 1 year ago

### What happened + What you expected to happen

I used Ray RLlib and Ray on AML to train an RL agent on AzureML. I downloaded the checkpoints to my local machine and wrote a script to serve the agent, using the template provided here. When I run the script from the command line with `serve run`, I get `ModuleNotFoundError: No module named 'utils'`. However, if I copy and paste the same code into a Jupyter notebook, it runs fine. Console logs below:

```
Jayanth.Nair@HZXS1Z2 MINGW64 ~/Desktop/drl_workflow/drl_working_group/deploy (deploytest)
$ serve run serve_agent:agent
2023-07-03 11:43:20,727 INFO scripts.py:404 -- Running import path: 'serve_agent:agent'.
C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\site-packages\gymnasium\spaces\box.
  logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
2023-07-03 11:43:25,252 INFO worker.py:1452 -- Connecting to existing Ray cluster at addres
2023-07-03 11:43:25,286 INFO worker.py:1627 -- Connected to Ray cluster. View the dashboard
(ServeController pid=3236) INFO 2023-07-03 11:43:32,060 controller 3236 deployment_state.py
(HTTPProxyActor pid=19492) INFO:     Started server process [19492]
(ServeController pid=3236) INFO 2023-07-03 11:43:32,158 controller 3236 deployment_state.py
(ServeReplica:default_ServePPOModel pid=13924) C:\Users\jAYANTH.NAIR\Miniconda3\envs\traini: This API is deprecated and may be removed in future Ray releases. You could suppress this
(ServeReplica:default_ServePPOModel pid=13924) `UnifiedLogger` will be removed in Ray 2.7.
(ServeReplica:default_ServePPOModel pid=13924)   return UnifiedLogger(config, logdir, logge
(ServeReplica:default_ServePPOModel pid=13924) C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainiPI is deprecated and may be removed in future Ray releases. You could suppress this warning
(ServeReplica:default_ServePPOModel pid=13924) The `JsonLogger interface is deprecated in f7.
(ServeReplica:default_ServePPOModel pid=13924)   self._loggers.append(cls(self.config, self
(ServeReplica:default_ServePPOModel pid=13924) C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainiPI is deprecated and may be removed in future Ray releases. You could suppress this warning
(ServeReplica:default_ServePPOModel pid=13924) The `CSVLogger interface is deprecated in fa
(ServeReplica:default_ServePPOModel pid=13924)   self._loggers.append(cls(self.config, self
(ServeReplica:default_ServePPOModel pid=13924) C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainiPI is deprecated and may be removed in future Ray releases. You could suppress this warning
(ServeReplica:default_ServePPOModel pid=13924) The `TBXLogger interface is deprecated in fa
Ray 2.7.
(ServeReplica:default_ServePPOModel pid=13924)   self._loggers.append(cls(self.config, self
(ServeReplica:default_ServePPOModel pid=13924) 2023-07-03 11:43:38,240  INFO algorithm.py:5G' or use the -v and -vv flags.
(ServeReplica:default_ServePPOModel pid=13924) 2023-07-03 11:43:38,255  WARNING util.py:68 
(ServeController pid=3236) ERROR 2023-07-03 11:43:38,387 controller 3236 deployment_state.ped.
(ServeController pid=3236) Traceback (most recent call last):
(ServeController pid=3236)   File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\
(ServeController pid=3236)     _, self._version = ray.get(self._ready_obj_ref)
(ServeController pid=3236)   File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\
(ServeController pid=3236)     return fn(*args, **kwargs)
(ServeController pid=3236)   File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\
(ServeController pid=3236)     return func(*args, **kwargs)
(ServeController pid=3236)   File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\
(ServeController pid=3236)     raise value.as_instanceof_cause()
(ServeController pid=3236) ray.exceptions.RayTaskError(RuntimeError): ray::ServeReplica:def37e7effdddf4cc162dbff401000000, repr=<ray.serve._private.replica.ServeReplica:default_Serve
(ServeController pid=3236)   File "python\ray\_raylet.pyx", line 1434, in ray._raylet.execu
(ServeController pid=3236)   File "python\ray\_raylet.pyx", line 1438, in ray._raylet.execu
(ServeController pid=3236)   File "python\ray\_raylet.pyx", line 1373, in ray._raylet.execu
(ServeController pid=3236)   File "python\ray\_raylet.pyx", line 3609, in ray._raylet.CoreW
(ServeController pid=3236)   File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\
(ServeController pid=3236)     return self.__get_result()
(ServeController pid=3236)   File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\
(ServeController pid=3236)     raise self._exception
(ServeController pid=3236)   File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\
(ServeController pid=3236)     return await method(self, *_args, **_kwargs)
(ServeController pid=3236)   File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\(ServeController pid=3236)     raise RuntimeError(traceback.format_exc()) from None
(ServeController pid=3236) RuntimeError: Traceback (most recent call last):
(ServeController pid=3236)   File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\(ServeController pid=3236)     await self._initialize_replica()
(ServeController pid=3236)   File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\
(ServeController pid=3236)     await sync_to_async(_callable.__init__)(*init_args, **init_k
(ServeController pid=3236)   File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\
(ServeController pid=3236)     return func(*args, **kwargs)
(ServeController pid=3236)   File "C:\Users\jAYANTH.NAIR\Desktop\drl_workflow\drl_working_g
(ServeController pid=3236)     self.algorithm.restore(checkpoint_path)
(ServeController pid=3236)   File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\
(ServeController pid=3236)     metadata = TrainableUtil.load_metadata(checkpoint_dir)
(ServeController pid=3236)   File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\
(ServeController pid=3236)     return pickle.load(f)
(ServeController pid=3236) ModuleNotFoundError: No module named 'utils'
(ServeController pid=3236) INFO 2023-07-03 11:43:38,625 controller 3236 deployment_state.py
(ServeReplica:default_ServePPOModel pid=13228) C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainiPI is deprecated and may be removed in future Ray releases. You could suppress this warningss cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplicuplication for more options.)
(ServeReplica:default_ServePPOModel pid=13228)   self._loggers.append(cls(self.config, self
(ServeReplica:default_ServePPOModel pid=13228)   self._loggers.append(cls(self.config, self
(ServeReplica:default_ServePPOModel pid=13228)   self._loggers.append(cls(self.config, self
(ServeReplica:default_ServePPOModel pid=13228)   self._loggers.append(cls(self.config, self
(ServeReplica:default_ServePPOModel pid=13228)   self._loggers.append(cls(self.config, self
(ServeReplica:default_ServePPOModel pid=13228)   self._loggers.append(cls(self.config, self
(ServeController pid=3236) ERROR 2023-07-03 11:43:44,902 controller 3236 deployment_state.ped.
(ServeController pid=3236) Traceback (most recent call last):
(ServeController pid=3236)   File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\
(ServeController pid=3236)     _, self._version = ray.get(self._ready_obj_ref)
(ServeController pid=3236)   File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\
(ServeController pid=3236)     return fn(*args, **kwargs)
(ServeController pid=3236)   File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\
(ServeController pid=3236)     return func(*args, **kwargs)
(ServeController pid=3236)   File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\
(ServeController pid=3236)     raise value.as_instanceof_cause()
(ServeController pid=3236) ray.exceptions.RayTaskError(RuntimeError): ray::ServeReplica:def544f381e38baeb9131ca0c01000000, repr=<ray.serve._private.replica.ServeReplica:default_Serve
(ServeController pid=3236)   File "python\ray\_raylet.pyx", line 1434, in ray._raylet.execu
(ServeController pid=3236)   File "python\ray\_raylet.pyx", line 1438, in ray._raylet.execu
(ServeController pid=3236)   File "python\ray\_raylet.pyx", line 1373, in ray._raylet.execu
(ServeController pid=3236)   File "python\ray\_raylet.pyx", line 3609, in ray._raylet.CoreW
(ServeController pid=3236)   File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\
(ServeController pid=3236)     return self.__get_result()
(ServeController pid=3236)   File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\
(ServeController pid=3236)     raise self._exception
(ServeController pid=3236)   File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\
(ServeController pid=3236)     return await method(self, *_args, **_kwargs)
(ServeController pid=3236)   File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\(ServeController pid=3236)     raise RuntimeError(traceback.format_exc()) from None
(ServeController pid=3236) RuntimeError: Traceback (most recent call last):
(ServeController pid=3236)   File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\(ServeController pid=3236)     await self._initialize_replica()
(ServeController pid=3236)   File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\
(ServeController pid=3236)     await sync_to_async(_callable.__init__)(*init_args, **init_k
(ServeController pid=3236)   File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\
(ServeController pid=3236)     return func(*args, **kwargs)
(ServeController pid=3236)   File "C:\Users\jAYANTH.NAIR\Desktop\drl_workflow\drl_working_g
(ServeController pid=3236)     self.algorithm.restore(checkpoint_path)
(ServeController pid=3236)   File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\
(ServeController pid=3236)     metadata = TrainableUtil.load_metadata(checkpoint_dir)
(ServeController pid=3236)   File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\
(ServeController pid=3236)     return pickle.load(f)
(ServeController pid=3236) ModuleNotFoundError: No module named 'utils'
(ServeReplica:default_ServePPOModel pid=13228) 2023-07-03 11:43:44,792  INFO algorithm.py:5G' or use the -v and -vv flags.
(ServeReplica:default_ServePPOModel pid=13228) 2023-07-03 11:43:44,832  WARNING util.py:68 
(ServeController pid=3236) INFO 2023-07-03 11:43:45,121 controller 3236 deployment_state.py
(ServeReplica:default_ServePPOModel pid=11520) `UnifiedLogger` will be removed in Ray 2.7.
(ServeReplica:default_ServePPOModel pid=11520)   return UnifiedLogger(config, logdir, logge
(ServeReplica:default_ServePPOModel pid=11520) The `JsonLogger interface is deprecated in f7.
(ServeReplica:default_ServePPOModel pid=11520) The `CSVLogger interface is deprecated in fa
(ServeReplica:default_ServePPOModel pid=11520) The `TBXLogger interface is deprecated in fa
Ray 2.7.
(ServeReplica:default_ServePPOModel pid=11520) C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainiPI is deprecated and may be removed in future Ray releases. You could suppress this warningss cluster]
(ServeReplica:default_ServePPOModel pid=11520)   self._loggers.append(cls(self.config, self
(ServeReplica:default_ServePPOModel pid=11520) 2023-07-03 11:43:51,308  WARNING util.py:68 
(ServeReplica:default_ServePPOModel pid=11520) 2023-07-03 11:43:51,308  WARNING util.py:68 
(ServeController pid=3236) ERROR 2023-07-03 11:43:51,387 controller 3236 deployment_state.ped.
(ServeController pid=3236) Traceback (most recent call last):
(ServeController pid=3236)   File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\
(ServeController pid=3236)     _, self._version = ray.get(self._ready_obj_ref)
(ServeController pid=3236)   File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\
(ServeController pid=3236)     return fn(*args, **kwargs)
(ServeController pid=3236)   File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\
(ServeController pid=3236)     return func(*args, **kwargs)
(ServeController pid=3236)   File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\
(ServeController pid=3236)     raise value.as_instanceof_cause()
(ServeController pid=3236) ray.exceptions.RayTaskError(RuntimeError): ray::ServeReplica:defcd5b5fa21abe619c2326fd01000000, repr=<ray.serve._private.replica.ServeReplica:default_Serve
(ServeController pid=3236)   File "python\ray\_raylet.pyx", line 1434, in ray._raylet.execu
(ServeController pid=3236)   File "python\ray\_raylet.pyx", line 1438, in ray._raylet.execu
(ServeController pid=3236)   File "python\ray\_raylet.pyx", line 1373, in ray._raylet.execu
(ServeController pid=3236)   File "python\ray\_raylet.pyx", line 3609, in ray._raylet.CoreW
(ServeController pid=3236)   File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\
(ServeController pid=3236)     return self.__get_result()
(ServeController pid=3236)   File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\
(ServeController pid=3236)     raise self._exception
(ServeController pid=3236)   File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\
(ServeController pid=3236)     return await method(self, *_args, **_kwargs)
(ServeController pid=3236)   File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\(ServeController pid=3236)     raise RuntimeError(traceback.format_exc()) from None
(ServeController pid=3236) RuntimeError: Traceback (most recent call last):
(ServeController pid=3236)   File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\(ServeController pid=3236)     await self._initialize_replica()
(ServeController pid=3236)   File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\
(ServeController pid=3236)     await sync_to_async(_callable.__init__)(*init_args, **init_k
(ServeController pid=3236)   File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\
(ServeController pid=3236)     return func(*args, **kwargs)
(ServeController pid=3236)   File "C:\Users\jAYANTH.NAIR\Desktop\drl_workflow\drl_working_g
(ServeController pid=3236)     self.algorithm.restore(checkpoint_path)
(ServeController pid=3236)   File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\
(ServeController pid=3236)     metadata = TrainableUtil.load_metadata(checkpoint_dir)
(ServeController pid=3236)   File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\
(ServeController pid=3236)     return pickle.load(f)
(ServeController pid=3236) ModuleNotFoundError: No module named 'utils'
(ServeController pid=3236) INFO 2023-07-03 11:43:51,614 controller 3236 deployment_state.py
Traceback (most recent call last):
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\site-packages\ray\serve\scr
    handle = serve.run(app, host=host, port=port)
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\site-packages\ray\serve\api
    client.deploy_application(
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\site-packages\ray\serve\_pr
    return f(self, *args, **kwargs)
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\site-packages\ray\serve\_pr
    self._wait_for_deployment_healthy(deployment_name)
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\site-packages\ray\serve\_pr
    raise RuntimeError(
RuntimeError: Deployment default_ServePPOModel is UNHEALTHY: The Deployment failed to start initial health check failing. See controller logs for details. Retrying after 1 seconds. E
ray::ServeReplica:default_ServePPOModel.initialize_and_get_metadata() (pid=11520, ip=127.0.Replica:default_ServePPOModel object at 0x00000255D4E65150>)
  File "python\ray\_raylet.pyx", line 1434, in ray._raylet.execute_task
  File "python\ray\_raylet.pyx", line 1438, in ray._raylet.execute_task
  File "python\ray\_raylet.pyx", line 1373, in ray._raylet.execute_task.function_executor
  File "python\ray\_raylet.pyx", line 3609, in ray._raylet.CoreWorker.run_async_func_or_cor
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\concurrent\futures\_base.py
    return self.__get_result()
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\concurrent\futures\_base.py
    raise self._exception
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\site-packages\ray\util\trac
    return await method(self, *_args, **_kwargs)
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\site-packages\ray\serve\_pr
    raise RuntimeError(traceback.format_exc()) from None
RuntimeError: Traceback (most recent call last):
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\site-packages\ray\serve\_pr
    await self._initialize_replica()
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\site-packages\ray\serve\_pr
    await sync_to_async(_callable.__init__)(*init_args, **init_kwargs)
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\site-packages\ray\_private\
    return func(*args, **kwargs)
  File "C:\Users\jAYANTH.NAIR\Desktop\drl_workflow\drl_working_group\deploy\.\serve_agent.p
    self.algorithm.restore(checkpoint_path)
  File "C:\Users\jAYANTH.NAIR\Miniconda3\envs\trainingenv25\lib\site-packages\ray\tune\trai
    metadata = TrainableUtil.load_metadata(checkpoint_dir)
ess this warning by setting R\Miniconda3\envs\trainingenv25\lib\site-packages\ray\tune\trai
 env variable PYTHONWARNINGS="ignore::DeprecationWarning" [repeated 2x across cluster]
(ServeReplica:default_ServePPOModel pid=21612)   self._loggers.append(cls(self.config, self.logdir, self.trial))
```

### Versions / Dependencies

OS: Windows 10

Packages:

absl-py==1.4.0
adal==1.2.7
aiohttp==3.8.4
aiohttp-cors==0.7.0
aiorwlock==1.3.0
aiosignal==1.3.1
ansicon==1.89.0
anyio==3.7.0
argcomplete==2.1.2
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
arrow==1.2.3
asttokens==2.2.1
async-timeout==4.0.2
attrs==23.1.0
azure-common==1.1.28
azure-core==1.27.1
azure-graphrbac==0.61.1
azure-identity==1.13.0
azure-mgmt-authorization==3.0.0
azure-mgmt-containerregistry==10.1.0
azure-mgmt-core==1.4.0
azure-mgmt-keyvault==10.2.2
azure-mgmt-resource==21.2.1
azure-mgmt-storage==20.1.0
azure-storage-blob==12.13.0
azureml-core==1.48.0
azureml-dataprep==4.8.6
azureml-dataprep-native==38.0.0
azureml-dataprep-rslex==2.15.2
azureml-dataset-runtime==1.48.0
azureml-defaults==1.48.0
azureml-inference-server-http==0.7.7
azureml-mlflow==1.52.0
backcall==0.2.0
backports.tempfile==1.0
backports.weakref==1.0.post1
bcrypt==4.0.1
beautifulsoup4==4.12.2
bleach==6.0.0
blessed==1.20.0
blinker==1.6.2
cachetools==5.3.1
certifi @ file:///C:/b/abs_85o_6fm0se/croot/certifi_1671487778835/work/certifi
cffi==1.15.1
charset-normalizer==3.1.0
click==8.1.3
cloudpickle==2.2.1
colorama==0.4.6
colorful==0.5.5
comm==0.1.3
contextlib2==21.6.0
cryptography==38.0.4
databricks-cli==0.17.7
debugpy==1.6.7
decorator==5.1.1
defusedxml==0.7.1
distlib==0.3.6
distro==1.8.0
dm-tree==0.1.8
docker==6.1.3
dotnetcore2==3.1.23
entrypoints==0.4
exceptiongroup==1.1.1
executing==1.2.0
fastapi==0.99.1
fastjsonschema==2.17.1
filelock==3.12.2
Flask==2.3.2
Flask-Cors==3.0.10
fqdn==1.5.1
frozenlist==1.3.3
fsspec==2023.6.0
fusepy==3.0.1
gast==0.5.4
gitdb==4.0.10
GitPython==3.1.31
google-api-core==2.11.1
google-auth==2.21.0
googleapis-common-protos==1.59.1
gpustat==1.1
grpcio==1.51.3
Gymnasium==0.26.3
gymnasium-notices==0.0.1
h11==0.14.0
humanfriendly==10.0
idna==3.4
imageio==2.31.1
importlib-metadata==6.7.0
inference-schema==1.5.1
ipykernel==6.23.3
ipython==8.14.0
ipython-genutils==0.2.0
ipywidgets==8.0.6
isodate==0.6.1
isoduration==20.11.0
itsdangerous==2.1.2
jedi==0.18.2
jeepney==0.8.0
Jinja2==3.1.2
jinxed==1.2.0
jmespath==1.0.1
jsonpickle==2.2.0
jsonpointer==2.4
jsonschema==4.17.3
jupyter==1.0.0
jupyter-console==6.6.3
jupyter-events==0.6.3
jupyter_client==8.3.0
jupyter_core==5.3.1
jupyter_server==2.7.0
jupyter_server_terminals==0.4.4
jupyterlab-pygments==0.2.2
jupyterlab-widgets==3.0.7
knack==0.10.1
lazy_loader==0.2
lz4==4.3.2
markdown-it-py==3.0.0
MarkupSafe==2.1.3
matplotlib-inline==0.1.6
mdurl==0.1.2
mistune==3.0.1
mlflow-skinny==2.4.1
mpmath==1.3.0
msal==1.22.0
msal-extensions==1.0.0
msgpack==1.0.5
msrest==0.7.1
msrestazure==0.6.4
multidict==6.0.4
nbclassic==1.0.0
nbclient==0.8.0
nbconvert==7.6.0
nbformat==5.9.0
ndg-httpsclient==0.5.1
nest-asyncio==1.5.6
networkx==3.1
notebook==6.5.4
notebook_shim==0.2.3
numpy==1.24.3
nvidia-ml-py==11.525.131
oauthlib==3.2.2
opencensus==0.11.2
opencensus-context==0.1.3
opencensus-ext-azure==1.1.9
overrides==7.3.1
packaging==21.3
pandas==2.0.2
pandocfilters==1.5.0
paramiko==2.12.0
parso==0.8.3
pathspec==0.11.1
pickleshare==0.7.5
Pillow==9.5.0
pkginfo==1.9.6
platformdirs==3.8.0
portalocker==2.7.0
prometheus-client==0.17.0
prompt-toolkit==3.0.38
protobuf==4.23.3
psutil==5.8.0
pure-eval==0.2.2
py-spy==0.3.14
pyarrow==6.0.1
pyasn1==0.5.0
pyasn1-modules==0.3.0
pycparser==2.21
pydantic==1.10.10
Pygments==2.15.1
PyJWT==2.7.0
PyNaCl==1.5.0
pyOpenSSL==22.1.0
pyparsing==3.1.0
pyreadline3==3.4.1
pyrsistent==0.19.3
PySocks==1.7.1
python-dateutil==2.8.2
python-json-logger==2.0.7
pytz==2023.3
PyWavelets==1.4.1
pywin32==306
pywinpty==2.0.10
PyYAML==6.0
pyzmq==25.1.0
qtconsole==5.4.3
QtPy==2.3.1
ray==2.5.0
ray-on-aml==0.2.4
requests==2.31.0
requests-oauthlib==1.3.1
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rich==13.4.2
rsa==4.9
scikit-image==0.21.0
scipy==1.10.1
SecretStorage==3.3.3
Send2Trash==1.8.2
six==1.16.0
smart-open==6.3.0
smmap==5.0.0
sniffio==1.3.0
soupsieve==2.4.1
sqlparse==0.4.4
stack-data==0.6.2
starlette==0.27.0
sympy==1.12
tabulate==0.9.0
tensorboardX==2.6.1
tensorflow-probability==0.20.1
terminado==0.17.1
tifffile==2023.4.12
tinycss2==1.2.1
torch==2.0.1
tornado==6.3.2
traitlets==5.9.0
typer==0.9.0
typing_extensions==4.6.3
tzdata==2023.3
uri-template==1.3.0
urllib3==1.26.16
uvicorn==0.22.0
virtualenv==20.21.0
waitress==2.1.2
wcwidth==0.2.6
webcolors==1.13
webencodings==0.5.1
websocket-client==1.6.1
Werkzeug==2.3.6
widgetsnbextension==4.0.7
wincertstore==0.2
wrapt==1.12.1
yarl==1.9.2
zipp==3.15.0

### Reproduction script

```python
from starlette.requests import Request
import ray.rllib.algorithms.ppo as ppo
from ray import serve
import gymnasium.spaces as spaces
import numpy as np
from pathlib import Path

folder_path = "checkpoint_000020"
PATH_TO_CHECKPOINT = Path(__file__).absolute().parent.parent / "inference_checkpoints" / folder_path

observation_space = spaces.Box(
    low=np.array([0, 0, 0]),
    high=np.array([30000, 20, 300]),
    shape=(3,),
    dtype=np.float32,
)
action_space = spaces.Box(
    low=np.array([0, 0]),
    high=np.array([10, 200]),
    shape=(2,),
    dtype=np.float32,
)


@serve.deployment
class ServePPOModel:
    def __init__(self, checkpoint_path) -> None:
        config = (
            ppo.PPOConfig()
            .framework("torch")
            .rollouts(num_rollout_workers=0)
        )

        self.algorithm = config.environment(
            env=None,
            observation_space=observation_space,
            action_space=action_space,
        ).build()

        self.algorithm.restore(checkpoint_path)
        print("restored!")

    async def __call__(self, request: Request):
        json_input = await request.json()
        # "observation" is the key; the list of states is the value in the
        # dictionary we send as data.
        obs = json_input["observation"]
        action = self.algorithm.compute_single_action(obs)
        return {"action": action}


agent = ServePPOModel.bind(PATH_TO_CHECKPOINT)
```

### Issue Severity

High: It blocks me from completing my task.

shrekris-anyscale commented 1 year ago

Hi @jayanthnair, thanks for submitting this! This is a known issue that should be fixed by #36324. Could you retry your experiment with the Ray nightly and check if it still fails?

jayanthnair commented 1 year ago

Hi @shrekris-anyscale, I tried the following 2 things:

1. I created a fresh conda environment on my local machine and installed the Ray nightly with `pip install -U "ray[rllib,data,serve] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp310-cp310-win_amd64.whl"`. I tried serving the old checkpoint, which was created with a previous Ray version, and got the same error.
2. To rule out package dependency incompatibilities, I reran the experiment on AzureML using the Ray nightly (Linux version: `ray[data,rllib,serve] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp310-cp310-manylinux2014_x86_64.whl`) on my compute cluster. After downloading the new checkpoint files and trying again, I still see the same issue.

Anything else I should be trying?

GeneDer commented 1 year ago

@jayanthnair I tried with the exact tutorial code

```python
# serve_agent.py

import ray
import ray.rllib.algorithms.ppo as ppo
from ray import serve

def train_ppo_model():
    # Configure our PPO algorithm.
    config = (
        ppo.PPOConfig()
        .environment("CartPole-v1")
        .framework("torch")
        .rollouts(num_rollout_workers=0)
    )
    # Create a `PPO` instance from the config.
    algo = config.build()
    # Train for one iteration.
    algo.train()
    # Save state of the trained Algorithm in a checkpoint.
    checkpoint_dir = algo.save("/tmp/rllib_checkpoint")
    return checkpoint_dir

checkpoint_path = train_ppo_model()

from starlette.requests import Request

@serve.deployment
class ServePPOModel:
    def __init__(self, checkpoint_path) -> None:
        # Re-create the originally used config.
        config = ppo.PPOConfig()\
            .framework("torch")\
            .rollouts(num_rollout_workers=0)
        # Build the Algorithm instance using the config.
        self.algorithm = config.build(env="CartPole-v0")
        # Restore the algo's state from the checkpoint.
        self.algorithm.restore(checkpoint_path)

    async def __call__(self, request: Request):
        json_input = await request.json()
        obs = json_input["observation"]

        action = self.algorithm.compute_single_action(obs)
        return {"action": int(action)}

ppo_model = ServePPOModel.bind(checkpoint_path)
serve.run(ppo_model)
```

and started it with `serve run serve_agent:agent`.

I was able to query like so
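A query against this deployment might look like the following sketch. It assumes that Serve is listening on its default local address, `http://localhost:8000/`, and it uses the `observation` key that `__call__` reads. The helper names `build_request` and `query` are hypothetical.

```python
import json
import urllib.request

# Assumption: Serve is reachable on its default local HTTP address.
SERVE_URL = "http://localhost:8000/"


def build_request(observation, url=SERVE_URL):
    """Build a POST request whose JSON body carries the observation."""
    body = json.dumps({"observation": observation}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def query(observation, url=SERVE_URL):
    """Send the observation to the deployment and decode the JSON reply."""
    with urllib.request.urlopen(build_request(observation, url)) as response:
        return json.loads(response.read())
```

With the application running, calling `query([0.0, 0.1, -0.05, 0.2])` would send one four-component CartPole observation; the values here are arbitrary.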

I also just tried with the very simple app file structure like

```
- hello_serve.py
- utils/
    - test.py
```

and the file content like

```python
# hello_serve.py
import time
from ray import serve
from starlette.requests import Request
from utils.test import hello

@serve.deployment
class HelloModel:
    def __init__(self):
        hello()

    async def __call__(self, starlette_request: Request) -> None:
        hello()
        return f"{hello()}, {time.time()}"

model = HelloModel.bind()
```

```python
# test.py
def hello():
    text = "hello_from_utils"
    print(text)
    return text
```

When I run `serve run hello_serve:model`, everything still works as expected, and the app returns the text and the timestamp. Can you try the above examples and see if you get the same `utils` import error?

GeneDer commented 1 year ago

Also, just to double-check, can you run `ray --version` to confirm that the installed version is the latest nightly?

If you already have Ray installed, you may need to run `pip uninstall -y ray` before installing from the wheel. Something like `pip uninstall -y ray && pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp310-cp310-manylinux2014_x86_64.whl` might help.
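The installed version can also be checked from Python. This is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical, and it returns `None` when the distribution is not installed in the active environment.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


print("ray:", installed_version("ray"))
```

Since the nightly wheels linked in this thread are named `ray-3.0.0.dev0`, a nightly install would be expected to report a version beginning with `3.0.0.dev0`.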

jayanthnair commented 1 year ago

@GeneDer Thanks for the response. I think I might have finally figured this out.

When I have a folder structure like this

```
- deploy/
    - serve_agent.py
- inference_checkpoints/
    - checkpoint_00010/
        - checkpoint files
- utils/
    - some_util.py
```

and I run the `serve run` command from the deploy folder, I get the `No module named 'utils'` error. However, when I copy the serve_agent script and move it up a level, i.e. into the same base folder as the utils folder, the error goes away. Curiously, none of the scripts saved in the utils folder are needed for deployment. And if I rename the utils folder, I get the same error again. So it seems that it is looking for a module named utils in the same working folder as the deployment script. Is this intended?
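One plausible reading of the traceback (an assumption; the thread does not confirm it) is that `pickle.load` fails because the checkpoint metadata references an object defined in the training-time `utils` package. Pickle stores such objects by module path rather than by value, so the module must be importable wherever the file is loaded, even if the serving script never imports it. The sketch below reproduces that behavior with a hypothetical package `training_helpers_demo` and a hypothetical class `RewardShaper`; neither comes from the original project.

```python
import importlib
import pickle
import sys
import tempfile
from pathlib import Path

# Hypothetical package name, used only for this illustration.
PKG = "training_helpers_demo"


def make_package(root: Path) -> None:
    """Create <root>/<PKG>/__init__.py defining a tiny class."""
    pkg_dir = root / PKG
    pkg_dir.mkdir()
    (pkg_dir / "__init__.py").write_text(
        "class RewardShaper:\n"
        "    def __init__(self, scale):\n"
        "        self.scale = scale\n"
    )


def forget_package(root: Path) -> None:
    """Simulate a process that cannot see the package."""
    sys.path.remove(str(root))
    for name in [m for m in sys.modules if m == PKG or m.startswith(PKG + ".")]:
        del sys.modules[name]
    importlib.invalidate_caches()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        make_package(root)

        # "Training" side: the package is importable, so pickling works.
        sys.path.insert(0, str(root))
        importlib.invalidate_caches()
        module = importlib.import_module(PKG)
        payload = pickle.dumps(module.RewardShaper(scale=2.0))

        # "Serving" side, started elsewhere: the package is not importable.
        forget_package(root)
        try:
            pickle.loads(payload)
            error = ""
        except ModuleNotFoundError as exc:
            error = str(exc)

        # Making the package importable again lets the same bytes load.
        sys.path.insert(0, str(root))
        importlib.invalidate_caches()
        restored = pickle.loads(payload)
        forget_package(root)
        return error, restored.scale
```

Under this reading, moving `serve_agent.py` next to the `utils` folder works because the folder then becomes importable in the replica process, and renaming the folder breaks it again because the pickled reference still points at the old module name.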

GeneDer commented 1 year ago

Hi @jayanthnair. Yes, custom modules should live in the same directory as the deployment script or in one of its subdirectories. There is some Ray code that adds the deployment script's directory to the Python import path so that these modules are importable. The bug we fixed previously was that a custom utils module collided with Ray's utils, and Ray's utils took import precedence. With the latest Ray, the user's utils should take import precedence, given that it lives in the same directory as the deployment script :)

Thanks for reporting that your issue is fixed! Feel free to let me know if you have other questions!
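The precedence point can be illustrated without Ray. The sketch below creates two directories that each contain a module with the same name and shows that the copy whose directory comes earlier in `sys.path` is the one imported. The module name `shared_name_demo` and the `ORIGIN` attribute are hypothetical; this is plain Python import behavior, not Ray's internal code.

```python
import importlib
import sys
import tempfile
from pathlib import Path

MODULE = "shared_name_demo"  # hypothetical module name


def import_fresh(search_path):
    """Import MODULE with the given directories prepended to sys.path."""
    sys.modules.pop(MODULE, None)
    importlib.invalidate_caches()
    original = list(sys.path)
    sys.path[:0] = search_path
    try:
        return importlib.import_module(MODULE).ORIGIN
    finally:
        # Restore the interpreter state so the demo leaves no trace.
        sys.path[:] = original
        sys.modules.pop(MODULE, None)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        user_dir = Path(tmp) / "user_project"
        lib_dir = Path(tmp) / "library"
        for directory, origin in ((user_dir, "user"), (lib_dir, "library")):
            directory.mkdir()
            (directory / f"{MODULE}.py").write_text(f"ORIGIN = {origin!r}\n")

        # Same two directories, opposite order: the first match wins.
        library_first = import_fresh([str(lib_dir), str(user_dir)])
        user_first = import_fresh([str(user_dir), str(lib_dir)])
        return library_first, user_first
```

The first element of the returned pair corresponds to the collision described above, where the library's module shadows the user's; the second corresponds to the fixed ordering, where the user's directory is searched first.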

ray-project / ray

[<Ray component: Serve>] ModuleNotFoundError : No module named utils #37042
