I was trying to run Llama 2 on a machine with a V100 GPU.
I ran
aviary run --model ~/models/continuous_batching/meta-llama--Llama-2-7b-chat-hf.yaml
inside the Docker container and got the following log output (no stack trace, but the deployment never finishes scheduling):
(HTTPProxyActor pid=2448) INFO 2023-08-22 23:38:44,774 http_proxy 172.17.0.2 http_proxy.py:904 - Proxy actor f5a0692e60801e1b0ef45a8301000000 starting on node 57297f3255438333c74bdc7b75d3fd3aa4b1c48e7bdcf6d07db72a41.
[INFO 2023-08-22 23:38:44,824] api.py: 320 Started detached Serve instance in namespace "serve".
(HTTPProxyActor pid=2448) INFO: Started server process [2448]
[INFO 2023-08-22 23:38:44,951] api.py: 300 Connecting to existing Serve app in namespace "serve". New http options will not be applied.
(ServeController pid=2420) INFO 2023-08-22 23:38:44,942 controller 2420 deployment_state.py:1319 - Deploying new version of deployment meta-llama--Llama-2-7b-chat-hf_meta-llama--Llama-2-7b-chat-hf.
(ServeController pid=2420) INFO 2023-08-22 23:38:45,046 controller 2420 deployment_state.py:1586 - Adding 1 replica to deployment meta-llama--Llama-2-7b-chat-hf_meta-llama--Llama-2-7b-chat-hf.
(ServeController pid=2420) INFO 2023-08-22 23:38:45,083 controller 2420 deployment_state.py:1319 - Deploying new version of deployment router_Router.
(ServeController pid=2420) INFO 2023-08-22 23:38:45,187 controller 2420 deployment_state.py:1586 - Adding 2 replicas to deployment router_Router.
(ServeReplica:router_Router pid=2480) There was a problem when trying to write in your cache folder (/home/jupyter/cache/data/hub). You should set the environment variable TRANSFORMERS_CACHE to a writable directory.
(autoscaler +15s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +15s) Error: No available node types can fulfill resource request {'accelerator_type_a10': 0.01, 'CPU': 1.0}. Add suitable node types to this cluster to resolve this issue.
(ServeController pid=2420) WARNING 2023-08-22 23:39:15,112 controller 2420 deployment_state.py:1889 - Deployment "meta-llama--Llama-2-7b-chat-hf_meta-llama--Llama-2-7b-chat-hf" has 1 replicas that have taken more than 30s to be scheduled. This may be caused by waiting for the cluster to auto-scale, or waiting for a runtime environment to install. Resources required for each replica: {"accelerator_type_a10": 0.01, "CPU": 1}, resources available: {"CPU": 14.0}.
(ServeReplica:router_Router pid=2479) There was a problem when trying to write in your cache folder (/home/jupyter/cache/data/hub). You should set the environment variable TRANSFORMERS_CACHE to a writable directory.
(autoscaler +50s) Error: No available node types can fulfill resource request {'accelerator_type_a10': 0.01, 'CPU': 1.0}. Add suitable node types to this cluster to resolve this issue.
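From the autoscaler error, it looks like the deployment requests a custom accelerator_type_a10 resource that a V100 node does not advertise. For reference, this is what I assume the relevant scaling_config section of the model YAML looks like (reconstructed from the error message, so it may not match the shipped file exactly):

scaling_config:
  num_workers: 1
  num_gpus_per_worker: 1
  resources_per_worker:
    accelerator_type_a10: 0.01  # assumed: the custom resource named in the error; a V100 node would not expose it

If that is the cause, I could presumably retag it as something like accelerator_type_v100 (or drop resources_per_worker entirely), but I don't know whether that is a supported configuration.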
Is aviary incompatible with V100 GPUs?
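Separately, both Router replicas warn that the cache folder is not writable. I assume that warning is unrelated to the scheduling problem and can be worked around by pointing TRANSFORMERS_CACHE at a writable directory before launching, e.g. (the path here is just an example):

export TRANSFORMERS_CACHE=/tmp/transformers-cache  # any writable directory works
mkdir -p "$TRANSFORMERS_CACHE"
aviary run --model ~/models/continuous_batching/meta-llama--Llama-2-7b-chat-hf.yaml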