Open mahaddad opened 1 year ago
Thanks, will try to reproduce and get back to you!
Regarding aviary models: unfortunately, it would be non-trivial to make it return only models that are already deployed. The fact that it is not showing new models, however, is an issue.
The autoscaler logs would also be helpful (on the head node: /tmp/ray/session_latest/logs/monitor.log and monitor.err). We have had trouble provisioning g5 nodes ourselves, and I think this may be a simple AWS capacity issue.
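A minimal sketch for pulling the interesting lines out of those two files before attaching them here. The log paths are the ones mentioned above; the keyword list is only an assumption about what a provisioning failure might look like, not a documented autoscaler message format, so adjust it to whatever the logs actually contain.

```python
from pathlib import Path
from typing import Iterable, List

# Paths quoted in the comment above.
LOG_DIR = Path("/tmp/ray/session_latest/logs")
LOG_FILES = ("monitor.log", "monitor.err")

# Hypothetical search terms; not taken from any documented log format.
KEYWORDS = ("error", "failed", "capacity")


def matching_lines(lines: Iterable[str], keywords: Iterable[str] = KEYWORDS) -> List[str]:
    """Return the lines that contain any keyword, case-insensitively."""
    lowered = [k.lower() for k in keywords]
    return [
        line.rstrip("\n")
        for line in lines
        if any(k in line.lower() for k in lowered)
    ]


def scan_logs(log_dir: Path = LOG_DIR) -> None:
    """Print matching lines from each autoscaler log file that exists."""
    for name in LOG_FILES:
        path = log_dir / name
        if not path.exists():
            print(f"{path}: not found")
            continue
        with path.open(errors="replace") as handle:
            for line in matching_lines(handle):
                print(f"{name}: {line}")


if __name__ == "__main__":
    scan_logs()
```

Run it on the head node; if nothing matches, the full files are still the most useful thing to share.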
I am also happy to schedule some time on Friday to debug together! Let me know if you are interested.
Finally, I would recommend deploying with multiple models specified at once, instead of calling aviary run several times: aviary run --model MODEL1 --model MODEL2.
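As a rough illustration of that suggestion, the sketch below builds a single aviary run invocation from a list of model files. It relies only on the repeated --model flag shown above; the helper name build_aviary_run is hypothetical, and the two paths are the ones from the original report.

```python
import shlex
from typing import List, Sequence


def build_aviary_run(model_paths: Sequence[str]) -> List[str]:
    """Build one `aviary run` command that passes --model once per path."""
    if not model_paths:
        raise ValueError("at least one model path is required")
    cmd = ["aviary", "run"]
    for path in model_paths:
        cmd += ["--model", path]
    return cmd


# Model files named in the original report.
models = [
    "./models/static_batching/mosaicml--mpt-7b-instruct.yaml",
    "./models/static_batching/OpenAssistant--falcon-7b-sft-top1-696.yaml",
]

# Print the command as a shell-safe string for copy and paste.
print(shlex.join(build_aviary_run(models)))
```

The returned list can also be handed to subprocess.run directly if you prefer to launch it from a script.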
I would love the chance to debug together. I can make myself available on Friday at whatever time works best for you. Can you shoot me a note at michael@konko.ai?
In the meantime, I will try running aviary run with multiple models as you suggested and do some further testing to capture the logs you mentioned.
Hi Aviary team,
Thanks for the great package. I am trying to get it to work for my use case and I am running into several issues. Details are provided below. Let me know if I can provide any additional information to help identify the root cause.
I am using the latest Docker image and the default deploy/ray/aviary-cluster.yaml with the following change:
    gpu_worker_g5:
      node_config:
        InstanceType: g5.4xlarge
        BlockDeviceMappings: *mount
      resources:
        worker_node: 1
        instance_type_g5: 1
        accelerator_type_a10: 1
      min_workers: 0
      max_workers: 8
When I run
export AVIARY_URL="http://localhost:8000"
aviary run --model ./models/static_batching/mosaicml--mpt-7b-instruct.yaml
aviary run --model ./models/static_batching/OpenAssistant--falcon-7b-sft-top1-696.yaml
Falcon-7b deploys successfully, but mpt-7b-instruct never deploys: it hangs for about an hour and then reports that it failed. Retrying gives the same result, and so does trying a different model. I am well below the vCPU quota on G instances. I also tried vicuna13b, and that also failed to launch a GPU instance.
Also, aviary models shows mpt-7b-instruct as running although it is not. For some reason falcon-7b is not shown even though it actually is running. If you ping /-/routes directly, you see both models. The expected behavior is that aviary models lists only the running models that are available to be queried.
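To make that comparison repeatable, here is a small sketch that fetches /-/routes and reports entries that the aviary models output does not list. It assumes the endpoint returns a JSON object keyed by route prefix, and that a route prefix maps to a model id by replacing the first "--" with "/" (inferred from the names in this report, not from Aviary documentation). Both function names are hypothetical.

```python
import json
from typing import Dict, Iterable, List
from urllib.request import urlopen


def fetch_routes(base_url: str) -> Dict[str, str]:
    """GET <base_url>/-/routes and decode the JSON body (assumed to be an object)."""
    with urlopen(f"{base_url.rstrip('/')}/-/routes", timeout=10) as resp:
        return json.load(resp)


def routed_but_not_listed(routes: Dict[str, str], listed_models: Iterable[str]) -> List[str]:
    """Return model ids that have a route but are missing from the listing.

    Assumption: a route prefix such as "/org--name" corresponds to model id "org/name".
    """
    listed = set(listed_models)
    missing = []
    for prefix in routes:
        name = prefix.strip("/")
        if not name:
            continue  # skip the root route, which is not a model
        model_id = name.replace("--", "/", 1)
        if model_id not in listed:
            missing.append(model_id)
    return sorted(missing)


# Example usage against a running cluster (not executed here):
#   routes = fetch_routes("http://localhost:8000")
#   print(routed_but_not_listed(routes, ["mosaicml/mpt-7b-instruct"]))
```

If the route prefixes in your deployment look different, only the name-mapping line needs to change.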
(base) ray@ip-172-31-52-1:~$ aviary models
Connecting to Aviary backend at: http://localhost:8000/
mosaicml/mpt-7b-instruct
(base) ray@ip-172-31-52-1:~$ ray list actors --detail
ray.kill. owner_id: 7e36611f04a1dd7649752f6ca82dbfb2d75ebdb4370831b8fc9c446c owner_ip_address: 172.31.52.1 node_ip_address: 172.31.50.34 pid: 578 name: SERVE_REPLICA::mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct#HoiArw ray_namespace: serve class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct actor_id: 014519cc10a3c1393952282303000000 never_started: false is_detached: true placement_group_id: null repr_name: ''
ray.kill. owner_id: 7e36611f04a1dd7649752f6ca82dbfb2d75ebdb4370831b8fc9c446c owner_ip_address: 172.31.52.1 node_ip_address: 172.31.50.34 pid: 748 name: SERVE_REPLICA::mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct#ZJCwbG ray_namespace: serve class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct actor_id: 3e24b48d34bd94402134c1f403000000 never_started: false is_detached: true placement_group_id: null repr_name: ''
ray.kill. owner_id: 7e36611f04a1dd7649752f6ca82dbfb2d75ebdb4370831b8fc9c446c owner_ip_address: 172.31.52.1 node_ip_address: 172.31.50.34 pid: 476 name: SERVE_REPLICA::mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct#ZHeJeU ray_namespace: serve class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct actor_id: 642421606c9735b2b038281503000000 never_started: false is_detached: true placement_group_id: null repr_name: ''
ray.kill. owner_id: 7e36611f04a1dd7649752f6ca82dbfb2d75ebdb4370831b8fc9c446c owner_ip_address: 172.31.52.1 node_ip_address: 172.31.50.34 pid: 714 name: SERVE_REPLICA::mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct#PiHekO ray_namespace: serve class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct actor_id: 76073afc503c62fb0fc0c2a303000000 never_started: false is_detached: true placement_group_id: null repr_name: ''
ray.kill. owner_id: 7e36611f04a1dd7649752f6ca82dbfb2d75ebdb4370831b8fc9c446c owner_ip_address: 172.31.52.1 node_ip_address: 172.31.50.34 pid: 371 name: SERVE_REPLICA::mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct#dgCkoa ray_namespace: serve class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct actor_id: 99f4b1e5b33343787c6ebbda03000000 never_started: false is_detached: true placement_group_id: null repr_name: ''
ray.kill. owner_id: 7e36611f04a1dd7649752f6ca82dbfb2d75ebdb4370831b8fc9c446c owner_ip_address: 172.31.52.1 node_ip_address: 172.31.52.1 pid: 688 name: SERVE_REPLICA::router_Router#SnpzWO ray_namespace: serve class_name: ServeReplica:router_Router actor_id: 9ed8f9167aff1416f9b40fb903000000 never_started: false is_detached: true placement_group_id: null repr_name: ''
ray.kill. owner_id: 7e36611f04a1dd7649752f6ca82dbfb2d75ebdb4370831b8fc9c446c owner_ip_address: 172.31.52.1 node_ip_address: 172.31.50.34 pid: 680 name: SERVE_REPLICA::mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct#NUJehq ray_namespace: serve class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct actor_id: aa085f6ae7a2de244227f74b03000000 never_started: false is_detached: true placement_group_id: null repr_name: ''
ray.kill. owner_id: 7e36611f04a1dd7649752f6ca82dbfb2d75ebdb4370831b8fc9c446c owner_ip_address: 172.31.52.1 node_ip_address: 172.31.52.1 pid: 601 name: SERVE_REPLICA::router_Router#YVeHIq ray_namespace: serve class_name: ServeReplica:router_Router actor_id: bdefd5bdada986dc6a86c20f03000000 never_started: false is_detached: true placement_group_id: null repr_name: ''
ray.kill. owner_id: 7e36611f04a1dd7649752f6ca82dbfb2d75ebdb4370831b8fc9c446c owner_ip_address: 172.31.52.1 node_ip_address: 172.31.50.34 pid: 337 name: SERVE_REPLICA::mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct#JlnvAU ray_namespace: serve class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct actor_id: bfff95555c8aad324470ffd303000000 never_started: false is_detached: true placement_group_id: null repr_name: ''
ray.kill. owner_id: 7e36611f04a1dd7649752f6ca82dbfb2d75ebdb4370831b8fc9c446c owner_ip_address: 172.31.52.1 node_ip_address: 172.31.50.34 pid: 405 name: SERVE_REPLICA::mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct#DGeusc ray_namespace: serve class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct actor_id: cad3c4d91c30b987dd98e33203000000 never_started: false is_detached: true placement_group_id: null repr_name: ''
ray.kill. owner_id: 7e36611f04a1dd7649752f6ca82dbfb2d75ebdb4370831b8fc9c446c owner_ip_address: 172.31.52.1 node_ip_address: 172.31.50.34 pid: 510 name: SERVE_REPLICA::mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct#NeMVAn ray_namespace: serve class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct actor_id: ce3265064d0a2c858b25fde703000000 never_started: false is_detached: true placement_group_id: null repr_name: ''
ray.kill. owner_id: 7e36611f04a1dd7649752f6ca82dbfb2d75ebdb4370831b8fc9c446c owner_ip_address: 172.31.52.1 node_ip_address: 172.31.52.1 pid: 857 name: SERVE_REPLICA::router_Router#PzsVij ray_namespace: serve class_name: ServeReplica:router_Router actor_id: dfe8997fd860700452da4ad103000000 never_started: false is_detached: true placement_group_id: null repr_name: ''
ray.kill. owner_id: 7e36611f04a1dd7649752f6ca82dbfb2d75ebdb4370831b8fc9c446c owner_ip_address: 172.31.52.1 node_ip_address: 172.31.50.34 pid: 439 name: SERVE_REPLICA::mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct#oVTtwj ray_namespace: serve class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct actor_id: e1ea02ba81a4f7f6704fd01603000000 never_started: false is_detached: true placement_group_id: null repr_name: ''
ray.kill. owner_id: 7e36611f04a1dd7649752f6ca82dbfb2d75ebdb4370831b8fc9c446c owner_ip_address: 172.31.52.1 node_ip_address: 172.31.50.34 pid: 544 name: SERVE_REPLICA::mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct#ZiPkFw ray_namespace: serve class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct actor_id: e6460cb5a1bd9c9875d71ffa03000000 never_started: false is_detached: true placement_group_id: null repr_name: ''
ray.kill. owner_id: 7e36611f04a1dd7649752f6ca82dbfb2d75ebdb4370831b8fc9c446c owner_ip_address: 172.31.52.1 node_ip_address: 172.31.50.34 pid: 646 name: SERVE_REPLICA::mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct#qObhMp ray_namespace: serve class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct actor_id: e8db0e9b51037e8f85f03fec03000000 never_started: false is_detached: true placement_group_id: null repr_name: ''
ray.kill. owner_id: 7e36611f04a1dd7649752f6ca82dbfb2d75ebdb4370831b8fc9c446c owner_ip_address: 172.31.52.1 node_ip_address: 172.31.50.34 pid: 156 name: SERVE_REPLICA::mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct#XrHvwh ray_namespace: serve class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct actor_id: eeeb659a56054345b32fc10503000000 never_started: false is_detached: true placement_group_id: null repr_name: ''
ray.kill. owner_id: 7e36611f04a1dd7649752f6ca82dbfb2d75ebdb4370831b8fc9c446c owner_ip_address: 172.31.52.1 node_ip_address: 172.31.50.34 pid: 303 name: SERVE_REPLICA::mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct#NGyLdh ray_namespace: serve class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct actor_id: f85cc894e7c8b87a37a3da9203000000 never_started: false is_detached: true placement_group_id: null repr_name: ''
ray.kill. owner_id: 7e36611f04a1dd7649752f6ca82dbfb2d75ebdb4370831b8fc9c446c owner_ip_address: 172.31.52.1 node_ip_address: 172.31.50.34 pid: 612 name: SERVE_REPLICA::mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct#EYOIDW ray_namespace: serve class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct actor_id: fff57bdd81a5b9190538fcd003000000 never_started: false is_detached: true placement_group_id: null repr_name: ''
...
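Since that listing is long, a quick tally by class_name may make the pattern easier to see. This is only a sketch that works on the pasted text: it assumes each record carries a "class_name: <value>" field with no spaces in the value, as in the excerpt above, and the function name count_by_class is hypothetical.

```python
import re
from collections import Counter

# Matches "class_name: <value>" where the value contains no whitespace.
CLASS_NAME = re.compile(r"\bclass_name:\s*(\S+)")


def count_by_class(listing: str) -> Counter:
    """Count actor records per class_name in pasted `ray list actors --detail` text."""
    return Counter(CLASS_NAME.findall(listing))


# Two shortened records in the same shape as the excerpt above.
sample = (
    "pid: 578 ray_namespace: serve "
    "class_name: ServeReplica:mosaicml--mpt-7b-instruct_mosaicml--mpt-7b-instruct "
    "never_started: false\n"
    "pid: 688 ray_namespace: serve "
    "class_name: ServeReplica:router_Router never_started: false\n"
)

for class_name, count in count_by_class(sample).most_common():
    print(f"{count:3d}  {class_name}")
```

Feeding the full output through count_by_class separates the repeated mpt-7b-instruct replica records from the router records.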