ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Autoscaler][GCP] Autofill GCP node type resources #27888

Open DmitriGekhtman opened 1 year ago

DmitriGekhtman commented 1 year ago

Description

For AWS, we look up a node's resource capacities automatically based on the instance type. https://github.com/ray-project/ray/blob/a3236b62254348ed1c09062d57be7a2fd8e6c4af/python/ray/autoscaler/_private/aws/node_provider.py#L595

It wouldn't hurt to implement the same logic for GCP, to make life simpler for users of Ray on GCP.

Use case

Simplify configuration for GCP VMs.
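
For illustration, the requested behavior could be sketched roughly as below. This is a minimal sketch, not the AWS implementation and not an existing Ray function: `fill_node_type_resources` and the `machine_types` lookup table are hypothetical names, and the table is assumed to hold entries shaped like Compute Engine machine type records (`guestCpus`, `memoryMb`, `accelerators`). How memory should be reported (raw bytes here) is also an assumption.

```python
import copy
from typing import Any, Dict


def fill_node_type_resources(
    cluster_config: Dict[str, Any],
    machine_types: Dict[str, Dict[str, Any]],
) -> Dict[str, Any]:
    """Return a copy of the config with per-node-type resources filled in.

    Hypothetical helper: `machine_types` maps a machine type name to a record
    with `guestCpus`, `memoryMb` and (optionally) `accelerators`.
    """
    config = copy.deepcopy(cluster_config)
    for node_type in config.get("available_node_types", {}).values():
        machine_type = node_type.get("node_config", {}).get("machineType")
        if not machine_type:
            continue
        # machineType may be a bare name or a longer path ending in the name.
        info = machine_types.get(machine_type.split("/")[-1])
        if info is None:
            # Unknown machine type: leave the user's config untouched.
            continue
        detected = {
            "CPU": info["guestCpus"],
            # Assumption: report memory in bytes.
            "memory": info["memoryMb"] * 1024 * 1024,
        }
        gpus = sum(
            acc.get("guestAcceleratorCount", 0)
            for acc in info.get("accelerators", [])
        )
        if gpus:
            detected["GPU"] = gpus
        # Resources the user set explicitly take precedence over detected ones.
        detected.update(node_type.get("resources") or {})
        node_type["resources"] = detected
    return config
```

With a helper like this, a node type whose `resources` field is empty would pick up CPU, memory, and GPU counts from its `machineType`, while any value the user wrote by hand is kept.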

sushant-t commented 1 year ago

@DmitriGekhtman is this still available? I know this is quite a few months after posting, so I wanted to check!

DmitriGekhtman commented 1 year ago

This is indeed still open.

sushant-t commented 1 year ago

Perfect. Is there any protocol for marking myself as the one working on this task?

DmitriGekhtman commented 1 year ago

I've marked you as the assignee :)

sushant-t commented 1 year ago

@DmitriGekhtman awesome! I think I have implemented the majority of the logic. Had a couple of questions though.

  1. Do you know how I can recreate this available_node_types mock data for GCP? This is the mock data for AWS: https://github.com/ray-project/ray/blob/1012fbc375d55540fb944a1176a8a54af34182f6/python/ray/autoscaler/aws/example-multi-node-type.yaml#L20

  2. I noticed that the AWS logic uses a caching scheme to make a boto3 client available in list_ec2_instances: https://github.com/ray-project/ray/blob/1012fbc375d55540fb944a1176a8a54af34182f6/python/ray/autoscaler/_private/aws/node_provider.py#L85 With GCP, how can I make a GCPCompute or GCPResource available so that I can list instance types? In other words, can I just construct a new one, as is done here: https://github.com/ray-project/ray/blob/1012fbc375d55540fb944a1176a8a54af34182f6/python/ray/autoscaler/_private/gcp/node_provider.py#L78 or is there similar caching logic that can be applied here?
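
One generic way to get the caching behavior asked about in question 2 is to memoize a client factory with `functools.lru_cache`, so repeated lookups with the same arguments reuse one object. The sketch below only illustrates that pattern: `FakeComputeClient` is a stand-in class and `cached_compute_client` is a hypothetical name, neither is part of Ray's GCP node provider, and whether the real GCP client objects are safe to share this way is an assumption that would need checking.

```python
from functools import lru_cache


class FakeComputeClient:
    """Stand-in for an expensive-to-build API client (hypothetical)."""

    instances_created = 0  # counts constructions, to make the caching visible

    def __init__(self, project_id: str, zone: str) -> None:
        self.project_id = project_id
        self.zone = zone
        FakeComputeClient.instances_created += 1


@lru_cache(maxsize=None)
def cached_compute_client(project_id: str, zone: str) -> FakeComputeClient:
    """Build a client once per (project_id, zone) pair and reuse it afterwards.

    lru_cache keys on the arguments, so they must be hashable (strings here).
    """
    return FakeComputeClient(project_id, zone)
```

Calling `cached_compute_client("my-project", "us-west1-a")` twice returns the same object, while a different zone yields a new client. Constructing a fresh client on each call is the simpler alternative if construction is cheap.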