skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

[GCP] Invalid value for field 'resource.instanceProperties.labels' #3526

Closed dzlab closed 4 months ago

dzlab commented 4 months ago

I'm trying to run the axolotl example on GCP with the following commands:

cd skypilot/llm/axolotl
HF_TOKEN=hf_ sky launch axolotl.yaml --env HF_TOKEN

But SkyPilot is not able to create instances on GCP. It seems GCP is rejecting the value of resource.instanceProperties.labels because it contains a period (.), which is not valid. This is the error message from the log:

W 05-08 13:21:46 instance_utils.py:112] Got return code 'invalid' in us-east4-a: "Invalid value for field 'resource.instanceProperties.labels': ''. Label value 'firstname.lastname' violates format constraints. The value can only contain lowercase letters, numeric characters, underscores and dashes. The value can be at most 63 characters long. International characters are allowed"

This is the full task log:

Task from YAML spec: axolotl.yaml
I 05-08 13:21:04 optimizer.py:694] == Optimizer ==
I 05-08 13:21:04 optimizer.py:705] Target: minimizing cost
I 05-08 13:21:04 optimizer.py:717] Estimated cost: $0.7 / hour
I 05-08 13:21:04 optimizer.py:717] 
I 05-08 13:21:04 optimizer.py:842] Considered resources (1 node):
I 05-08 13:21:04 optimizer.py:912] --------------------------------------------------------------------------------------------
I 05-08 13:21:04 optimizer.py:912]  CLOUD   INSTANCE        vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN   
I 05-08 13:21:04 optimizer.py:912] --------------------------------------------------------------------------------------------
I 05-08 13:21:04 optimizer.py:912]  GCP     g2-standard-4   4       16        L4:1           us-east4-a    0.70          ✔     
I 05-08 13:21:04 optimizer.py:912] --------------------------------------------------------------------------------------------
I 05-08 13:21:04 optimizer.py:912] 
Launching a new cluster 'sky-21d6-firstname.lastname'. Proceed? [Y/n]: Y
I 05-08 13:21:24 cloud_vm_ray_backend.py:4250] Creating a new cluster: 'sky-21d6-firstname.lastname' [1x GCP(g2-standard-4, {'L4': 1})].
I 05-08 13:21:24 cloud_vm_ray_backend.py:4250] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 05-08 13:21:27 cloud_vm_ray_backend.py:1371] To view detailed progress: tail -n100 -f /Users/firstname.lastname/sky_logs/sky-2024-05-08-13-20-58-375021/provision.log
I 05-08 13:21:31 provisioner.py:77] Launching on GCP us-east4 (us-east4-a)
W 05-08 13:21:46 instance_utils.py:112] Got return code 'invalid' in us-east4-a: "Invalid value for field 'resource.instanceProperties.labels': ''. Label value 'firstname.lastname' violates format constraints. The value can only contain lowercase letters, numeric characters, underscores and dashes. The value can be at most 63 characters long. International characters are allowed"
W 05-08 13:21:49 cloud_vm_ray_backend.py:2036] sky.exceptions.ResourcesUnavailableError: Failed to acquire resources in us-east4-a. Try changing resource requirements or use another zone.
W 05-08 13:21:49 cloud_vm_ray_backend.py:2045] 
W 05-08 13:21:49 cloud_vm_ray_backend.py:2045] Provision failed for 1x GCP(g2-standard-4, {'L4': 1}) in us-east4-a. Trying other locations (if any).

I'm not sure where this value is set in SkyPilot, but could we avoid this error by escaping the period and any other invalid characters?
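To illustrate the kind of escaping meant here, below is a minimal sketch that rewrites a string to satisfy the constraints quoted in the error message (lowercase letters, digits, underscores, and dashes, at most 63 characters). The helper name sanitize_label_value and the choice of a dash as the replacement character are assumptions for illustration only; this is not SkyPilot's actual code or its eventual fix. The error message also says international characters are allowed, but this sketch conservatively replaces them too.

```python
import re

# Constraints taken from the GCP error message: lowercase letters, digits,
# underscores and dashes only, and at most 63 characters.
_INVALID_LABEL_CHARS = re.compile(r'[^a-z0-9_-]')
_MAX_LABEL_LENGTH = 63


def sanitize_label_value(value: str, replacement: str = '-') -> str:
    """Hypothetical helper (not part of SkyPilot) that makes a label value valid.

    Lowercases the input, replaces every disallowed character with
    `replacement`, and truncates the result to the maximum length.
    """
    lowered = value.lower()
    cleaned = _INVALID_LABEL_CHARS.sub(replacement, lowered)
    return cleaned[:_MAX_LABEL_LENGTH]


if __name__ == '__main__':
    # The username from the log above, which GCP rejected as a label value.
    print(sanitize_label_value('firstname.lastname'))
```

With something like this applied to the username before it is used as a label value, 'firstname.lastname' would become 'firstname-lastname'. Note that two different usernames could map to the same sanitized value, so a real fix might need to handle collisions.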

Version & Commit info:

concretevitamin commented 4 months ago

That's a great catch @dzlab! We should fix it. As a workaround, you could give the cluster a name by passing, say, -c finetune to sky launch.

dzlab commented 4 months ago

@concretevitamin thanks for the hint, but unfortunately I still get the same error, which prevents instance creation. Here is the log; you can see it is using finetune as the cluster name, but it is still not able to create instances:

Task from YAML spec: axolotl.yaml
I 05-08 22:14:13 optimizer.py:694] == Optimizer ==
I 05-08 22:14:13 optimizer.py:705] Target: minimizing cost
I 05-08 22:14:13 optimizer.py:717] Estimated cost: $3.0 / hour
I 05-08 22:14:13 optimizer.py:717] 
I 05-08 22:14:13 optimizer.py:842] Considered resources (1 node):
I 05-08 22:14:13 optimizer.py:912] ---------------------------------------------------------------------------------------------
I 05-08 22:14:13 optimizer.py:912]  CLOUD   INSTANCE       vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE     COST ($)   CHOSEN   
I 05-08 22:14:13 optimizer.py:912] ---------------------------------------------------------------------------------------------
I 05-08 22:14:13 optimizer.py:912]  GCP     n1-highmem-8   8       52        L4:1         us-central1-a   2.95          ✔     
I 05-08 22:14:13 optimizer.py:912] ---------------------------------------------------------------------------------------------
I 05-08 22:14:13 optimizer.py:912] 
Launching a new cluster 'finetune'. Proceed? [Y/n]: Y
I 05-08 22:14:26 cloud_vm_ray_backend.py:4250] Creating a new cluster: 'finetune' [1x GCP(n1-highmem-8, {'L4': 1})].
I 05-08 22:14:26 cloud_vm_ray_backend.py:4250] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 05-08 22:14:29 cloud_vm_ray_backend.py:1371] To view detailed progress: tail -n100 -f /Users/firstname.lastname/sky_logs/sky-2024-05-08-22-14-09-156922/provision.log
I 05-08 22:14:34 provisioner.py:77] Launching on GCP us-central1 (us-central1-a)
W 05-08 22:14:50 instance_utils.py:112] Got return code 'invalid' in us-central1-a: "Invalid value for field 'resource.instanceProperties.labels': ''. Label value 'firstname.lastname' violates format constraints. The value can only contain lowercase letters, numeric characters, underscores and dashes. The value can be at most 63 characters long. International characters are allowed"
W 05-08 22:14:52 cloud_vm_ray_backend.py:2036] sky.exceptions.ResourcesUnavailableError: Failed to acquire resources in us-central1-a. Try changing resource requirements or use another zone.
W 05-08 22:14:52 cloud_vm_ray_backend.py:2045] 
W 05-08 22:14:52 cloud_vm_ray_backend.py:2045] Provision failed for 1x GCP(n1-highmem-8, {'L4': 1}) in us-central1-a. Trying other locations (if any).

Michaelvll commented 4 months ago

Thanks for reporting this @dzlab! It does indeed seem to be an issue with the username containing invalid characters. I just submitted a PR for this. We will try to get it merged soon.