sachin31198 closed this issue 4 months ago
Hi @sachin31198, I added the Paperspace integration. There was a bad Cloudflare rule that was flagging SkyPilot's startup scripts and blocking any subsequent requests. I reached out to the team and they patched it on their end. I'm now able to launch instances on Paperspace on my end, but can you let me know if it works for you?
Hi @asaiacai, thanks for this, but I am now facing another issue while trying to provision an A100-80GB on Paperspace. Please find the logs and the config below:
I 05-01 18:43:17 optimizer.py:716] Estimated cost: $3.2 / hour
I 05-01 18:43:17 optimizer.py:716]
I 05-01 18:43:17 optimizer.py:839] Considered resources (1 node):
I 05-01 18:43:17 optimizer.py:909] -------------------------------------------------------------------------------------------------
I 05-01 18:43:17 optimizer.py:909] CLOUD        INSTANCE   vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE        COST ($)   CHOSEN
I 05-01 18:43:17 optimizer.py:909] -------------------------------------------------------------------------------------------------
I 05-01 18:43:17 optimizer.py:909] Paperspace   A100-80G   12      80        A100-80GB:1    East Coast (NY2)   3.18       ✔
I 05-01 18:43:17 optimizer.py:909] -------------------------------------------------------------------------------------------------
I 05-01 18:43:17 optimizer.py:909]
Launching a new cluster 'axolotl'. Proceed? [Y/n]: Y
I 05-01 18:43:20 cloud_vm_ray_backend.py:4237] Creating a new cluster: 'axolotl' [1x Paperspace(A100-80G, {'A100-80GB': 1})].
I 05-01 18:43:20 cloud_vm_ray_backend.py:4237] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 05-01 18:43:20 cloud_vm_ray_backend.py:1363] To view detailed progress: tail -n100 -f /home/vscode/sky_logs/sky-2024-05-01-18-43-17-454001/provision.log
I 05-01 18:43:22 provisioner.py:77] Launching on Paperspace East Coast (NY2) (all zones)
W 05-01 18:43:22 config.py:24] Paperspace only supports disk sizes[100, 250, 500, 1000, 2000], upsizing from 256 to 500
W 05-01 18:43:36 instance.py:143] run_instances error: BAD_REQUEST: Template not available for machine type.
W 05-01 18:43:43 cloud_vm_ray_backend.py:2028] sky.exceptions.ResourcesUnavailableError: Failed to acquire resources in all zones in East Coast (NY2). Try changing resource requirements or use another region.
W 05-01 18:43:43 cloud_vm_ray_backend.py:2037]
W 05-01 18:43:43 cloud_vm_ray_backend.py:2037] Provision failed for 1x Paperspace(A100-80G, {'A100-80GB': 1}) in East Coast (NY2). Trying other locations (if any).
Clusters
No existing clusters.
sky.exceptions.ResourcesUnavailableError: Failed to provision all possible launchable resources. Relax the task's resource requirements: 1x Paperspace({'A100-80GB': 1})
To keep retrying until the cluster is up, use the `--retry-until-up` flag.
name: llama3-llm-domain-adaptation
num_nodes: 1
resources:
  accelerators: A100-80GB
  cloud: paperspace
  region: east coast (ny2)
workdir: train
file_mounts:
  /datasets: ./datasets
setup: |
  docker pull winglian/axolotl:main-py3.10-cu118-2.0.1
run: |
  docker run --gpus all \
    -v ~/sky_workdir:/sky_workdir \
    -v /root/.cache:/root/.cache \
    -v /datasets:/datasets \
    -v /output:/output \
    winglian/axolotl:main-py3.10-cu118-2.0.1 \
    accelerate launch -m axolotl.cli.train /sky_workdir/models/mistral/domain-adapt.yaml
envs:
  HF_TOKEN:
  BUCKET:
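A side note on the `config.py` warning in the log above: the requested 256 GB disk is not one of the sizes Paperspace offers, so it is rounded up to 500. The sketch below illustrates that rounding, assuming the rule is "smallest supported size that is at least the requested size". The helper name is hypothetical and this is not SkyPilot's actual code.

```python
import bisect

# Disk sizes (GB) copied from the warning in the log above.
SUPPORTED_DISK_SIZES = [100, 250, 500, 1000, 2000]


def round_up_disk_size(requested_gb: int) -> int:
    """Return the smallest supported size >= requested_gb (hypothetical helper)."""
    idx = bisect.bisect_left(SUPPORTED_DISK_SIZES, requested_gb)
    if idx == len(SUPPORTED_DISK_SIZES):
        raise ValueError(f"{requested_gb} GB exceeds the largest supported size")
    return SUPPORTED_DISK_SIZES[idx]


print(round_up_disk_size(256))  # 500, matching the log
```

If the larger disk is not wanted, requesting a supported size explicitly (for example `disk_size: 250` under `resources`) should avoid the upsizing.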
@asaiacai, I have been getting this error with every GPU that Paperspace has to offer. The account credentials are the same as before, when they were working:
I 05-02 11:44:01 optimizer.py:693] == Optimizer ==
I 05-02 11:44:01 optimizer.py:716] Estimated cost: $2.3 / hour
I 05-02 11:44:01 optimizer.py:716]
I 05-02 11:44:01 optimizer.py:839] Considered resources (1 node):
I 05-02 11:44:01 optimizer.py:909] -------------------------------------------------------------------------------------------------
I 05-02 11:44:01 optimizer.py:909] CLOUD        INSTANCE   vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE        COST ($)   CHOSEN
I 05-02 11:44:01 optimizer.py:909] -------------------------------------------------------------------------------------------------
I 05-02 11:44:01 optimizer.py:909] Paperspace   V100-32G   8       32        V100-32GB:1    East Coast (NY2)   2.30       ✔
I 05-02 11:44:01 optimizer.py:909] -------------------------------------------------------------------------------------------------
I 05-02 11:44:01 optimizer.py:909]
Launching a new cluster 'axolotl'. Proceed? [Y/n]: Y
I 05-02 11:44:03 cloud_vm_ray_backend.py:4237] Creating a new cluster: 'axolotl' [1x Paperspace(V100-32G, {'V100-32GB': 1})].
I 05-02 11:44:03 cloud_vm_ray_backend.py:4237] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 05-02 11:44:03 cloud_vm_ray_backend.py:1363] To view detailed progress: tail -n100 -f /home/vscode/sky_logs/sky-2024-05-02-11-44-01-052697/provision.log
I 05-02 11:44:03 provisioner.py:77] Launching on Paperspace East Coast (NY2) (all zones)
W 05-02 11:44:03 config.py:24] Paperspace only supports disk sizes[100, 250, 500, 1000, 2000], upsizing from 256 to 500
W 05-02 11:44:09 instance.py:143] run_instances error: BAD_REQUEST: User not authorized for requested machine type and template combination.
W 05-02 11:44:16 cloud_vm_ray_backend.py:2028] sky.exceptions.ResourcesUnavailableError: Failed to acquire resources in all zones in East Coast (NY2). Try changing resource requirements or use another region.
W 05-02 11:44:16 cloud_vm_ray_backend.py:2037]
W 05-02 11:44:16 cloud_vm_ray_backend.py:2037] Provision failed for 1x Paperspace(V100-32G, {'V100-32GB': 1}) in East Coast (NY2). Trying other locations (if any).
Clusters
No existing clusters.
sky.exceptions.ResourcesUnavailableError: Failed to provision all possible launchable resources. Relax the task's resource requirements: 1x Paperspace({'V100-32GB': 1})
To keep retrying until the cluster is up, use the `--retry-until-up` flag.
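Note that the two runs fail with different messages from the Paperspace API: the A100-80G run reports "Template not available for machine type", while the V100-32G run reports "User not authorized for requested machine type and template combination". A small sketch for pulling those lines out of pasted log output, assuming the `run_instances error: <CODE>: <message>` format seen above (the helper name is hypothetical):

```python
import re

# Matches lines such as:
#   W 05-02 11:44:09 instance.py:143] run_instances error: BAD_REQUEST: <message>
ERROR_RE = re.compile(r"run_instances error: (?P<code>[A-Z_]+): (?P<message>.+)$")


def extract_provision_errors(log_text: str) -> list[tuple[str, str]]:
    """Return (code, message) pairs for every run_instances error line."""
    errors = []
    for line in log_text.splitlines():
        match = ERROR_RE.search(line)
        if match:
            errors.append((match["code"], match["message"].strip()))
    return errors


# Two lines copied from the logs in this thread.
sample = """\
W 05-01 18:43:36 instance.py:143] run_instances error: BAD_REQUEST: Template not available for machine type.
W 05-02 11:44:09 instance.py:143] run_instances error: BAD_REQUEST: User not authorized for requested machine type and template combination.
"""

for code, message in extract_provision_errors(sample):
    print(code, "->", message)
```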
@sachin31198 Can you run the following and paste the output here? I made a PR to update the machine template IDs, but I want to double-check that the ones I'm referencing are the same as what's visible to you. Also, just to confirm: are you able to launch the same instances in the console?
curl -X GET 'https://api.paperspace.io/templates/getTemplates' -H 'X-Api-Key: <PAPERSPACE_API_KEY>'
[{"id":"tz0ireoj","name":"paperspace/tz0ireoj","label":"Ubuntu 20.04 Desktop","os":"Ubuntu 20.04 Desktop","dtCreated":"2021-10-21T05:52:13.579Z"},{"id":"tv00h6iv","name":"paperspace/tv00h6iv","label":"Windows 2012 R2 Grid","os":"Windows 2012 R2 - Licensed","dtCreated":"2020-07-30T17:34:04.385Z"},{"id":"ta1b3le7","name":"paperspace/ta1b3le7","label":"Windows 10 Pro","os":"Windows 10 (Pro) - Unlicensed","dtCreated":"2019-05-31T15:46:57.423Z"},{"id":"tnr2oh1m","name":"paperspace/tnr2oh1m","label":"Windows 10","os":"Windows 10 (Server 2022) - Licensed","dtCreated":"2021-04-08T16:39:03.908Z"},{"id":"txlizc2f","name":"paperspace/txlizc2f","label":"Parsec","os":"Windows 10 (Server 2022) - Licensed (Parsec)","dtCreated":"2022-02-17T02:59:55.333Z"},{"id":"t9taj00e","name":"paperspace/t9taj00e","label":null,"os":"Centos 7 Server","dtCreated":"2022-01-23T20:47:06.707Z"},{"id":"t04azgph","name":"paperspace/t04azgph","label":"Ubuntu 18.04 Server","os":"Ubuntu 18.04 Server","dtCreated":"2018-06-15T06:00:34.531Z"},{"id":"tmun4o2g","name":"paperspace/tmun4o2g","label":"Ubuntu 22.04 GPU Worker","os":"Ubuntu 22.04 Server","dtCreated":"2020-06-21T20:35:00.467Z"},{"id":"t0nspur5","name":"paperspace/t0nspur5","label":"Ubuntu 22.04 Server","os":"Ubuntu 22.04 Server","dtCreated":"2021-10-20T00:50:49.780Z"},{"id":"tkni3aa4","name":"paperspace/tkni3aa4","label":"Ubuntu 20.04 Server","os":"Ubuntu 20.04 Server","dtCreated":"2021-10-20T00:50:49.780Z"},{"id":"tpi7gqht","name":"paperspace/tpi7gqht","label":"Ubuntu 22.04 CPU Worker","os":"Ubuntu 22.04 Server","dtCreated":"2020-07-08T18:18:08.248Z"},{"id":"tvimtol9","name":"paperspace/tvimtol9","label":"Ubuntu 22.04 ML in a Box","os":"Ubuntu 22.04 MLiaB","dtCreated":"2023-12-20T08:40:41.889Z"},{"id":"tqqsxr6b","name":"paperspace/tqqsxr6b","label":"Ubuntu 22.04 ML in a Box","os":"Ubuntu 22.04 MLiaB","dtCreated":"2024-04-16T22:57:20.194Z"},{"id":"t7vp562h","name":"paperspace/t7vp562h","label":"Ubuntu 22.04 ML in a Box","os":"Ubuntu 22.04 MLiaB","dtCreated":"2024-04-16T23:09:40.234Z"},{"id":"t5dzjumv","name":"paperspace/t5dzjumv","label":"Ubuntu 22.04 GPU Worker","os":"Ubuntu 22.04 Server","dtCreated":"2024-04-17T17:47:50.351Z"},{"id":"twnlo3zj","name":"paperspace/twnlo3zj","label":"Ml in a Box 20.04","os":"Ubuntu 20.04 MLiaB","dtCreated":"2021-10-14T23:50:00.225Z"},{"id":"tilqt47t","name":"paperspace/tilqt47t","label":"Ubuntu 22.04 ML in a Box","os":"Ubuntu 22.04 MLiaB","dtCreated":"2024-04-16T22:29:22.956Z"},{"id":"taoz1uxr","name":"paperspace/taoz1uxr","label":"Windows 10","os":"Windows 10 (Server 2022) - Licensed","dtCreated":"2019-02-08T18:00:34.729Z"},{"id":"tk9izniv","name":"paperspace/tk9izniv","label":"Windows 10","os":"Windows 10 (Server 2022) - Licensed","dtCreated":"2019-02-08T17:59:04.036Z"},{"id":"tl1h5hec","name":"paperspace/tl1h5hec","label":"Windows 10 Pro","os":"Windows 10 (Pro) - Unlicensed","dtCreated":"2019-05-16T17:41:22.919Z"},{"id":"tnupxjzz","name":"tnupxjzz","label":"Gateway-Template","os":"","teamId":"t2pgbhamnt","userId":null,"region":"East Coast (NY2)","dtCreated":"2023-10-30T09:11:44.319Z"},{"id":"twzr3fi5","name":"twzr3fi5","label":"AI-Dev-Ubuntu-22.04-CUDA-Docker-Pyenv-PDM","os":"","teamId":"t2pgbhamnt","userId":null,"region":"East Coast (NY2)","dtCreated":"2023-10-13T06:35:25.896Z"},{"id":"tgmpix4t","name":"tgmpix4t","label":"AI-Dev-Ubuntu-20.04-CUDA-Docker-Pyenv-Poetry","os":"","teamId":"t2pgbhamnt","userId":null,"region":"East Coast (NY2)","dtCreated":"2023-08-01T19:14:18.235Z"}]
@asaiacai This is the output after running the curl.
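For anyone comparing template IDs against a response like this one, a short script can make it easier to read than the raw JSON. This is only a sketch: it assumes the response is a JSON array of objects with `id`, `label`, and `os` fields, as in the output above, and the function name and the `os_prefix` filter are illustrative choices.

```python
import json


def summarize_templates(raw_json: str, os_prefix: str = "Ubuntu") -> list[tuple[str, str, str]]:
    """Return (id, label, os) for templates whose `os` starts with os_prefix."""
    rows = []
    for template in json.loads(raw_json):
        os_name = template.get("os") or ""  # `os` may be empty or missing
        if os_name.startswith(os_prefix):
            rows.append((template["id"], template.get("label") or "", os_name))
    return rows


# Three entries copied from the response pasted in this thread.
sample = """[
  {"id": "tmun4o2g", "name": "paperspace/tmun4o2g", "label": "Ubuntu 22.04 GPU Worker", "os": "Ubuntu 22.04 Server", "dtCreated": "2020-06-21T20:35:00.467Z"},
  {"id": "t9taj00e", "name": "paperspace/t9taj00e", "label": null, "os": "Centos 7 Server", "dtCreated": "2022-01-23T20:47:06.707Z"},
  {"id": "t5dzjumv", "name": "paperspace/t5dzjumv", "label": "Ubuntu 22.04 GPU Worker", "os": "Ubuntu 22.04 Server", "dtCreated": "2024-04-17T17:47:50.351Z"}
]"""

for row in summarize_templates(sample):
    print(row)
```

Team-owned templates in the response have an empty `os` field, so this particular filter skips them; filtering on `label` instead would be needed to include them.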
@asaiacai It was an account-related issue on my end. Thanks for your help.
It seems that SkyPilot no longer works with Paperspace. I keep running into the following error while trying to launch a training job with SkyPilot on the Paperspace backend:
Version & Commit info:
sky -v: 1.0.0.dev20240421
sky -c: 18fc79b98b974ae10ccdeebb0ab6cc7f9792795