skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

[Clouds] Launch Error with Paperspace Backend #3498

Closed sachin31198 closed 4 months ago

sachin31198 commented 4 months ago

It seems that SkyPilot no longer works with the Paperspace cloud. I keep running into the following error while trying to launch a training job with SkyPilot and the Paperspace backend:

I 04-29 11:56:23 optimizer.py:716] Estimated cost: $3.2 / hour
I 04-29 11:56:23 optimizer.py:716] 
I 04-29 11:56:23 optimizer.py:839] Considered resources (1 node):
I 04-29 11:56:23 optimizer.py:909] -------------------------------------------------------------------------------------------------
I 04-29 11:56:23 optimizer.py:909]  CLOUD        INSTANCE   vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE        COST ($)   CHOSEN   
I 04-29 11:56:23 optimizer.py:909] -------------------------------------------------------------------------------------------------
I 04-29 11:56:23 optimizer.py:909]  Paperspace   A100-80G   12      80        A100-80GB:1    East Coast (NY2)   3.18          ✔     
I 04-29 11:56:23 optimizer.py:909] -------------------------------------------------------------------------------------------------
I 04-29 11:56:23 optimizer.py:909] 
Launching a new cluster 'axolotl'. Proceed? [Y/n]: Y
I 04-29 11:56:25 cloud_vm_ray_backend.py:4237] Creating a new cluster: 'axolotl' [1x Paperspace(A100-80G, {'A100-80GB': 1})].
I 04-29 11:56:25 cloud_vm_ray_backend.py:4237] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 04-29 11:56:25 cloud_vm_ray_backend.py:1363] To view detailed progress: tail -n100 -f /home/vscode/sky_logs/sky-2024-04-29-11-56-23-831638/provision.log
I 04-29 11:56:26 provisioner.py:77] Launching on Paperspace East Coast (NY2) (all zones)
W 04-29 11:56:26 config.py:24] Paperspace only supports disk sizes[100, 250, 500, 1000, 2000], upsizing from 256 to 500
E 04-29 11:56:29 provisioner.py:92] Failed to configure 'axolotl' on Paperspace Region(name='East Coast (NY2)') (all zones) with the following error:
E 04-29 11:56:29 provisioner.py:92] sky.provision.paperspace.utils.PaperspaceCloudError: Response cannot be parsed into JSON. Status code: 403; reason: Forbidden; content: <!DOCTYPE html>
E 04-29 11:56:29 provisioner.py:92] <!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en-US"> <![endif]-->
E 04-29 11:56:29 provisioner.py:92] <!--[if IE 7]>    <html class="no-js ie7 oldie" lang="en-US"> <![endif]-->
E 04-29 11:56:29 provisioner.py:92] <!--[if IE 8]>    <html class="no-js ie8 oldie" lang="en-US"> <![endif]-->
E 04-29 11:56:29 provisioner.py:92] <!--[if gt IE 8]><!--> <html class="no-js" lang="en-US"> <!--<![endif]-->
E 04-29 11:56:29 provisioner.py:92] <head>
E 04-29 11:56:29 provisioner.py:92] <title>Attention Required! | Cloudflare</title>
E 04-29 11:56:29 provisioner.py:92] <meta charset="UTF-8" />
E 04-29 11:56:29 provisioner.py:92] <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
E 04-29 11:56:29 provisioner.py:92] <meta http-equiv="X-UA-Compatible" content="IE=Edge" />
E 04-29 11:56:29 provisioner.py:92] <meta name="robots" content="noindex, nofollow" />
E 04-29 11:56:29 provisioner.py:92] <meta name="viewport" content="width=device-width,initial-scale=1" />
E 04-29 11:56:29 provisioner.py:92] <link rel="stylesheet" id="cf_styles-css" href="/cdn-cgi/styles/cf.errors.css" />
E 04-29 11:56:29 provisioner.py:92] <!--[if lt IE 9]><link rel="stylesheet" id='cf_styles-ie-css' href="/cdn-cgi/styles/cf.errors.ie.css" /><![endif]-->
E 04-29 11:56:29 provisioner.py:92] <style>body{margin:0;padding:0}</style>
E 04-29 11:56:29 provisioner.py:92] <!--[if gte IE 10]><!-->
E 04-29 11:56:29 provisioner.py:92] <script>
E 04-29 11:56:29 provisioner.py:92]   if (!navigator.cookieEnabled) {
E 04-29 11:56:29 provisioner.py:92]     window.addEventListener('DOMContentLoaded', function () {
E 04-29 11:56:29 provisioner.py:92]       var cookieEl = document.getElementById('cookie-alert');
E 04-29 11:56:29 provisioner.py:92]       cookieEl.style.display = 'block';
E 04-29 11:56:29 provisioner.py:92]     })
E 04-29 11:56:29 provisioner.py:92]   }
E 04-29 11:56:29 provisioner.py:92] </script>
E 04-29 11:56:29 provisioner.py:92] <!--<![endif]-->
E 04-29 11:56:29 provisioner.py:92] </head>
E 04-29 11:56:29 provisioner.py:92] <body>
E 04-29 11:56:29 provisioner.py:92] <div id="cf-wrapper">
E 04-29 11:56:29 provisioner.py:92] <div class="cf-alert cf-alert-error cf-cookie-error" id="cookie-alert" data-translate="enable_cookies">Please enable cookies.</div>
E 04-29 11:56:29 provisioner.py:92] <div id="cf-error-details" class="cf-error-details-wrapper">
E 04-29 11:56:29 provisioner.py:92] <div class="cf-wrapper cf-header cf-error-overview">
E 04-29 11:56:29 provisioner.py:92] <h1 data-translate="block_headline">Sorry, you have been blocked</h1>
E 04-29 11:56:29 provisioner.py:92] <h2 class="cf-subheadline"><span data-translate="unable_to_access">You are unable to access</span> paperspace.com</h2>
E 04-29 11:56:29 provisioner.py:92] </div>
E 04-29 11:56:29 provisioner.py:92] <div class="cf-section cf-highlight">
E 04-29 11:56:29 provisioner.py:92] <div class="cf-wrapper">
E 04-29 11:56:29 provisioner.py:92] <div class="cf-screenshot-container cf-screenshot-full">
E 04-29 11:56:29 provisioner.py:92] <span class="cf-no-screenshot error"></span>
E 04-29 11:56:29 provisioner.py:92] </div>
E 04-29 11:56:29 provisioner.py:92] </div>
E 04-29 11:56:29 provisioner.py:92] </div>
E 04-29 11:56:29 provisioner.py:92] <div class="cf-section cf-wrapper">
E 04-29 11:56:29 provisioner.py:92] <div class="cf-columns two">
E 04-29 11:56:29 provisioner.py:92] <div class="cf-column">
E 04-29 11:56:29 provisioner.py:92] <h2 data-translate="blocked_why_headline">Why have I been blocked?</h2>
E 04-29 11:56:29 provisioner.py:92] <p data-translate="blocked_why_detail">This website is using a security service to protect itself from online attacks. The action you just performed triggered the security solution. There are several actions that could trigger this block including submitting a certain word or phrase, a SQL command or malformed data.</p>
E 04-29 11:56:29 provisioner.py:92] </div>
E 04-29 11:56:29 provisioner.py:92] <div class="cf-column">
E 04-29 11:56:29 provisioner.py:92] <h2 data-translate="blocked_resolve_headline">What can I do to resolve this?</h2>
E 04-29 11:56:29 provisioner.py:92] <p data-translate="blocked_resolve_detail">You can email the site owner to let them know you were blocked. Please include what you were doing when this page came up and the Cloudflare Ray ID found at the bottom of this page.</p>
E 04-29 11:56:29 provisioner.py:92] </div>
E 04-29 11:56:29 provisioner.py:92] </div>
E 04-29 11:56:29 provisioner.py:92] </div>
E 04-29 11:56:29 provisioner.py:92] <div class="cf-error-footer cf-wrapper w-240 lg:w-full py-10 sm:py-4 sm:px-8 mx-auto text-center sm:text-left border-solid border-0 border-t border-gray-300">
E 04-29 11:56:29 provisioner.py:92] <p class="text-13">
E 04-29 11:56:29 provisioner.py:92] <span class="cf-footer-item sm:block sm:mb-1">Cloudflare Ray ID: <strong class="font-semibold">87bf1bebafa93c01</strong></span>
E 04-29 11:56:29 provisioner.py:92] <span class="cf-footer-separator sm:hidden">&bull;</span>
E 04-29 11:56:29 provisioner.py:92] <span id="cf-footer-item-ip" class="cf-footer-item hidden sm:block sm:mb-1">
E 04-29 11:56:29 provisioner.py:92] Your IP:
E 04-29 11:56:29 provisioner.py:92] <button type="button" id="cf-footer-ip-reveal" class="cf-footer-ip-reveal-btn">Click to reveal</button>
E 04-29 11:56:29 provisioner.py:92] <span class="hidden" id="cf-footer-ip">106.51.82.164</span>
E 04-29 11:56:29 provisioner.py:92] <span class="cf-footer-separator sm:hidden">&bull;</span>
E 04-29 11:56:29 provisioner.py:92] </span>
E 04-29 11:56:29 provisioner.py:92] <span class="cf-footer-item sm:block sm:mb-1"><span>Performance &amp; security by</span> <a rel="noopener noreferrer" href="https://www.cloudflare.com/5xx-error-landing" id="brand_link" target="_blank">Cloudflare</a></span>
E 04-29 11:56:29 provisioner.py:92] </p>
E 04-29 11:56:29 provisioner.py:92] <script>(function(){function d(){var b=a.getElementById("cf-footer-item-ip"),c=a.getElementById("cf-footer-ip-reveal");b&&"classList"in b&&(b.classList.remove("hidden"),c.addEventListener("click",function(){c.classList.add("hidden");a.getElementById("cf-footer-ip").classList.remove("hidden")}))}var a=document;document.addEventListener&&a.addEventListener("DOMContentLoaded",d)})();</script>
E 04-29 11:56:29 provisioner.py:92] </div>
E 04-29 11:56:29 provisioner.py:92] </div>
E 04-29 11:56:29 provisioner.py:92] </div>
E 04-29 11:56:29 provisioner.py:92] <script>
E 04-29 11:56:29 provisioner.py:92]   window._cf_translation = {};
E 04-29 11:56:29 provisioner.py:92]   
E 04-29 11:56:29 provisioner.py:92]   
E 04-29 11:56:29 provisioner.py:92] </script>
E 04-29 11:56:29 provisioner.py:92] </body>
E 04-29 11:56:29 provisioner.py:92] </html>
E 04-29 11:56:29 provisioner.py:92] 
W 04-29 11:56:36 cloud_vm_ray_backend.py:2028] sky.exceptions.ResourcesUnavailableError: Failed to acquire resources in all zones in East Coast (NY2). Try changing resource requirements or use another region.
W 04-29 11:56:36 cloud_vm_ray_backend.py:2037] 
W 04-29 11:56:36 cloud_vm_ray_backend.py:2037] Provision failed for 1x Paperspace(A100-80G, {'A100-80GB': 1}) in East Coast (NY2). Trying other locations (if any).
Clusters
No existing clusters.

sky.exceptions.ResourcesUnavailableError: Failed to provision all possible launchable resources. Relax the task's resource requirements: 1x Paperspace({'A100-80GB': 1})
To keep retrying until the cluster is up, use the `--retry-until-up` flag.
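
The `config.py` warning in the log above ("upsizing from 256 to 500") is consistent with rounding the requested disk size up to the next size Paperspace supports. A minimal sketch of that rounding, assuming the supported sizes listed in the warning; the function name `round_up_disk_size` is hypothetical and this is not SkyPilot's actual implementation:

```python
import bisect

# Supported sizes (GB), copied from the warning message in the log.
SUPPORTED_DISK_SIZES_GB = [100, 250, 500, 1000, 2000]


def round_up_disk_size(requested_gb: int) -> int:
    """Return the smallest supported disk size that is >= requested_gb.

    Hypothetical helper for illustration only.
    """
    # bisect_left finds the index of the first supported size >= requested_gb.
    idx = bisect.bisect_left(SUPPORTED_DISK_SIZES_GB, requested_gb)
    if idx == len(SUPPORTED_DISK_SIZES_GB):
        raise ValueError(
            f'Requested {requested_gb} GB exceeds the largest supported '
            f'size ({SUPPORTED_DISK_SIZES_GB[-1]} GB).')
    return SUPPORTED_DISK_SIZES_GB[idx]


print(round_up_disk_size(256))  # 500, matching the upsizing in the log
```

The warning is therefore unrelated to the launch failure; the 403 response that follows it is the actual error.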

Version & Commit info:

asaiacai commented 4 months ago

Hi @sachin31198, I added the Paperspace integration. A bad Cloudflare rule was flagging SkyPilot's startup scripts and blocking any future requests. I reached out to the team and they patched it on their end. I'm now able to launch instances on Paperspace on my end; can you let me know if it works for you?
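
For anyone who hits a similar "Response cannot be parsed into JSON" error: the body in the first log is a Cloudflare block page, not an API error. Below is a rough sketch of how a client could tell the two apart and surface the Ray ID. The names `ApiResponseError` and `parse_api_response` are hypothetical and are not part of SkyPilot or the Paperspace API; the Ray ID pattern is assumed from the HTML shown in the log.

```python
import json
import re


class ApiResponseError(Exception):
    """Hypothetical error type for this sketch."""


def parse_api_response(status_code: int, reason: str, body: str):
    """Parse a JSON API response, with a clearer error for Cloudflare blocks."""
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        pass  # Not JSON; fall through and classify the failure.

    if status_code == 403 and 'Cloudflare' in body:
        # Pattern assumed from the block page in the log above.
        match = re.search(
            r'Cloudflare Ray ID: <strong[^>]*>([0-9a-f]+)</strong>', body)
        ray_id = match.group(1) if match else 'unknown'
        raise ApiResponseError(
            f'Request blocked by Cloudflare (HTTP 403, Ray ID {ray_id}). '
            'This is a firewall block, not a capacity or credentials error.')

    raise ApiResponseError(
        f'Response cannot be parsed into JSON. '
        f'Status code: {status_code}; reason: {reason}')
```

The Ray ID is the value the block page asks you to include when contacting the site owner, so surfacing it directly saves digging through the HTML.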

sachin31198 commented 4 months ago

Hi @asaiacai, thanks for this, but I am now facing another issue while trying to provision an A100-80GB on Paperspace. Please find the logs and the config below:

I 05-01 18:43:17 optimizer.py:716] Estimated cost: $3.2 / hour
I 05-01 18:43:17 optimizer.py:716] 
I 05-01 18:43:17 optimizer.py:839] Considered resources (1 node):
I 05-01 18:43:17 optimizer.py:909] -------------------------------------------------------------------------------------------------
I 05-01 18:43:17 optimizer.py:909]  CLOUD        INSTANCE   vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE        COST ($)   CHOSEN   
I 05-01 18:43:17 optimizer.py:909] -------------------------------------------------------------------------------------------------
I 05-01 18:43:17 optimizer.py:909]  Paperspace   A100-80G   12      80        A100-80GB:1    East Coast (NY2)   3.18          ✔     
I 05-01 18:43:17 optimizer.py:909] -------------------------------------------------------------------------------------------------
I 05-01 18:43:17 optimizer.py:909] 
Launching a new cluster 'axolotl'. Proceed? [Y/n]: Y
I 05-01 18:43:20 cloud_vm_ray_backend.py:4237] Creating a new cluster: 'axolotl' [1x Paperspace(A100-80G, {'A100-80GB': 1})].
I 05-01 18:43:20 cloud_vm_ray_backend.py:4237] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 05-01 18:43:20 cloud_vm_ray_backend.py:1363] To view detailed progress: tail -n100 -f /home/vscode/sky_logs/sky-2024-05-01-18-43-17-454001/provision.log
I 05-01 18:43:22 provisioner.py:77] Launching on Paperspace East Coast (NY2) (all zones)
W 05-01 18:43:22 config.py:24] Paperspace only supports disk sizes[100, 250, 500, 1000, 2000], upsizing from 256 to 500
W 05-01 18:43:36 instance.py:143] run_instances error: BAD_REQUEST: Template not available for machine type.
W 05-01 18:43:43 cloud_vm_ray_backend.py:2028] sky.exceptions.ResourcesUnavailableError: Failed to acquire resources in all zones in East Coast (NY2). Try changing resource requirements or use another region.
W 05-01 18:43:43 cloud_vm_ray_backend.py:2037] 
W 05-01 18:43:43 cloud_vm_ray_backend.py:2037] Provision failed for 1x Paperspace(A100-80G, {'A100-80GB': 1}) in East Coast (NY2). Trying other locations (if any).
Clusters
No existing clusters.

sky.exceptions.ResourcesUnavailableError: Failed to provision all possible launchable resources. Relax the task's resource requirements: 1x Paperspace({'A100-80GB': 1})
To keep retrying until the cluster is up, use the `--retry-until-up` flag.

The config:


name: llama3-llm-domain-adaptation

num_nodes: 1

resources:
  accelerators: A100-80GB
  cloud: paperspace 
  region: east coast (ny2)

workdir: train

file_mounts:
  /datasets: ./datasets

setup: |
  docker pull winglian/axolotl:main-py3.10-cu118-2.0.1

run: |
  docker run --gpus all \
    -v ~/sky_workdir:/sky_workdir \
    -v /root/.cache:/root/.cache \
    -v /datasets:/datasets \
    -v /output:/output \
    winglian/axolotl:main-py3.10-cu118-2.0.1 \
    accelerate launch -m axolotl.cli.train /sky_workdir/models/mistral/domain-adapt.yaml

envs:
  HF_TOKEN:
  BUCKET:

sachin31198 commented 4 months ago

@asaiacai, I have been getting this error with every GPU that Paperspace offers. The account credentials are the same ones that were working before:

I 05-02 11:44:01 optimizer.py:693] == Optimizer ==
I 05-02 11:44:01 optimizer.py:716] Estimated cost: $2.3 / hour
I 05-02 11:44:01 optimizer.py:716] 
I 05-02 11:44:01 optimizer.py:839] Considered resources (1 node):
I 05-02 11:44:01 optimizer.py:909] -------------------------------------------------------------------------------------------------
I 05-02 11:44:01 optimizer.py:909]  CLOUD        INSTANCE   vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE        COST ($)   CHOSEN   
I 05-02 11:44:01 optimizer.py:909] -------------------------------------------------------------------------------------------------
I 05-02 11:44:01 optimizer.py:909]  Paperspace   V100-32G   8       32        V100-32GB:1    East Coast (NY2)   2.30          ✔     
I 05-02 11:44:01 optimizer.py:909] -------------------------------------------------------------------------------------------------
I 05-02 11:44:01 optimizer.py:909] 
Launching a new cluster 'axolotl'. Proceed? [Y/n]: Y
I 05-02 11:44:03 cloud_vm_ray_backend.py:4237] Creating a new cluster: 'axolotl' [1x Paperspace(V100-32G, {'V100-32GB': 1})].
I 05-02 11:44:03 cloud_vm_ray_backend.py:4237] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 05-02 11:44:03 cloud_vm_ray_backend.py:1363] To view detailed progress: tail -n100 -f /home/vscode/sky_logs/sky-2024-05-02-11-44-01-052697/provision.log
I 05-02 11:44:03 provisioner.py:77] Launching on Paperspace East Coast (NY2) (all zones)
W 05-02 11:44:03 config.py:24] Paperspace only supports disk sizes[100, 250, 500, 1000, 2000], upsizing from 256 to 500
W 05-02 11:44:09 instance.py:143] run_instances error: BAD_REQUEST: User not authorized for requested machine type and template combination.
W 05-02 11:44:16 cloud_vm_ray_backend.py:2028] sky.exceptions.ResourcesUnavailableError: Failed to acquire resources in all zones in East Coast (NY2). Try changing resource requirements or use another region.
W 05-02 11:44:16 cloud_vm_ray_backend.py:2037] 
W 05-02 11:44:16 cloud_vm_ray_backend.py:2037] Provision failed for 1x Paperspace(V100-32G, {'V100-32GB': 1}) in East Coast (NY2). Trying other locations (if any).
Clusters
No existing clusters.

sky.exceptions.ResourcesUnavailableError: Failed to provision all possible launchable resources. Relax the task's resource requirements: 1x Paperspace({'V100-32GB': 1})
To keep retrying until the cluster is up, use the `--retry-until-up` flag.

asaiacai commented 4 months ago

@sachin31198 Can you run the following and paste the output here? I made a PR to update the machine template IDs, but I want to double-check that the ones I'm referencing are the same as what's visible to you. Also, just to confirm: can you launch the same instances from the console?

curl -X GET 'https://api.paperspace.io/templates/getTemplates' -H 'X-Api-Key: <PAPERSPACE_API_KEY>'

sachin31198 commented 4 months ago

[{"id":"tz0ireoj","name":"paperspace/tz0ireoj","label":"Ubuntu 20.04 Desktop","os":"Ubuntu 20.04 Desktop","dtCreated":"2021-10-21T05:52:13.579Z"},{"id":"tv00h6iv","name":"paperspace/tv00h6iv","label":"Windows 2012 R2 Grid","os":"Windows 2012 R2 - Licensed","dtCreated":"2020-07-30T17:34:04.385Z"},{"id":"ta1b3le7","name":"paperspace/ta1b3le7","label":"Windows 10 Pro","os":"Windows 10 (Pro) - Unlicensed","dtCreated":"2019-05-31T15:46:57.423Z"},{"id":"tnr2oh1m","name":"paperspace/tnr2oh1m","label":"Windows 10","os":"Windows 10 (Server 2022) - Licensed","dtCreated":"2021-04-08T16:39:03.908Z"},{"id":"txlizc2f","name":"paperspace/txlizc2f","label":"Parsec","os":"Windows 10 (Server 2022) - Licensed (Parsec)","dtCreated":"2022-02-17T02:59:55.333Z"},{"id":"t9taj00e","name":"paperspace/t9taj00e","label":null,"os":"Centos 7 Server","dtCreated":"2022-01-23T20:47:06.707Z"},{"id":"t04azgph","name":"paperspace/t04azgph","label":"Ubuntu 18.04 Server","os":"Ubuntu 18.04 Server","dtCreated":"2018-06-15T06:00:34.531Z"},{"id":"tmun4o2g","name":"paperspace/tmun4o2g","label":"Ubuntu 22.04 GPU Worker","os":"Ubuntu 22.04 Server","dtCreated":"2020-06-21T20:35:00.467Z"},{"id":"t0nspur5","name":"paperspace/t0nspur5","label":"Ubuntu 22.04 Server","os":"Ubuntu 22.04 Server","dtCreated":"2021-10-20T00:50:49.780Z"},{"id":"tkni3aa4","name":"paperspace/tkni3aa4","label":"Ubuntu 20.04 Server","os":"Ubuntu 20.04 Server","dtCreated":"2021-10-20T00:50:49.780Z"},{"id":"tpi7gqht","name":"paperspace/tpi7gqht","label":"Ubuntu 22.04 CPU Worker","os":"Ubuntu 22.04 Server","dtCreated":"2020-07-08T18:18:08.248Z"},{"id":"tvimtol9","name":"paperspace/tvimtol9","label":"Ubuntu 22.04 ML in a Box","os":"Ubuntu 22.04 MLiaB","dtCreated":"2023-12-20T08:40:41.889Z"},{"id":"tqqsxr6b","name":"paperspace/tqqsxr6b","label":"Ubuntu 22.04 ML in a Box","os":"Ubuntu 22.04 MLiaB","dtCreated":"2024-04-16T22:57:20.194Z"},{"id":"t7vp562h","name":"paperspace/t7vp562h","label":"Ubuntu 22.04 ML in a Box","os":"Ubuntu 22.04 MLiaB","dtCreated":"2024-04-16T23:09:40.234Z"},{"id":"t5dzjumv","name":"paperspace/t5dzjumv","label":"Ubuntu 22.04 GPU Worker","os":"Ubuntu 22.04 Server","dtCreated":"2024-04-17T17:47:50.351Z"},{"id":"twnlo3zj","name":"paperspace/twnlo3zj","label":"Ml in a Box 20.04","os":"Ubuntu 20.04 MLiaB","dtCreated":"2021-10-14T23:50:00.225Z"},{"id":"tilqt47t","name":"paperspace/tilqt47t","label":"Ubuntu 22.04 ML in a Box","os":"Ubuntu 22.04 MLiaB","dtCreated":"2024-04-16T22:29:22.956Z"},{"id":"taoz1uxr","name":"paperspace/taoz1uxr","label":"Windows 10","os":"Windows 10 (Server 2022) - Licensed","dtCreated":"2019-02-08T18:00:34.729Z"},{"id":"tk9izniv","name":"paperspace/tk9izniv","label":"Windows 10","os":"Windows 10 (Server 2022) - Licensed","dtCreated":"2019-02-08T17:59:04.036Z"},{"id":"tl1h5hec","name":"paperspace/tl1h5hec","label":"Windows 10 Pro","os":"Windows 10 (Pro) - Unlicensed","dtCreated":"2019-05-16T17:41:22.919Z"},{"id":"tnupxjzz","name":"tnupxjzz","label":"Gateway-Template","os":"","teamId":"t2pgbhamnt","userId":null,"region":"East Coast (NY2)","dtCreated":"2023-10-30T09:11:44.319Z"},{"id":"twzr3fi5","name":"twzr3fi5","label":"AI-Dev-Ubuntu-22.04-CUDA-Docker-Pyenv-PDM","os":"","teamId":"t2pgbhamnt","userId":null,"region":"East Coast (NY2)","dtCreated":"2023-10-13T06:35:25.896Z"},{"id":"tgmpix4t","name":"tgmpix4t","label":"AI-Dev-Ubuntu-20.04-CUDA-Docker-Pyenv-Poetry","os":"","teamId":"t2pgbhamnt","userId":null,"region":"East Coast (NY2)","dtCreated":"2023-08-01T19:14:18.235Z"}]

@asaiacai This is the output after running the curl.
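
Since the raw `getTemplates` output is hard to scan, here is a small sketch for listing the template IDs that match a given label, newest first. The helper name `find_templates` is hypothetical, the sample entries are a subset copied from the output above, and the thread does not say which template SkyPilot actually references, so this is only for inspecting what an account can see.

```python
import json

# A few entries copied from the getTemplates output above.
SAMPLE = '''[
 {"id":"t0nspur5","label":"Ubuntu 22.04 Server","dtCreated":"2021-10-20T00:50:49.780Z"},
 {"id":"tvimtol9","label":"Ubuntu 22.04 ML in a Box","dtCreated":"2023-12-20T08:40:41.889Z"},
 {"id":"tqqsxr6b","label":"Ubuntu 22.04 ML in a Box","dtCreated":"2024-04-16T22:57:20.194Z"},
 {"id":"t9taj00e","label":null,"dtCreated":"2022-01-23T20:47:06.707Z"}
]'''


def find_templates(templates, label_substring):
    """Return (id, dtCreated) pairs whose label contains label_substring.

    Matching is case-insensitive; entries with a null label are skipped.
    Results are sorted newest first. The timestamps share one ISO 8601
    format, so plain string comparison orders them correctly.
    """
    needle = label_substring.lower()
    matches = [
        (t['id'], t['dtCreated'])
        for t in templates
        if needle in (t.get('label') or '').lower()
    ]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


templates = json.loads(SAMPLE)
for template_id, created in find_templates(templates, 'ML in a Box'):
    print(template_id, created)
```

To run it against a real account, save the curl output to a file and pass the result of `json.load` to `find_templates` instead of `SAMPLE`.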

sachin31198 commented 4 months ago

@asaiacai It was an account-related issue on my end. Thanks for your help.