skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

LLama 3.1 example not working with runpod #3873

Open Stealthwriter opened 2 months ago

Stealthwriter commented 2 months ago

When I run the Llama 3.1 example with RunPod, I get this error:

h/sky-key' root@69.30.85.136 -p 22035 -o StrictHostKeyChecking=no -o PasswordAuthentication=no -o ConnectTimeout=10s -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 uptime. Error: ConnectionRefusedError: [Errno 111] Connection refused
D 08-26 15:10:34 provisioner.py:400] Retrying in 1 second...
D 08-26 15:10:35 provisioner.py:323] Waiting for SSH to 69.30.85.136. Try: ssh -T -i '~/.ssh/sky-key' root@69.30.85.136 -p 22035 -o StrictHostKeyChecking=no -o PasswordAuthentication=no -o ConnectTimeout=10s -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 uptime. Error: ConnectionRefusedError: [Errno 111] Connection refused

What is the cause of it?

Version & Commit info:

cblmemo commented 2 months ago

I couldn't reproduce this on the latest master (6067e2bcd60c9ae48fa4fb883bbede4b98b4d545), but I ran into a different issue, which seems to indicate that nvcc is not installed in the RunPod image we chose. Leaving the logs here for reference:

WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable.It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
Collecting flash-attn==2.5.9.post1
  Downloading flash_attn-2.5.9.post1.tar.gz (2.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.6/2.6 MB 26.9 MB/s eta 0:00:00
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'error'
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [22 lines of output]
      fatal: not a git repository (or any of the parent directories): .git

      torch.__version__  = 2.3.0+cu121

      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-2ypxr2s0/flash-attn_dc621a75c8d24b60871cf595a9f62e64/setup.py", line 113, in <module>
          _, bare_metal_version = get_cuda_bare_metal_version(CUDA_HOME)
        File "/tmp/pip-install-2ypxr2s0/flash-attn_dc621a75c8d24b60871cf595a9f62e64/setup.py", line 65, in get_cuda_bare_metal_version
          raw_output = subprocess.check_output([cuda_dir + "/bin/nvcc", "-V"], universal_newlines=True)
        File "/root/miniconda3/envs/vllm/lib/python3.10/subprocess.py", line 421, in check_output
          return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
        File "/root/miniconda3/envs/vllm/lib/python3.10/subprocess.py", line 503, in run
          with Popen(*popenargs, **kwargs) as process:
        File "/root/miniconda3/envs/vllm/lib/python3.10/subprocess.py", line 971, in __init__
          self._execute_child(args, executable, preexec_fn, close_fds,
        File "/root/miniconda3/envs/vllm/lib/python3.10/subprocess.py", line 1863, in _execute_child
          raise child_exception_type(errno_num, err_msg, err_filename)
      FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/cuda/bin/nvcc'
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
Clusters
NAME                           LAUNCHED    RESOURCES                                                                  STATUS   AUTOSTOP  COMMAND                       
sky-c8a6-memory                2 mins ago  1x RunPod(2x_L4_SECURE, {'L4': 2}, disk_tier=best, disk_size=512, port...  UP       -         sky launch @temp/rp-bug.y...  
sky-serve-controller-402b1bba  4 hrs ago   1x AWS(m6i.xlarge, disk_size=200, ports=['30001-30020'])                   STOPPED  10m       sky serve up @temp/rp-bug...  

sky.exceptions.CommandError: Command /bin/bash -i /tmp/sky_setup_sky-2024-08-26-13-45-02-496160 2>&1 failed with return code 1.
Failed to setup with return code 1. Check the details in log: ~/sky_logs/sky-2024-08-26-13-45-02-496160/setup-82.221.170.242-30280.log

****** START Last lines of setup output ******
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
******* END Last lines of setup output *******
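
The traceback above fails because `/usr/local/cuda/bin/nvcc` does not exist on the pod. To confirm that diagnosis before rerunning setup, one could check for the compiler directly. The following is a minimal sketch, not part of SkyPilot or flash-attn: the function name `find_nvcc` and its parameters are hypothetical, the default directory `/usr/local/cuda` is taken from the traceback, and consulting the `CUDA_HOME` environment variable and `PATH` as fallbacks is an assumption about how the image might be laid out.

```python
import os
import shutil


def find_nvcc(cuda_home=None, path=None):
    """Return the path to an executable nvcc, or None if none is found.

    cuda_home: CUDA install directory to check first. Defaults to the
        CUDA_HOME environment variable, then /usr/local/cuda (the
        directory that appears in the traceback above).
    path: search path for the fallback lookup; None means the
        process's PATH.
    """
    if cuda_home is None:
        cuda_home = os.environ.get("CUDA_HOME", "/usr/local/cuda")

    # This mirrors the location the failing setup.py tried to execute.
    candidate = os.path.join(cuda_home, "bin", "nvcc")
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate

    # Fallback (assumption): nvcc may be installed elsewhere on PATH.
    return shutil.which("nvcc", path=path)
```

Calling `find_nvcc()` on the pod and getting `None` back would be consistent with the `FileNotFoundError` above, in which case the image likely ships only the CUDA runtime and not the toolkit.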

Stealthwriter commented 2 months ago

The issue was on RunPod's side; their A40s are the only pods that produce this error.

cblmemo commented 2 months ago

We found that the issue originates from a connectivity problem between pods within the same region in RunPod. We just filed an issue on the RunPod repository (runpod/runpod-python#337); let's see what they say.
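
For anyone hitting the same `ConnectionRefusedError`, a quick way to tell a refused or unreachable SSH port apart from a key or authentication problem is to attempt a plain TCP connection to the pod's host and port. This is a minimal sketch using only the Python standard library; the helper name `port_is_open` is hypothetical and is not how SkyPilot's provisioner performs its check.

```python
import socket


def port_is_open(host, port, timeout=10.0):
    """Return True if a TCP connection to (host, port) succeeds.

    A refused connection, an unreachable host, and a timeout all
    raise subclasses of OSError, so each of them yields False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `port_is_open("69.30.85.136", 22035)` uses the address and port from the log at the top of this issue. If it returns `False` from one pod but `True` from a machine outside RunPod, that would point to pod-to-pod connectivity rather than the SSH setup itself.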