Open Stealthwriter opened 2 months ago
I couldn't reproduce this on the latest master (6067e2bcd60c9ae48fa4fb883bbede4b98b4d545), but I encountered another issue, which seems to indicate there is no nvcc installed in the RunPod image we chose. Left the logs here for reference; a possible workaround sketch follows them:
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
Collecting flash-attn==2.5.9.post1
Downloading flash_attn-2.5.9.post1.tar.gz (2.6 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.6/2.6 MB 26.9 MB/s eta 0:00:00
Preparing metadata (setup.py): started
Preparing metadata (setup.py): finished with status 'error'
error: subprocess-exited-with-error
× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [22 lines of output]
fatal: not a git repository (or any of the parent directories): .git
torch.__version__ = 2.3.0+cu121
Traceback (most recent call last):
File "<string>", line 2, in <module>
File "<pip-setuptools-caller>", line 34, in <module>
File "/tmp/pip-install-2ypxr2s0/flash-attn_dc621a75c8d24b60871cf595a9f62e64/setup.py", line 113, in <module>
_, bare_metal_version = get_cuda_bare_metal_version(CUDA_HOME)
File "/tmp/pip-install-2ypxr2s0/flash-attn_dc621a75c8d24b60871cf595a9f62e64/setup.py", line 65, in get_cuda_bare_metal_version
raw_output = subprocess.check_output([cuda_dir + "/bin/nvcc", "-V"], universal_newlines=True)
File "/root/miniconda3/envs/vllm/lib/python3.10/subprocess.py", line 421, in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File "/root/miniconda3/envs/vllm/lib/python3.10/subprocess.py", line 503, in run
with Popen(*popenargs, **kwargs) as process:
File "/root/miniconda3/envs/vllm/lib/python3.10/subprocess.py", line 971, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "/root/miniconda3/envs/vllm/lib/python3.10/subprocess.py", line 1863, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/cuda/bin/nvcc'
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
× Encountered error while generating package metadata.
╰─> See above for output.
note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
Clusters
NAME LAUNCHED RESOURCES STATUS AUTOSTOP COMMAND
sky-c8a6-memory 2 mins ago 1x RunPod(2x_L4_SECURE, {'L4': 2}, disk_tier=best, disk_size=512, port... UP - sky launch @temp/rp-bug.y...
sky-serve-controller-402b1bba 4 hrs ago 1x AWS(m6i.xlarge, disk_size=200, ports=['30001-30020']) STOPPED 10m sky serve up @temp/rp-bug...
sky.exceptions.CommandError: Command /bin/bash -i /tmp/sky_setup_sky-2024-08-26-13-45-02-496160 2>&1 failed with return code 1.
Failed to setup with return code 1. Check the details in log: ~/sky_logs/sky-2024-08-26-13-45-02-496160/setup-82.221.170.242-30280.log
****** START Last lines of setup output ******
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
× Encountered error while generating package metadata.
╰─> See above for output.
note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
******* END Last lines of setup output *******
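Not a confirmed fix, but one workaround sketch for the missing nvcc: the log shows a miniconda env ("vllm") with torch 2.3.0+cu121, so installing the CUDA 12.1 toolchain into that env and pointing CUDA_HOME at it should let flash-attn's setup.py find nvcc. Something along these lines in the task's setup step (these commands are an assumption, not SkyPilot's or flash-attn's prescribed procedure):

# Workaround sketch (assumption): install nvcc + CUDA headers into the active
# conda env and point CUDA_HOME at it so setup.py can find $CUDA_HOME/bin/nvcc.
conda install -y -c "nvidia/label/cuda-12.1.0" cuda-nvcc cuda-toolkit
export CUDA_HOME="$CONDA_PREFIX"
"$CUDA_HOME/bin/nvcc" -V   # sanity check before building
pip install flash-attn==2.5.9.post1 --no-build-isolation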
The issue was from RunPod; their A40s are the only pods producing this error.
We found that the issue originates from a connectivity problem between pods within the same region in RunPod. Just filed an issue on the RunPod repository; let's see (runpod/runpod-python#337).
When I run the llama 3.1 example with RunPod I'm getting this error:
...h/sky-key' root@69.30.85.136 -p 22035 -o StrictHostKeyChecking=no -o PasswordAuthentication=no -o ConnectTimeout=10s -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 uptime. Error: ConnectionRefusedError: [Errno 111] Connection refused
D 08-26 15:10:34 provisioner.py:400] Retrying in 1 second...
D 08-26 15:10:35 provisioner.py:323] Waiting for SSH to 69.30.85.136. Try: ssh -T -i '~/.ssh/sky-key' root@69.30.85.136 -p 22035 -o StrictHostKeyChecking=no -o PasswordAuthentication=no -o ConnectTimeout=10s -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 uptime. Error: ConnectionRefusedError: [Errno 111] Connection refused
What is the cause of it?
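One thing worth checking (a debugging sketch using the exact key, IP, and port from the log above): run the same SSH probe by hand and see whether the pod's sshd ever starts listening on the exposed port.

# Same probe SkyPilot is retrying; a persistent "Connection refused" usually
# means sshd inside the pod is not listening on the mapped port yet (or at all).
ssh -T -i ~/.ssh/sky-key root@69.30.85.136 -p 22035 \
  -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null \
  -o IdentitiesOnly=yes -o ConnectTimeout=10s uptime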
Version & Commit info:
sky -v: PLEASE_FILL_IN
sky -c: PLEASE_FILL_IN