meta-llama / codellama

Inference code for CodeLlama models

Missing CUDA library files cause a crash when I start torchrun #231

Open ichibrosan opened 6 months ago

ichibrosan commented 6 months ago

My operating system is Ubuntu Linux 22.04:

```
$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
```

To get PyTorch with CUDA under conda, I am using ActiveState Python with the following configuration: (screenshot of the ActiveState configuration)

I am starting up with:

```sh
#!/bin/sh

torchrun --nproc_per_node 1 example_instructions.py \
    --ckpt_dir CodeLlama-7b-Instruct/ \
    --tokenizer_path CodeLlama-7b-Instruct/tokenizer_model \
    --max_seq_len 512 --max_batch_size 4
```

and torchrun is crashing over missing libraries.

```
Traceback (most recent call last):
  File "/home/doug/.cache/activestate/cb772d80/usr/bin/torchrun", line 5, in <module>
    import torch.distributed.run
  File "/home/doug/.cache/activestate/cb772d80/usr/lib/python3.10/site-packages/torch/__init__.py", line 191, in <module>
    _load_global_deps()
  File "/home/doug/.cache/activestate/cb772d80/usr/lib/python3.10/site-packages/torch/__init__.py", line 153, in _load_global_deps
    ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
  File "/home/doug/.cache/activestate/cb772d80/usr/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcufft.so.10: cannot open shared object file: No such file or directory
```
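For what it's worth, the failing call can be reproduced outside of torch, which makes it easier to test fixes (e.g. after adjusting `LD_LIBRARY_PATH`). This is a minimal sketch: `libcufft.so.10` is taken from the error message above, while the other two library names are assumptions about typical CUDA runtime dependencies and may differ on your install.

```python
import ctypes

# Mirror torch's _load_global_deps() dlopen call directly, so a
# "cannot open shared object file" failure can be reproduced and
# retested without importing torch at all.
# libcufft.so.10 comes from the traceback; the others are assumed
# common CUDA deps and may not match every build.
for libname in ("libcufft.so.10", "libcudart.so", "libcublas.so"):
    try:
        ctypes.CDLL(libname, mode=ctypes.RTLD_GLOBAL)
        print(f"{libname}: loaded OK")
    except OSError as exc:
        print(f"{libname}: {exc}")
```

If a library reports "cannot open shared object file", pointing `LD_LIBRARY_PATH` at the directory that contains it (inside the conda/ActiveState environment) before running torchrun is one way to test whether that is the only missing piece.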