unslothai / unsloth

Finetune Llama 3.1, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

LLVM ERROR: Cannot select: intrinsic %llvm.nvvm.shfl.sync.bfly.i32 Aborted #642

Open mathysferrato opened 2 months ago

mathysferrato commented 2 months ago

Hi, I tried to train a quantized model that fits my VRAM, since I have a GTX 1070 Ti, but I got the error below, which my friend did not get on his RTX 2070 (same amount of VRAM, but a more recent architecture):

[Screenshot of the error traceback]

I found people hitting a similar issue in other posts (such as https://github.com/state-spaces/mamba/issues/173), but the only real solution offered was to buy a newer GPU. Is that really the only way?

I am using a conda env I set up with the commands on the GitHub main page.

I also found this on the main page, so it should work with my GPU:

[Screenshot: Screenshot_20240615_102411_GitHub.jpg]

I know it's linked to the GPU architecture and the compute capability (6.1 for the GTX 1070 Ti), but isn't there a way to change a few lines in the packages' code to make it work?
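For reference, here is a quick way to check what PyTorch reports for the card (a minimal sketch, not from the original report; it only assumes PyTorch with CUDA support is installed):

```python
import torch

# Print the CUDA compute capability of the first visible GPU.
# Triton, which Unsloth's kernels are built on, lists compute capability 7.0+
# (Volta and newer) as supported, so Pascal cards (6.x) like the GTX 1070 Ti
# can fail on warp-shuffle intrinsics such as llvm.nvvm.shfl.sync.bfly.i32.
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"{torch.cuda.get_device_name(0)}: compute capability {major}.{minor}")
else:
    print("No CUDA device visible to PyTorch")
```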

Thanks for your help,

danielhanchen commented 2 months ago

You could try installing the first ever release of Unsloth (under Tags) - it might or might not work.
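If you go that route, the install would look something like this (the tag name below is a placeholder, not a real tag - pick the actual first release from the repository's Tags page):

```
pip install "unsloth @ git+https://github.com/unslothai/unsloth.git@<first-release-tag>"
```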

ssancheti commented 4 weeks ago

Running into the same issue:

Unsloth 2024.8 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
Unsloth: Will map <|im_end|> to EOS = <|end_of_text|>.
/usr/local/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
num_proc must be <= 1. Reducing num_proc to 1 for dataset of size 1.
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 1 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 4 | Gradient Accumulation steps = 4
\        /    Total batch size = 16 | Total steps = 1
 "-____-"     Number of trainable parameters = 41,943,040
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.8: Fast Llama patching. Transformers = 4.44.0.
   \\   /|    GPU: Tesla P100-PCIE-16GB. Max memory: 15.888 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.0+cu121. CUDA = 6.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.26.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
trainable params: 41,943,040 || all params: 8,072,204,288 || trainable%: 0.5196
  0%|          | 0/1 [00:00<?, ?it/s]LLVM ERROR: Cannot select: intrinsic %llvm.nvvm.shfl.sync.bfly.i32

danielhanchen commented 3 weeks ago

Wait, P100s don't have bfloat16 - you have to set fp16 = True and bf16 = False during training.
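For anyone hitting this, a minimal sketch of that change in a typical `TrainingArguments` setup (illustrative only - the non-precision values below just mirror the log above and are not prescribed by this thread):

```python
import torch
from transformers import TrainingArguments

# The P100 (compute capability 6.0) has no bfloat16 support, so train in fp16.
# torch.cuda.is_bf16_supported() returns False there, which makes this pick
# fp16 automatically; hard-coding fp16=True, bf16=False is equivalent on a P100.
use_bf16 = torch.cuda.is_bf16_supported()

training_args = TrainingArguments(
    output_dir="outputs",             # placeholder path
    per_device_train_batch_size=4,    # matches the log above
    gradient_accumulation_steps=4,
    max_steps=1,
    fp16=not use_bf16,
    bf16=use_bf16,
)
```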