Open mleiter696 opened 1 month ago
> You're using the bitsandbytes 4-bit version, which I'm not sure SageMaker supports
I am pretty new to this, so I had no idea that second version existed... I tried `unsloth/Meta-Llama-3.1-8B-Instruct` instead and now get:
```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU
```
which I guess is why I needed the bnb version in the first place. I am running out of ideas; an NVIDIA T4 is the best GPU available to me, so I may have to try running this on EC2 without SageMaker.
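(For rough numbers: 8B parameters at 16-bit precision is about 8e9 × 2 bytes ≈ 16 GB of weights alone, which already fills the T4's 16 GB before the CUDA context, activations, and KV cache, while a 4-bit load is roughly 5–6 GB, so on this card the bnb version seems more or less required.)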
I also tried

```
"HF_MODEL_ID": "unsloth/Meta-Llama-3.1-8B-Instruct",
"HF_MODEL_QUANTIZE": "bitsandbytes",  # [possible values: awq, eetq, exl2, gptq, marlin, bitsandbytes, bitsandbytes-nf4, bitsandbytes-fp4, fp8]
```
but I got `RuntimeError: FlashAttention only supports Ampere GPUs or newer.`, which is why I moved away from the Meta-Llama version in the first place.
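For context, the deployment route I'm following is the standard Hugging Face LLM (TGI) container pattern, roughly like the sketch below. The image version, instance type, and token limits here are placeholders I'm filling in for illustration, not my exact values; only the model id and quantize option come from the snippet above.

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

# Env vars passed to the Hugging Face LLM (TGI) container.
hub = {
    "HF_MODEL_ID": "unsloth/Meta-Llama-3.1-8B-Instruct",
    "HF_MODEL_QUANTIZE": "bitsandbytes",  # or "bitsandbytes-nf4" / "bitsandbytes-fp4"
    "SM_NUM_GPUS": "1",
    "MAX_INPUT_LENGTH": "2048",
    "MAX_TOTAL_TOKENS": "4096",
}

# Container version is illustrative; use whichever LLM image version you actually deploy.
image_uri = get_huggingface_llm_image_uri("huggingface", version="2.2.0")

model = HuggingFaceModel(role=role, image_uri=image_uri, env=hub)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",  # single NVIDIA T4, 16 GB
    container_startup_health_check_timeout=600,
)
```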
I see now that the Google Colab notebook only had the BNB version available, which I suppose is how that one is meant to be run. My goal now is to try to make the BNB version work in SageMaker, but I'm not sure what is missing or whether it's even possible here.
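For reference, outside of SageMaker the BNB version loads with plain transformers + bitsandbytes along these lines. The exact `-bnb-4bit` repo name and the NF4 settings below are my assumptions based on the Colab notebook, not something I've confirmed against the serving container:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Assumed model id for "the BNB version" of the instruct model.
model_id = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # T4 is pre-Ampere, so fp16 rather than bf16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,  # transformers will warn that the checkpoint already carries a quantization config
    device_map="auto",
)
print(f"weights in memory: {model.get_memory_footprint() / 1e9:.1f} GB")
```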
I am trying to deploy Unsloth models to SageMaker model endpoints on a T4 GPU (this is important, as it is the best GPU I can use).
I matched the transformers and torch/PyTorch versions to the ones in the Colab notebook.
When I try to deploy this model, I get the following error:
```
RuntimeError: mat1 and mat2 shapes cannot be multiplied (4146x4096 and 1x12582912)
```
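One thing I notice about the shapes, though this is only a guess from the numbers: 12582912 = 4096 × 6144 / 2, which is exactly the size of a fused QKV weight for this model (hidden size 4096, combined Q/K/V width 6144) packed two 4-bit values per byte, so it looks like a bitsandbytes-packed weight is being used in a plain matmul instead of being dequantized first; the 4146 would just be the number of input tokens.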
Do I need to change the quantize option? What could be wrong here?