Closed: ijoffe closed this issue 1 year ago.
Having the same issue.
I remember fixing this error by updating transformers to 4.31
I'm on 4.31. I actually had this working a few days ago but now it's not.
Thanks for the replies! Apparently, one of my virtual environments was still on transformers 4.30.0, so upgrading to 4.31.0 fixed the issue.
For anyone else experiencing this, these were the package versions that solved the problem for me (from running pip list):
Package Version
------------------------ ----------
accelerate 0.21.0
bitsandbytes 0.41.0
certifi 2023.7.22
charset-normalizer 3.2.0
cmake 3.27.0
filelock 3.12.2
fsspec 2023.6.0
huggingface-hub 0.16.4
idna 3.4
Jinja2 3.1.2
lit 16.0.6
MarkupSafe 2.1.3
mpmath 1.3.0
mypy-extensions 1.0.0
networkx 3.1
numpy 1.25.1
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-cupti-cu11 11.7.101
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
nvidia-cufft-cu11 10.9.0.58
nvidia-curand-cu11 10.2.10.91
nvidia-cusolver-cu11 11.4.0.1
nvidia-cusparse-cu11 11.7.4.91
nvidia-nccl-cu11 2.14.3
nvidia-nvtx-cu11 11.7.91
packaging 23.1
pip 22.3.1
psutil 5.9.5
pyre-extensions 0.0.29
PyYAML 6.0.1
regex 2023.6.3
requests 2.31.0
safetensors 0.3.1
scipy 1.11.1
setuptools 65.5.0
sympy 1.12
tokenizers 0.13.3
torch 2.0.1
tqdm 4.65.0
transformers 4.31.0
triton 2.0.0
typing_extensions 4.7.1
typing-inspect 0.9.0
urllib3 2.0.4
wheel 0.41.0
xformers 0.0.20
Thanks!
@ijoffe As I understand it, the "chat" model is built to take chat data as input, similar to ChatGPT:
[
    {"role": "system", "content": "<content>"},
    {"role": "user", "content": "<content>"},
    {"role": "assistant", "content": "<content>"},
]
Have you figured out how to feed this type of data into the huggingface llama2 model?
Thanks!
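For what it's worth, the 70B chat checkpoint expects those role-based messages to be flattened into a single prompt string using Meta's [INST] / <<SYS>> template. Below is a minimal sketch, not from this thread: build_llama2_prompt is a hypothetical helper, and the template follows Meta's published single-turn chat format.

def build_llama2_prompt(system_msg, user_msg):
    # Single-turn Llama 2 chat template: the system prompt is wrapped in
    # <<SYS>> tags and the user turn in [INST] ... [/INST]. The tokenizer
    # adds the BOS token (<s>) automatically, so it is not included here.
    return (
        "[INST] <<SYS>>\n"
        f"{system_msg}\n"
        "<</SYS>>\n\n"
        f"{user_msg} [/INST]"
    )

text = build_llama2_prompt(
    "You are a helpful assistant.",
    "Summarize what 4-bit quantization does in one sentence.",
)
# Feed `text` to the text-generation pipeline shown later in this thread;
# the model's completion plays the assistant role.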
@ijoffe Could you put up gist or paste in the script you ended up with to load the 4-bit models? I can probably piece it together from your original post but a complete example would be super helpful!
For sure! This code worked for me, here it is:
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    pipeline,
    BitsAndBytesConfig,
)
import torch

name = "meta-llama/Llama-2-70b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(name)
tokenizer.pad_token_id = tokenizer.eos_token_id  # for open-ended generation

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
generation_pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    trust_remote_code=True,
    device_map="auto",  # finds GPU
)

text = "any text "  # prompt goes here
sequences = generation_pipe(
    text,
    max_length=128,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
    do_sample=True,
    top_k=10,
    temperature=0.4,
    top_p=0.9,
)
print(sequences[0]["generated_text"])
Can I run Llama-2-70b-chat-hf with 4x RTX 3090s? Is there any document I can refer to?
Not sure about any reference document, but those are 24GB GPUs right? I got this running on one 48GB GPU, so even with the parallelization overhead I bet you could get this running if you have 4.
OK, I'll try it. Thanks a lot!
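As a supplement (not from the original reply), here is a minimal sketch of how the 4-bit load could be spread across four 24 GB cards by capping per-GPU memory; the 22GiB cap and the four-GPU layout are assumptions meant to leave headroom for activations.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-chat-hf",
    quantization_config=bnb_config,
    device_map="auto",                          # let accelerate place layers across GPUs
    max_memory={i: "22GiB" for i in range(4)},  # assumed cap per 24 GB RTX 3090
)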
@ijoffe Did you run this with quantization or without? If you did use quantization, how many bits did you use?
@yanxiyue take a look at @ijoffe's code snippet above; load_in_4bit=True is set in their quantization_config.
thanks for the additional context!
Hey @ijoffe, what is the exact purpose of passing pad_token_id and eos_token_id? Thanks
Hey @hassanzadeh, this just ensures the tokenizer and model are on the same page when it comes to the special tokens. I'm not sure if it's required, but it theoretically ensures the LLM stops generating output once the EOS token is reached.
I see, thanks for your quick response :)
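To make the role of those two arguments concrete, here is a small sketch (not from the thread) of the equivalent direct generate() call; it assumes the model and tokenizer objects from the snippet above.

inputs = tokenizer("any text ", return_tensors="pt").to(model.device)
output_ids = model.generate(
    **inputs,
    max_length=128,
    do_sample=True,
    top_k=10,
    top_p=0.9,
    temperature=0.4,
    pad_token_id=tokenizer.eos_token_id,  # Llama has no pad token, so EOS is reused for padding
    eos_token_id=tokenizer.eos_token_id,  # generation stops once this token is produced
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))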
Hey @ijoffe,
After quantizing the model to 4-bit, do you think this could be run on vCPUs only? If so, what would the CPU and memory specs need to be?
@ijoffe, what are your deepspeed, accelerate, and transformers versions? I still get the following error:
FP4 quantization state not initialized. Please call .cuda() or .to(device) on the LinearFP4 layer first.
The following combination works; the transformers and accelerate versions are as below: transformers==4.31.0 and accelerate==0.21.0.
Hi @m4dc4p, @Mega4alik, @robinsonmhj, @yanxiyue, @hassanzadeh, I am getting this issue: "Access to model meta-llama/Llama-2-70b-chat-hf is restricted. You must be authenticated to access it." I got access, but how can I pass a token or username in the above code? Can someone please help with this?
Hi @hassanzadeh / @yanxiyue, I am getting this issue: "Access to model meta-llama/Llama-2-70b-chat-hf is restricted. You must be authenticated to access it." I got access, but how can I pass a token or username in the above code? Can someone please help with this?
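Not an official answer, but a minimal sketch of the usual options for gated repos; the token value is a placeholder, and use_auth_token was the accepted keyword around transformers 4.31 (newer releases use token instead).

from huggingface_hub import login
from transformers import AutoTokenizer, AutoModelForCausalLM

login(token="hf_xxx")  # or run `huggingface-cli login` once; stores the token locally

name = "meta-llama/Llama-2-70b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(name, use_auth_token=True)
model = AutoModelForCausalLM.from_pretrained(
    name,
    use_auth_token=True,  # picks up the stored / logged-in token
    device_map="auto",
)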
Has anyone been able to get the LLaMA-2 70B model to run inference in 4-bit quantization using HuggingFace? Here are some variations of code that I've tried based on various guides:
When running all of these variations, I am able to load the model on a 48GB GPU, but making the following call produces an error:
The error message is as follows:
What am I doing wrong? Is this even possible? Has anyone been able to get this 4-bit quantization working?
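Since the resolution in this thread turned out to be a version mismatch, a quick environment check (a sketch, not part of the original post) can rule that out before debugging further.

import transformers
import accelerate
import bitsandbytes

print("transformers:", transformers.__version__)   # 4.31.0 reported working above
print("accelerate:  ", accelerate.__version__)     # 0.21.0 reported working above
print("bitsandbytes:", bitsandbytes.__version__)   # 0.41.0 reported working above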