Closed: Abhaycnvrg closed this issue 11 months ago.
Hi @Abhaycnvrg, thanks for raising this issue!
Firstly - please change your authentication key and remove it from this example (I removed it, but it will still be in the history); it should be kept secret.
We have many requests for help, which we can only attend to at a decent pace if you help us too. Could you please run
transformers-cli env
in the terminal and copy-paste the output?

transformers version: 4.36.0.dev0

# Code to reproduce the error
model_config = transformers.AutoConfig.from_pretrained(
model_id,
use_auth_token=hf_auth)
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16)
model = transformers.AutoModelForCausalLM.from_pretrained(
model_id,
trust_remote_code=True,
config=model_config,
quantization_config=bnb_config,
device_map='auto',
use_auth_token=hf_auth,
offload_folder="save_folder")
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id, use_auth_token=hf_auth)
generate_text = transformers.pipeline(
model=model, tokenizer=tokenizer,
return_full_text=True,
task='text-generation',
temperature=0.1,
max_new_tokens=4096,
repetition_penalty=1.1)
print("loaded model")
llm = HuggingFacePipeline(pipeline=generate_text)
cc @younesbelkada
Hi everyone!
I used to face this issue sometimes when using Google Colab and when libraries were not correctly installed. In case you are using a Kaggle or Google Colab notebook, can you try to delete the runtime and restart it again? If not, retry everything in a fresh new environment, making sure to install all the required packages: pip install transformers accelerate bitsandbytes
We are using a workspace, not Google Colab. It would be helpful for us if you could tell us which versions we should use.
python == 3.9.16 / transformers == 4.35.0 / accelerate 0.25.dev0 (from source) / bitsandbytes 0.41.1
can you also run:
>>> from transformers.utils.import_utils import is_accelerate_available, is_bitsandbytes_available
>>> is_accelerate_available()
True
>>> is_bitsandbytes_available()
True
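If those helpers return False, a quick way to narrow down which prerequisite is missing (a diagnostic sketch, not part of the original thread) is:
# Check the pieces the availability helpers typically depend on:
# the packages themselves and a CUDA device visible to torch.
import importlib.util
import torch
print("bitsandbytes installed:", importlib.util.find_spec("bitsandbytes") is not None)
print("accelerate installed:", importlib.util.find_spec("accelerate") is not None)
print("CUDA visible to torch:", torch.cuda.is_available())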
I tried in the exact same environment
Somehow, bitsandbytes isn't being made available. @younesbelkada and @amyeroberts, can you help? The code is below:
!pip install -q -U bitsandbytes==0.41.1
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U accelerate==0.25.dev0
!pip install -q -U einops
!pip install -q -U safetensors
!pip install -q -U torch
!pip install -q -U xformers
!pip install -q -U langchain
!pip install -q -U ctransformers[cuda]
!pip install chromadb
!pip install sentence-transformers
!pip install -q -U accelerate
!pip install bitsandbytes
!pip install -i https://test.pypi.org/simple/ bitsandbytes
!pip install --upgrade langchain
!pip install transformers==4.35
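After the installs above, a quick sanity check (a minimal sketch, not part of the original snippet) can confirm which versions the interpreter actually picks up:
# Print the installed versions of the key packages; a missing entry points at a broken install.
import importlib.metadata as md
for pkg in ("transformers", "accelerate", "bitsandbytes", "torch"):
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "NOT INSTALLED")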
# loading packages
from torch import cuda, bfloat16
import transformers
from transformers import StoppingCriteria, StoppingCriteriaList
import torch
from langchain.document_loaders import UnstructuredFileLoader
from langchain.chains.summarize import load_summarize_chain
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import HuggingFacePipeline
from transformers import AutoTokenizer, AutoModelForCausalLM
from langchain import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
import accelerate
import bitsandbytes
base_model_id = "mistralai/Mistral-7B-Instruct-v0.1"
baseline = AutoModelForCausalLM.from_pretrained(base_model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
print("loaded all packages")
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'
print("Printing Device...")
print(device)
print("loading model....")
model_id = "mistralai/Mistral-7B-Instruct-v0.1"
hf_auth = '<YOUR_HF_TOKEN>'  # keep this token secret; do not post it publicly
model_config = transformers.AutoConfig.from_pretrained(
model_id,
use_auth_token=hf_auth#,
# load_in_8bit=False
)
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)
model = transformers.AutoModelForCausalLM.from_pretrained(
model_id,
trust_remote_code=True,
config=model_config,
quantization_config=bnb_config,
# load_in_8bit=False,
device_map='auto',
use_auth_token=hf_auth,
offload_folder="save_folder"
)
# Load model directly
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id, use_auth_token=hf_auth)
generate_text = transformers.pipeline(
model=model, tokenizer=tokenizer,
return_full_text=True,
task='text-generation',
temperature=0.1,
max_new_tokens=4096,
repetition_penalty=1.1
)
print("loaded model")
llm = HuggingFacePipeline(pipeline=generate_text)
I see PyTorch Version (GPU?) - xxx (False) - maybe your bitsandbytes install is broken because CUDA / the GPU hardware is not properly detected. Are you able to run !nvidia-smi inside your notebook, and what is its outcome?
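If nvidia-smi is not on the PATH in that workspace, torch can report roughly the same information (a hedged alternative, not from the original thread):
# Report whether torch sees a CUDA device, and which one.
import torch
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("CUDA runtime version:", torch.version.cuda)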
So it doesn't work without a GPU? Only on CPU, for example?
Yes, bnb does not work on CPU; you need to have access to a GPU. You can for instance use a free-tier Google Colab instance, which provides a decent 16GB NVIDIA T4 GPU.
What do you mean by bnb? You mean Mistral doesn't work on CPU?
I meant bitsandbytes, i.e. all the quantization features such as load_in_8bit or load_in_4bit.
Okay, thanks! Can Mistral work without the quantization code? I mean, we just want to run inference.
On CPU yes, but it might be slow; please consider using Mistral-7B on a free-tier Google Colab instance with bitsandbytes 4-bit:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "ybelkada/Mistral-7B-v0.1-bf16-sharded"
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True)
I advise using https://huggingface.co/ybelkada/Mistral-7B-v0.1-bf16-sharded, as the weights are split into smaller shards (~2GB); otherwise loading the Mistral weights on Google Colab will lead to a CPU OOM.
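For completeness, a hedged sketch of how the 4-bit model above could then be used for generation (the prompt and generation settings are illustrative placeholders, not from the thread):
# Load the sharded checkpoint in 4-bit and run a short generation on the GPU.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ybelkada/Mistral-7B-v0.1-bf16-sharded"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True, device_map="auto")

inputs = tokenizer("Explain quantization in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))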
Thanks for this code, but can you point me to a tutorial post which does inference with Mistral on CPU only? We are using custom machines with limited scalability, so OOM should not be a problem.
For CPU only:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "ybelkada/Mistral-7B-v0.1-bf16-sharded"
model = AutoModelForCausalLM.from_pretrained(model_id, low_cpu_mem_usage=True)
That should do the trick. If you want to load the model in bfloat16:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "ybelkada/Mistral-7B-v0.1-bf16-sharded"
model = AutoModelForCausalLM.from_pretrained(model_id, low_cpu_mem_usage=True, torch_dtype=torch.bfloat16)
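Once loaded, CPU generation would look roughly like this (a sketch assuming the bfloat16 load above succeeded; the prompt and token budget are placeholders):
# Tokenize a prompt and generate on CPU; expect this to be slow for a 7B model.
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Hello, Mistral!", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))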
Hey @younesbelkada, I tried with a GPU but got this error:
RuntimeError: Failed to import transformers.models.mistral.modeling_mistral because of the following error (look up to see its traceback): Failed to import transformers.integrations.peft because of the following error (look up to see its traceback): cannot import name 'dispatch_model' from 'accelerate' (unknown location)
transformers = 4.36.dev0, 4.35, 4.35.2; accelerate = 0.25.0.dev0; bitsandbytes = 0.41.2.post2; python = 3.10.6
Hello @Abhaycnvrg
dispatch_model should still be in the accelerate init: https://github.com/huggingface/accelerate/blob/main/src/accelerate/__init__.py#L8
Can you share how you installed accelerate?
Can you also try to uninstall accelerate and re-install it? pip uninstall accelerate && pip install -U accelerate
Could you also share the full error traceback in case the error still persists?
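One quick way to see which accelerate installation Python is actually importing (a diagnostic sketch, not part of the original reply):
# Show the resolved accelerate version and file path, then try the failing import directly.
import accelerate
print(accelerate.__version__, accelerate.__file__)

from accelerate import dispatch_model  # should import cleanly on a healthy install
print("dispatch_model imported from", dispatch_model.__module__)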
I get the same error after following these commands: pip uninstall accelerate && pip install -U accelerate
My Python version is 3.10.6; is that the cause of the problem? Also, can you suggest a container image with both Python 3.9.16 and CUDA installed in it, so that I can test with a GPU? @younesbelkada and @amyeroberts
Which torch and torchvision versions do I need to use with Mistral and a GPU?
Also, can you suggest which container image (from NVIDIA or Docker Hub) I should use for running this? Are these ones okay: https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-23-03.html#rel-23-03 or https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-23-07.html#rel-23-07?
@Abhaycnvrg transformers officially supports Python 3.8 and above. You can find images by searching Docker Hub - the Hugging Face one for pytorch-gpu is here. The compatible versions of library packages can be found in setup.py. Running pip install transformers will find and install the compatible packages, and warn you if this isn't possible based on different libraries' requirements. You don't need torchvision for Mistral - it's an LLM.
Thanks @amyeroberts, so if I do the following:
from torch import cuda, bfloat16
import transformers
from transformers import StoppingCriteria, StoppingCriteriaList
import torch
from langchain.document_loaders import UnstructuredFileLoader
from langchain.chains.summarize import load_summarize_chain
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import HuggingFacePipeline
from transformers import AutoTokenizer, AutoModelForCausalLM
from langchain import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
import accelerate
import bitsandbytes
#base_model_id = "mistralai/Mistral-7B-Instruct-v0.1"
#baseline = AutoModelForCausalLM.from_pretrained(base_model_id, device_map="auto")
#tokenizer = AutoTokenizer.from_pretrained(base_model_id)
print("loaded all packages")
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'
print("Printing Device...")
print(device)
print("loading model....")
model_id = "mistralai/Mistral-7B-Instruct-v0.1"
model_id = "ybelkada/Mistral-7B-v0.1-bf16-sharded"
model_config = transformers.AutoConfig.from_pretrained(
model_id,
use_auth_token=hf_auth#,
# load_in_8bit=False
)
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
    use_auth_token=hf_auth,  # hf_auth (your HF token) must be defined earlier in this script
    offload_folder="save_folder"
)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id, use_auth_token=hf_auth)
generate_text = transformers.pipeline(
model=model, tokenizer=tokenizer,
return_full_text=True,
task='text-generation',
temperature=0.1,
max_new_tokens=4096,
repetition_penalty=1.1
)
print("loaded model")
llm = HuggingFacePipeline(pipeline=generate_text)
should it work?
As you're running the dev version of accelerate, I can't guarantee that it will be compatible with all of the other packages. Why not run it and find out? 🤷♀️
Sure, we will try, but if you have any other recommended configs, please do send them.
Hi @jitender-cnvrg @Abhaycnvrg, thanks a lot for iterating. Loading a model in 4-bit / 8-bit should work out of the box on a simple free-tier Google Colab instance. Make sure to select T4 as the runtime type. I made a quick example here: https://colab.research.google.com/drive/1zia3Q9FXhNHOhdwA9p8zD4qgPEWkZvHl?usp=sharing and made sure it works.
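For reference, a minimal sketch of the kind of 4-bit load such a notebook typically demonstrates (assuming a T4-class GPU; the actual notebook cells may differ):
# Configure NF4 4-bit quantization and load Mistral with it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.1"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)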
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
transformers version: 4.30, 4.31, 4.34, 4.35; python version: 3.11.1, 3.11.5, 3.8
Who can help?
No response
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
Run this
Expected behavior
We should get inference.