huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

KeyError: 'mistral' (for transformers version = 4.30) and ImportError: Using `load_in_8bit=True` requires Accelerate (for transformers version > 4.30) #27376

Closed. Abhaycnvrg closed this issue 11 months ago.

Abhaycnvrg commented 1 year ago

System Info

transformers versions: 4.30, 4.31, 4.34, 4.35; Python versions: 3.11.1, 3.11.5, 3.8

Who can help?

No response

Information

Tasks

Reproduction

1. # loading packages
from torch import cuda, bfloat16
import transformers
from transformers import StoppingCriteria, StoppingCriteriaList
import torch
from langchain.document_loaders import UnstructuredFileLoader
from langchain.chains.summarize import load_summarize_chain
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import HuggingFacePipeline
from transformers import AutoTokenizer, AutoModelForCausalLM
from langchain import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
import accelerate
base_model_id = "mistralai/Mistral-7B-Instruct-v0.1"
baseline = AutoModelForCausalLM.from_pretrained(base_model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
print("loaded all packages")
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'
print("Printing Device...")
print(device)
print("loading model....")
model_id = "mistralai/Mistral-7B-Instruct-v0.1"
hf_auth = "<your-hf-token>"  # placeholder; the original token was redacted by a maintainer
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_auth,
    # load_in_8bit=False
)
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',  # fixed typo: was "bnb_4bit_quant_tyoe"
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
#    load_in_8bit=False,
    device_map='auto',
    use_auth_token=hf_auth,
    offload_folder="save_folder"
)
# Load model directly
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id, use_auth_token=hf_auth)
generate_text = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    return_full_text=True, 
    task='text-generation',
    temperature=0.1,  
    max_new_tokens=4096,  
    repetition_penalty=1.1 
)
print("loaded model")
llm = HuggingFacePipeline(pipeline=generate_text)

Run this

Expected behavior

We should get inference output.

amyeroberts commented 1 year ago

Hi @Abhaycnvrg, thanks for raising this issue!

Firstly, please change your authentication key and ~remove it from this example~ (I removed it, but it will still be in the history); it should be secret.

amyeroberts commented 1 year ago

We have many requests for help, which we can only attend to at a decent pace if you help us too. Could you please:

Abhaycnvrg commented 1 year ago
jitender-cnvrg commented 1 year ago
# Code to reproduce the error (imports, model_id and hf_auth as in the listing above)
model_config = transformers.AutoConfig.from_pretrained(
                model_id,
                use_auth_token=hf_auth)

bnb_config = transformers.BitsAndBytesConfig(load_in_4bit=True,
             bnb_4bit_quant_type='nf4',  # fixed typo: was "bnb_4bit_quant_tyoe"
             bnb_4bit_use_double_quant=True,
             bnb_4bit_compute_dtype=bfloat16)

model = transformers.AutoModelForCausalLM.from_pretrained(
        model_id,
        trust_remote_code=True,
        config=model_config,
        quantization_config=bnb_config,
        device_map='auto',
        use_auth_token=hf_auth,
        offload_folder="save_folder")

tokenizer = transformers.AutoTokenizer.from_pretrained(model_id, use_auth_token=hf_auth)
generate_text = transformers.pipeline(
                model=model, tokenizer=tokenizer,
                return_full_text=True, 
                task='text-generation',
                temperature=0.1,  
                max_new_tokens=4096,  
                repetition_penalty=1.1)
print("loaded model")
llm = HuggingFacePipeline(pipeline=generate_text)

Error

(error traceback attached as a screenshot)

amyeroberts commented 1 year ago

cc @younesbelkada

younesbelkada commented 1 year ago

Hi everyone! I used to face this issue sometimes when using Google Colab with libraries that were not correctly installed. If you are using a Kaggle or Google Colab notebook, can you try deleting the runtime and restarting it? If not, retry everything in a fresh new environment, making sure to install all the required packages: `pip install transformers accelerate bitsandbytes`.
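
As a quick follow-up sketch (not part of the original reply; it only assumes the standard `__version__` attributes of these packages), the fresh environment can be sanity-checked like this:

import torch
import transformers
import accelerate
import bitsandbytes

# print the resolved versions so any mismatch is visible at a glance
print("transformers:", transformers.__version__)
print("accelerate:", accelerate.__version__)
print("bitsandbytes:", bitsandbytes.__version__)
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())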

Abhaycnvrg commented 1 year ago

We are using a workspace, not Google Colab. It would be helpful if you could tell us which

  1. Python version
  2. transformers version
  3. accelerate version
  4. bitsandbytes version

are to be used. The exact version numbers, please.
younesbelkada commented 1 year ago

python == 3.9.16 / transformers == 4.35.0 / accelerate 0.25.dev0 (from source) / bitsandbytes 0.41.1

younesbelkada commented 1 year ago

Can you also run:

>>> from transformers.utils.import_utils import is_accelerate_available, is_bitsandbytes_available
>>> is_accelerate_available()
True
>>> is_bitsandbytes_available()
True
Abhaycnvrg commented 1 year ago

I tried in the exact same environment (screenshots attached).

Somehow, bitsandbytes isn't being made available. @younesbelkada and @amyeroberts, can you help? The code is below.

!pip install -q -U bitsandbytes==0.41.1
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U accelerate==0.25.dev0
!pip install -q -U einops
!pip install -q -U safetensors
!pip install -q -U torch
!pip install -q -U xformers
!pip install -q -U langchain
!pip install -q -U ctransformers[cuda]
!pip install chromadb
!pip install sentence-transformers
!pip install -q -U accelerate
!pip install bitsandbytes
!pip install -i https://test.pypi.org/simple/ bitsandbytes
!pip install --upgrade langchain
!pip install transformers==4.35
# loading packages
from torch import cuda, bfloat16
import transformers
from transformers import StoppingCriteria, StoppingCriteriaList
import torch
from langchain.document_loaders import UnstructuredFileLoader
from langchain.chains.summarize import load_summarize_chain
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import HuggingFacePipeline
from transformers import AutoTokenizer, AutoModelForCausalLM
from langchain import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
import accelerate
import bitsandbytes
base_model_id = "mistralai/Mistral-7B-Instruct-v0.1"
baseline = AutoModelForCausalLM.from_pretrained(base_model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
print("loaded all packages")
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'
print("Printing Device...")
print(device)
print("loading model....")
model_id = "mistralai/Mistral-7B-Instruct-v0.1"
hf_auth = '<your-hf-token>'  # redacted; see the maintainer's note above about rotating this token
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_auth#,
#    load_in_8bit=False    
)
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',  # fixed typo: was "bnb_4bit_quant_tyoe"
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
#    load_in_8bit=False,
    device_map='auto',
    use_auth_token=hf_auth,
    offload_folder="save_folder"
)
# Load model directly
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id, use_auth_token=hf_auth)
generate_text = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    return_full_text=True, 
    task='text-generation',
    temperature=0.1,  
    max_new_tokens=4096,  
    repetition_penalty=1.1 
)
print("loaded model")
llm = HuggingFacePipeline(pipeline=generate_text)
younesbelkada commented 1 year ago

I see `PyTorch Version (GPU?) - xxx (False)`, so maybe your bitsandbytes install is broken because the CUDA / GPU hardware is not being detected properly. Are you able to run `!nvidia-smi` inside your notebook, and what is its outcome?
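For reference, a small check (added here as a sketch, not from the thread) of whether PyTorch itself sees a CUDA device, which is what bitsandbytes ultimately relies on:

import torch

# bitsandbytes quantization needs a visible CUDA device
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device count:", torch.cuda.device_count())
    print("Device name:", torch.cuda.get_device_name(0))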

Abhaycnvrg commented 1 year ago

So it doesn't work without a GPU, e.g. only on CPU?

younesbelkada commented 1 year ago

Yes, bnb does not work on CPU; you need access to a GPU. You can, for instance, use a free-tier Google Colab instance, which provides a decent 16GB NVIDIA T4 GPU.

Abhaycnvrg commented 1 year ago

What do you mean by bnb? Do you mean Mistral doesn't work on CPU?

younesbelkada commented 1 year ago

I meant bitsandbytes, i.e. all quantization features such as `load_in_8bit` or `load_in_4bit`.
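
To make that concrete, a minimal sketch (an illustration, not code from the thread; it assumes the same Mistral checkpoint as above) that only passes a BitsAndBytesConfig when a GPU is visible and falls back to a plain load otherwise:

import torch
import transformers

model_id = "mistralai/Mistral-7B-Instruct-v0.1"

if torch.cuda.is_available():
    # GPU path: 4-bit quantization via bitsandbytes
    bnb_config = transformers.BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = transformers.AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto"
    )
else:
    # CPU path: no bitsandbytes, plain unquantized weights
    model = transformers.AutoModelForCausalLM.from_pretrained(
        model_id, low_cpu_mem_usage=True
    )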

Abhaycnvrg commented 1 year ago

Okay, thanks! Can Mistral work without the quantisation code? I mean, we just want to run inference.

younesbelkada commented 1 year ago

On CPU yes, but it might be slow; please consider using Mistral-7B on a free-tier Google Colab instance with bitsandbytes 4-bit:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ybelkada/Mistral-7B-v0.1-bf16-sharded"
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True)

I advise using https://huggingface.co/ybelkada/Mistral-7B-v0.1-bf16-sharded, as its weights are split into smaller shards (~2GB each); otherwise loading the Mistral weights on Google Colab will lead to CPU OOM.
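
For context, checkpoints with smaller shards like that can be produced with save_pretrained; a minimal sketch (the output directory name is a placeholder and an already-loaded model is assumed):

# re-save an already-loaded model with smaller checkpoint shards, so low-RAM
# hosts never have to materialize one huge file at load time
model.save_pretrained("mistral-7b-resharded", max_shard_size="2GB")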

Abhaycnvrg commented 1 year ago

Thanks for this code, but can you point me to a tutorial post which does inference with Mistral on CPU only? We are using custom machines with limited scalability, so OOM should not be a problem.

younesbelkada commented 1 year ago

For CPU only

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ybelkada/Mistral-7B-v0.1-bf16-sharded"
model = AutoModelForCausalLM.from_pretrained(model_id, low_cpu_mem_usage=True)

That should do the trick. If you want to load the model in bfloat16:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ybelkada/Mistral-7B-v0.1-bf16-sharded"
model = AutoModelForCausalLM.from_pretrained(model_id, low_cpu_mem_usage=True, torch_dtype=torch.bfloat16)
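
A possible continuation (a sketch, not part of the original comment; the prompt text and generation settings are placeholders) showing how inference could then be run on CPU:

tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Explain what a transformer model is in one sentence.", return_tensors="pt")
# greedy decoding by default; keep max_new_tokens small on CPU to limit latency
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
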
Abhaycnvrg commented 1 year ago

Hey @younesbelkada, I tried with GPU but got this error:

RuntimeError: Failed to import transformers.models.mistral.modeling_mistral because of the following error (look up to see its traceback): Failed to import transformers.integrations.peft because of the following error (look up to see its traceback): cannot import name 'dispatch_model' from 'accelerate' (unknown location)

transformers = 4.36.dev0, 4.35, 4.35.2; accelerate = 0.25.0.dev0; bitsandbytes = 0.41.2.post2; python = 3.10.6

younesbelkada commented 1 year ago

Hello @Abhaycnvrg, dispatch_model should still be in the accelerate init: https://github.com/huggingface/accelerate/blob/main/src/accelerate/__init__.py#L8. Can you share how you installed accelerate? Can you also try to uninstall accelerate and re-install it? `pip uninstall accelerate && pip install -U accelerate`. Could you also share the full error traceback in case the error still persists?
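
One quick way to debug this (a suggestion added here, not from the original reply) is to confirm which accelerate installation Python actually imports and whether dispatch_model resolves from it:

import accelerate
from accelerate import dispatch_model  # raises ImportError if the install is broken

# the module path exposes stale or duplicate installs shadowing the fresh one
print(accelerate.__version__, accelerate.__file__)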

Abhaycnvrg commented 1 year ago

I get the same error after following these commands: `pip uninstall accelerate && pip install -U accelerate`

Abhaycnvrg commented 1 year ago

My Python version is 3.10.6; is that the cause of the problem? Also, can you suggest a container image with both Python 3.9.16 and CUDA installed in it, so that I can test with a GPU? @younesbelkada and @amyeroberts

Abhaycnvrg commented 1 year ago

Which torch and torchvision versions do I need to use with Mistral and a GPU?

Abhaycnvrg commented 1 year ago

Also, can you suggest which container image (from NVIDIA or Docker Hub) I should use for running this? Are these okay: https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-23-03.html#rel-23-03 or https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-23-07.html#rel-23-07?

amyeroberts commented 1 year ago

@Abhaycnvrg transformers officially supports Python 3.8 and above. You can find images by searching Docker Hub; the Hugging Face one for pytorch-gpu is here. The compatible versions of library packages can be found in setup.py. Running `pip install transformers` will find and install the compatible packages, and warn you if this isn't possible based on different libraries' requirements. You don't need torchvision for Mistral; it's an LLM.
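
If it helps, transformers also records the dependency pins it was released with; a small sketch (an addition, relying on the internal dependency_versions_table module, which is not a public API and may change between releases):

# the version specifiers transformers itself declares for key dependencies
from transformers.dependency_versions_table import deps

for name in ("accelerate", "tokenizers", "torch"):
    print(name, "->", deps.get(name))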

Abhaycnvrg commented 1 year ago

Thanks @amyeroberts, so if I do the following:

  1. REQUIREMENTS FILE : python == 3.9.16 / transformers == 4.35.0 / accelerate 0.25.dev0 (from source) / bitsandbytes 0.41.1
  2. container image: https://hub.docker.com/r/huggingface/transformers-pytorch-gpu
  3. and this code here
    
    from torch import cuda, bfloat16
    import transformers
    from transformers import StoppingCriteria, StoppingCriteriaList
    import torch
    from langchain.document_loaders import UnstructuredFileLoader
    from langchain.chains.summarize import load_summarize_chain
    from langchain.chains.question_answering import load_qa_chain
    from langchain.llms import HuggingFacePipeline
    from transformers import AutoTokenizer, AutoModelForCausalLM
    from langchain import PromptTemplate
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    import accelerate
    import bitsandbytes
    #base_model_id = "mistralai/Mistral-7B-Instruct-v0.1"
    #baseline = AutoModelForCausalLM.from_pretrained(base_model_id, device_map="auto")
    #tokenizer = AutoTokenizer.from_pretrained(base_model_id)
    print("loaded all packages")
    device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'
    print("Printing Device...")
    print(device)
    print("loading model....")
    model_id = "mistralai/Mistral-7B-Instruct-v0.1"
    model_id = "ybelkada/Mistral-7B-v0.1-bf16-sharded"
    hf_auth = "<your-hf-token>"  # placeholder; token redacted
    model_config = transformers.AutoConfig.from_pretrained(
        model_id,
        use_auth_token=hf_auth,
        # load_in_8bit=False
    )
    bnb_config = transformers.BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type='nf4',  # fixed typo: was "bnb_4bit_quant_tyoe"
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16
    )

    model = transformers.AutoModelForCausalLM.from_pretrained(
        model_id,
        trust_remote_code=True,
        config=model_config,
        quantization_config=bnb_config,
        # load_in_8bit=False,
        device_map='auto',
        use_auth_token=hf_auth,
        offload_folder="save_folder"
    )

    # Load model directly
    tokenizer = transformers.AutoTokenizer.from_pretrained(model_id, use_auth_token=hf_auth)
    generate_text = transformers.pipeline(
        model=model, tokenizer=tokenizer,
        return_full_text=True,
        task='text-generation',
        temperature=0.1,
        max_new_tokens=4096,
        repetition_penalty=1.1
    )
    print("loaded model")
    llm = HuggingFacePipeline(pipeline=generate_text)


should it work?
amyeroberts commented 1 year ago

As you're running the dev version of accelerate, I can't guarantee that it will be compatible with all of the other packages. Why not run it and find out? 🤷‍♀️

jitender-cnvrg commented 1 year ago

Sure, we will try, but if you have any other recommended configs, please do send them.

younesbelkada commented 1 year ago

Hi @jitender-cnvrg @Abhaycnvrg, thanks a lot for iterating. Loading a model in 4-bit / 8-bit should work out of the box on a simple free-tier Google Colab instance; make sure to select T4 as the runtime type. I made a quick example here: https://colab.research.google.com/drive/1zia3Q9FXhNHOhdwA9p8zD4qgPEWkZvHl?usp=sharing and made sure it works.

github-actions[bot] commented 11 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.