marella / ctransformers

Python bindings for Transformer models implemented in C/C++ using the GGML library.
MIT License

Support for Mistral #149

Open · Ananderz opened this issue 9 months ago

Ananderz commented 9 months ago

Mistral has released a new set of 7B foundation models that are claimed to beat all 13B Llama 2 models on benchmarks.

https://huggingface.co/mistralai/Mistral-7B-v0.1 https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1

Is this easy to implement?

jagilley commented 9 months ago

Should work already; I've gotten it working with:

from ctransformers import AutoModelForCausalLM

# Set gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
llm = AutoModelForCausalLM.from_pretrained("TheBloke/Mistral-7B-Instruct-v0.1-GGUF", model_file="mistral-7b-instruct-v0.1.Q2_K.gguf", model_type="mistral", gpu_layers=50)

for text in llm("<s>[INST] Write a Python program that prints every even number from 5 to 500. [/INST]", stream=True):
    print(text, end="", flush=True)
Wolfsauge commented 9 months ago

I'm seeing the following difference with regard to the new special ChatML tokens.

https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca with transformers gives:

from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
import torch

model_id = "C:\\oobabooga\\text-generation-webui\\models\\Open-Orca_Mistral-7B-OpenOrca"
device = torch.device("cuda:0")

tokenizer = AutoTokenizer.from_pretrained(model_id)

# llm = AutoModelForCausalLM.from_pretrained(model_id).to(device)
print(tokenizer.encode("<|im_start|>"))
print(tokenizer.encode("<|im_end|>"))
(transformers) C:\Users\ns\Documents\Troubleshoot4>python test.py
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 2/2 [00:05<00:00,  2.79s/it]
[1, 32001]
[1, 32000]

While https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-GGUF with ctransformers gives:

from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "C:\\Users\\ns\\Documents\\Troubleshoot4\\mistral-7b-openorca.Q4_K_M.gguf",
    model_file="mistral-7b-openorca.Q4_K_M.gguf",
    model_type="mistral",
    gpu_layers=50,
)

print(llm.tokenize("<|im_start|>"))
print(llm.tokenize("<|im_end|>"))
(ctransformers) C:\Users\ns\Documents\Troubleshoot4>python test3.py
[1, 523, 28766, 321, 28730, 2521, 28766, 28767]
[1, 523, 28766, 321, 28730, 416, 28766, 28767]
Wolfsauge commented 9 months ago

I was making a mistake here, mixing up encoding and tokenizing. I don't think it's a bug in ctransformers or the GGUF model file.

I was trying to compare both models with regard to their treatment of the ChatML tokens in the input. Still trying to wrap my head around what is happening here. Sorry for the noise in this issue.
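
The difference comes down to the HF tokenizer treating the ChatML markers as single added special tokens, while llm.tokenize in ctransformers runs the raw string through the GGUF BPE vocabulary and splits it into pieces. A minimal sketch using transformers only, assuming the Open-Orca/Mistral-7B-OpenOrca tokenizer; the ids are the ones reported above:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Open-Orca/Mistral-7B-OpenOrca")

# "<|im_start|>" / "<|im_end|>" are *added* special tokens, so each maps to a single id.
print(tokenizer.convert_tokens_to_ids("<|im_start|>"))  # 32001
print(tokenizer.convert_tokens_to_ids("<|im_end|>"))    # 32000

# encode() additionally prepends the BOS token (id 1), giving [1, 32001] / [1, 32000]
# as in the output above; pass add_special_tokens=False to drop the BOS.
print(tokenizer.encode("<|im_start|>", add_special_tokens=False))  # [32001]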

ZeroCool2u commented 9 months ago

Even after installing ctransformers from the main branch at the latest commit, I can't get the mistral model type to be recognized. Is there something I'm missing to make your example work? This is the error I get: "Model type 'mistral' is not supported."

It seems like I'm missing something obvious, but not sure what.
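
One thing worth ruling out is an installed release of ctransformers that predates the mistral model type. A quick standard-library check of the installed version (the exact release that added mistral support is not stated in this thread):

from importlib.metadata import version

# Prints the installed ctransformers version; compare it against the project's
# release notes to confirm it includes mistral/GGUF support.
print(version("ctransformers"))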

CHesketh76 commented 9 months ago

> Should work already; I've gotten it working with: [the Mistral-7B-Instruct example from jagilley's comment above]

Did you install ctransformers[cuda]? I am unable to run this when gpu_layers=50 is present. I am testing this in a Colab environment, not locally.

ZeroCool2u commented 9 months ago

No, I'm actually running CPU-only, which I thought would be better supported; I can try GPU if not. I just did pip install ctransformers, but I also tried pip installing directly from the main branch on GitHub.

ZeroCool2u commented 8 months ago

Okay, so there's definitely something wrong with offline support in ctransformers. It works when you let ctransformers download the model and store it in the default HF cache directory, but it doesn't work if you download the model manually and point to a local file. It seems that letting it download the model is what allows it to get past the otherwise-unrecognized model type of mistral. The two invocation styles are sketched below.
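
For reference, the two invocation styles being compared look roughly like this (the local path is hypothetical; substitute wherever the GGUF file was downloaded):

from ctransformers import AutoModelForCausalLM

# Works: ctransformers resolves the repo on the Hugging Face Hub and caches it.
llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
    model_file="mistral-7b-instruct-v0.1.Q2_K.gguf",
    model_type="mistral",
    gpu_layers=0,
)

# Reported to fail here: pointing at a manually downloaded local file
# ("Model type 'mistral' is not supported"). Path is hypothetical.
llm = AutoModelForCausalLM.from_pretrained(
    "/path/to/mistral-7b-instruct-v0.1.Q2_K.gguf",
    model_type="mistral",
    gpu_layers=0,
)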

eugeneie commented 5 months ago

Confirming that the following works:

from ctransformers import AutoModelForCausalLM
llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
    model_file="mistral-7b-instruct-v0.1.Q2_K.gguf",
    model_type="mistral",
    gpu_layers=0)

But the same doesn't work for Mixtral-8x7B-v0.1-GGUF files, i.e., this fails:

llm = AutoModelForCausalLM.from_pretrained(
    'TheBloke/Mixtral-8x7B-v0.1-GGUF',
    model_file='mixtral-8x7b-v0.1.Q2_K.gguf',
    model_type='mistral',
    gpu_layers=0)

with error:

RuntimeError: Failed to create LLM 'mistral' from '.../models--TheBloke--Mixtral-8x7B-v0.1-GGUF/blobs/27e3909257480e313a79ff63a1168df5ac7016917add8ad56b5dc489f9215f13'.

Is it because ctransformers doesn't support SMoE (sparse mixture of experts) yet?

Ananderz commented 5 months ago

> Confirming that the following works: [...]
> But the same doesn't work for Mixtral-8x7B-v0.1-GGUF files [...]
> Is it because ctransformers doesn't support SMoE yet?

It works; try setting the model type to llama and it should work. There is no difference between setting model_type to llama and setting it to mistral.
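
For clarity, the suggested workaround is the same call as above with the model type swapped; a sketch (note that the next comment reports the same error for Mixtral regardless):

from ctransformers import AutoModelForCausalLM

# Same Mixtral GGUF call as above, with model_type set to "llama" instead of "mistral".
llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mixtral-8x7B-v0.1-GGUF",
    model_file="mixtral-8x7b-v0.1.Q2_K.gguf",
    model_type="llama",
    gpu_layers=0,
)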

eugeneie commented 5 months ago

Re: trying to set it to 'llama', same error:

RuntimeError: Failed to create LLM 'llama' from '../models--TheBloke--Mixtral-8x7B-v0.1-GGUF/blobs/27e3909257480e313a79ff63a1168df5ac7016917add8ad56b5dc489f9215f13'.

I checked the LLM class and understood that what actually matters is model_type='gguf', since both are GGUF-encoded files.

I verified this locally by downloading both GGUF files from HF and then running this snippet:

from ctransformers import AutoModelForCausalLM
from transformers import MistralForCausalLM

model_path_1 = '~/mistral/Mistral-7B-v0.1-GGUF/mistral-7b-v0.1.Q8_0.gguf'
# model_path_2 = '~/mistral/Mixtral-8x7B-v0.1-GGUF/mixtral-8x7b-v0.1.Q2_K.gguf'  # swap in for model_path_1 to reproduce the failure below

llm = AutoModelForCausalLM.from_pretrained(
    model_path_1,
    model_type=None,  # I've also tried to set this to llama or mistral, but class LLM will detect it as is_gguf(...)
    gpu_layers=0)

Using model_path_1 is fine, but model_path_2 fails. The only difference is that model_path_1 is the older 7B model and model_path_2 is the new Mixtral model (which likely depends on Flash Attention 2)?

RuntimeError: Failed to create LLM 'gguf' from '~/mistral/Mixtral-8x7B-v0.1-GGUF/mixtral-8x7b-v0.1.Q2_K.gguf'

Debug info:

$ pip show transformers
Name: transformers
Version: 4.37.0.dev0