TNT3530 opened 1 month ago
Though it cannot be stopped, I want to know what the inference speed is on 4x MI100. Thanks a lot.
Just tested 0.4.1, 0.5.2, and 0.5.3.post1 with Mistral Large Instruct 2407 GPTQ and the same thing happens. It's interesting how three different model architectures all have the same issue when quantized.
After reviewing the output of Llama 3.1 70B and Mistral Large via streaming and lowering the max response length, it seems generation continues due to the lack of a stop token. Here is the nonsense being generated:
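For what it's worth, the finish_reason field in the server's response should distinguish a real stop from simply exhausting the token budget. A minimal sketch, assuming the OpenAI-compatible endpoint is on localhost:8000 and using a placeholder model name:

```python
# Minimal sketch: check whether vLLM reports "stop" (EOS/stop string hit)
# or "length" (max_tokens exhausted). Endpoint and model name are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="<model>",
    messages=[{"role": "user", "content": "What is the main benefit of AI Assistants?"}],
    max_tokens=128,
)
print(resp.choices[0].finish_reason)  # "length" here would confirm no stop token
```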
Loading these models with auto_gptq using the following script:
```python
import torch
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

model_path = "<model>"

tokenizer = AutoTokenizer.from_pretrained(model_path)

# Load the GPTQ checkpoint spread across all available GPUs
model = AutoGPTQForCausalLM.from_quantized(
    model_path,
    use_safetensors=True,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
    disable_exllama=True,
    use_fast=True,
    use_triton=True,
)

# Build a chat-formatted prompt with the model's own template
prompt = [
    {"role": "system", "content": "You are a helpful assistant that responds to user inquiries."},
    {"role": "user", "content": "What is the main benefit of AI Assistants?"},
]

inputs = tokenizer.apply_chat_template(
    prompt,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to("cuda")

outputs = model.generate(**inputs, do_sample=True, max_new_tokens=128)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```
Both Llama 3.1 and Mistral Large had good results in multiple trial runs. Cohere sadly isn't supported in auto_gptq, so Command R+ couldn't be tested.
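Since auto_gptq stops cleanly here, it may also be worth comparing what the quantized checkpoints actually declare as their end-of-sequence token. A minimal sketch, assuming the repos ship a generation_config.json; model_path is the same placeholder as above:

```python
# Minimal sketch: compare EOS settings between tokenizer, model config,
# and generation_config.json for a quantized checkpoint. A mismatch here
# could explain why vLLM never sees a stop token.
from transformers import AutoConfig, AutoTokenizer, GenerationConfig

model_path = "<model>"

tok = AutoTokenizer.from_pretrained(model_path)
cfg = AutoConfig.from_pretrained(model_path)
gen = GenerationConfig.from_pretrained(model_path)  # errors if the repo has no generation_config.json

print("tokenizer eos:", tok.eos_token, tok.eos_token_id)
print("config eos_token_id:", cfg.eos_token_id)
print("generation_config eos_token_id:", gen.eos_token_id)
```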
Your current environment
My Environment
OpenAI API launched using this command:
Docker launched using this command:
🐛 Describe the bug
When loading these models (Command R Plus, Llama 3.1 70B, Llama 3.1 70B Alternate) using a Docker image built from source as of 2024-07-24, every prompt continues to generate until (I assume) hitting the token limit. It does this regardless of passed parameters like temperature. This has been happening with Command R+ since its release, with versions ranging from 0.4.1 to 0.5.3.post1.
Normal Command-R works using this model. Llama 3.1 Instruct 8B straight from Meta works as well.
Here is a sample of what Command R Plus generates:

I can't give a similar sample for Llama, since the script used to generate the above throws an unrelated error about `freeze_support()` and all API calls just time out with no response. I tried force-updating Transformers via pip in the container, but it did not fix the issue.
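As a stopgap rather than a fix, it might be possible to force the chat end-of-turn marker as an explicit stop id. A minimal sketch with vLLM's offline API; the 4-way tensor parallelism matches my setup, but the model path and the `<|END_OF_TURN_TOKEN|>` name (Command R+'s end-of-turn marker; Llama 3.1 would use `<|eot_id|>` instead) are assumptions:

```python
# Minimal sketch (a workaround, not a fix): pass the chat end-of-turn token
# explicitly via stop_token_ids, in case the checkpoint's EOS id is mismatched.
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_path = "<model>"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_path)

# "<|END_OF_TURN_TOKEN|>" assumed for Command R+; swap in "<|eot_id|>" for Llama 3.1
end_turn_id = tokenizer.convert_tokens_to_ids("<|END_OF_TURN_TOKEN|>")

llm = LLM(model=model_path, tensor_parallel_size=4)
params = SamplingParams(max_tokens=128, stop_token_ids=[end_turn_id])

# Apply the chat template so the prompt matches what the server would send
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is the main benefit of AI Assistants?"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(llm.generate([prompt], params))
```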
This is on a 4x AMD Instinct MI100 system with a GPU bridge.