I'm running the following code on 2x 4090s and the model outputs gibberish. Running it on a single 4090 works well.
I've tried both Llama-2-7B-chat-GPTQ and Llama-2-70B-chat-GPTQ with the gptq-4bit-128g-actorder_True branch.
from exllama.model import ExLlama, ExLlamaCache, ExLlamaConfig, ExLlamaDeviceMap
from exllama.tokenizer import ExLlamaTokenizer
from exllama.generator import ExLlamaGenerator
import os, glob
# Directory containing model, tokenizer, generator
model_directory = "/app/TheBloke/Llama-2-7B-chat-GPTQ"
# Locate files we need within that directory
tokenizer_path = os.path.join(model_directory, "tokenizer.model")
model_config_path = os.path.join(model_directory, "config.json")
st_pattern = os.path.join(model_directory, "*.safetensors")
model_path = glob.glob(st_pattern)[0]
# Create config, model, tokenizer and generator
config = ExLlamaConfig(model_config_path) # create config from config.json
config.set_auto_map('2,2')                              # split layers across both GPUs (GB of VRAM to allocate per device)
config.model_path = model_path # supply path to model weights file
model = ExLlama(config) # create ExLlama instance and load the weights
tokenizer = ExLlamaTokenizer(tokenizer_path) # create tokenizer from tokenizer model file
cache = ExLlamaCache(model) # create cache for inference
generator = ExLlamaGenerator(model, tokenizer, cache) # create generator
# Configure generator
generator.disallow_tokens([tokenizer.eos_token_id])     # never sample EOS, so generation always runs to max_new_tokens
generator.settings.temperature = 0.7
# Produce a simple generation
# prompt = "Once upon a time,"
prompt = (
"[INST] <<SYS>>\n"
"You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."  # noqa: E501
"\n<</SYS>>\n\n"
"Who is president of US? [/INST]"
)
print(prompt, end="")
output = generator.generate_simple(prompt, max_new_tokens=200)
print(output[len(prompt):])
I'm using the pip package: exllama @ git+https://github.com/jllllll/exllama@423144a10fb22603e4de82abb4c965805cefd54f
There's also, in case anyone else is seeing similar issues, a driver/Torch issue that will break multi-GPU setups; it can be worked around with the --gpu_peer_fix argument.
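For anyone loading the model from Python rather than through the example scripts, here is a minimal sketch of applying the same workaround on the config object. It assumes the ExLlamaConfig attribute behind the --gpu_peer_fix argument is named gpu_peer_fix; treat that name as an assumption and check your exllama version.

# Hedged sketch: gpu_peer_fix is assumed to be the config attribute corresponding
# to the --gpu_peer_fix CLI argument (routes inter-GPU tensor copies through system RAM
# instead of relying on direct peer-to-peer transfers).
config = ExLlamaConfig(model_config_path)
config.model_path = model_path
config.set_auto_map('2,2')          # split layers across both GPUs as before
config.gpu_peer_fix = True          # work around broken GPU peer-to-peer transfers
model = ExLlama(config)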