turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

Running Llama2 on multiple GPUs outputs gibberish #269

Closed: mirth closed this issue 10 months ago

mirth commented 10 months ago

I'm running the following code on 2x 4090s, and the model outputs gibberish. Running it on a single 4090 works fine. I've tried both Llama-2-7B-chat-GPTQ and Llama-2-70B-chat-GPTQ with the gptq-4bit-128g-actorder_True branch.

from exllama.model import ExLlama, ExLlamaCache, ExLlamaConfig, ExLlamaDeviceMap
from exllama.tokenizer import ExLlamaTokenizer
from exllama.generator import ExLlamaGenerator
import os, glob

# Directory containing model, tokenizer, generator

model_directory = "/app/TheBloke/Llama-2-7B-chat-GPTQ"

# Locate files we need within that directory

tokenizer_path = os.path.join(model_directory, "tokenizer.model")
model_config_path = os.path.join(model_directory, "config.json")
st_pattern = os.path.join(model_directory, "*.safetensors")
model_path = glob.glob(st_pattern)[0]

# Create config, model, tokenizer and generator

config = ExLlamaConfig(model_config_path)               # create config from config.json
config.set_auto_map('2,2')                              # auto-split layers across two GPUs (per-device VRAM budget, in GB)
config.model_path = model_path                          # supply path to model weights file

model = ExLlama(config)                                 # create ExLlama instance and load the weights
tokenizer = ExLlamaTokenizer(tokenizer_path)            # create tokenizer from tokenizer model file

cache = ExLlamaCache(model)                             # create cache for inference
generator = ExLlamaGenerator(model, tokenizer, cache)   # create generator

# Configure generator

generator.disallow_tokens([tokenizer.eos_token_id])

generator.settings.temperature = 0.7

# Produce a simple generation

# prompt = "Once upon a time,"
prompt = (
    "[INST] <<SYS>>"
    "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."  # noqa: E501
    "<</SYS>>"
    "Who is president of US? [/INST]"
)
print(prompt, end="")

output = generator.generate_simple(prompt, max_new_tokens = 200)

print(output[len(prompt):])

I'm using the pip package: exllama @ git+https://github.com/jllllll/exllama@423144a10fb22603e4de82abb4c965805cefd54f

mirth commented 10 months ago

Updating the NVIDIA drivers fixed it.
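
For anyone debugging the same symptom: one likely culprit is broken GPU peer-to-peer access (which is also what the --gpu_peer_fix workaround mentioned below addresses). The sketch below is an editorial addition, not part of the original comment; it only uses standard torch.cuda calls to check whether both cards are visible and can access each other directly.

import torch

# List the GPUs visible to the process (should show both 4090s here).
n = torch.cuda.device_count()
print(f"Visible CUDA devices: {n}")
for i in range(n):
    print(f"  cuda:{i} -> {torch.cuda.get_device_name(i)}")

# Check whether direct peer-to-peer access works in both directions.
if n >= 2:
    print("P2P 0 -> 1:", torch.cuda.can_device_access_peer(0, 1))
    print("P2P 1 -> 0:", torch.cuda.can_device_access_peer(1, 0))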

turboderp commented 10 months ago

If anyone else is seeing similar issues: there is also a driver/Torch issue that will break multi-GPU setups, and it can be worked around with the --gpu_peer_fix argument.
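
For reference, when loading through the Python API (as in the snippet above) rather than the bundled scripts, the same workaround appears to be exposed as a gpu_peer_fix attribute on ExLlamaConfig. A minimal sketch, assuming that attribute name mirrors the CLI flag:

import os, glob
from exllama.model import ExLlama, ExLlamaConfig

model_directory = "/app/TheBloke/Llama-2-7B-chat-GPTQ"          # same directory as above
model_config_path = os.path.join(model_directory, "config.json")
model_path = glob.glob(os.path.join(model_directory, "*.safetensors"))[0]

config = ExLlamaConfig(model_config_path)
config.model_path = model_path
config.set_auto_map('2,2')                                      # split across both GPUs as before
config.gpu_peer_fix = True                                      # assumed attribute matching --gpu_peer_fix; avoids direct GPU-to-GPU copies

model = ExLlama(config)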