meta-llama / llama

Inference code for Llama models

Slow inference and poor performance compared to Google Flan-UL2 #506

Open lachlancahill opened 1 year ago

lachlancahill commented 1 year ago

I have successfully run the 7b-chat model on my RTX-4070, but I am surprised at how long it takes to generate responses. I have tested it on a set of feature extraction tasks: I feed it a conversation transcript and ask it to answer True or False as to whether the conversation includes a given feature (e.g. a complaint). Google's Flan-UL2 model has 20B parameters and answers most questions in under 10 seconds (with 98% accuracy), but llama-2-7b-chat is taking 60+ seconds per question and scoring less than 15% accuracy. The poor accuracy could be attributed to the parameter-count disadvantage (I haven't been able to test the 13b model as I only have 1 GPU), but I am very surprised by the slow inference time. Does anybody know what could be causing this? Code below.

import torch
from llama import Llama
import os
from tqdm import tqdm
import pandas as pd
from datetime import datetime

def main_process():

    # define parameters
    ckpt_dir = 'llama-2-7b-chat'
    tokenizer_path = r'tokenizer.model'
    temperature = 0.0
    top_p = 0.9
    max_seq_len = 3000
    max_batch_size = 1
    max_gen_len = None

    # set env variables:
    os.environ['RANK'] = '0'
    os.environ['WORLD_SIZE'] = '1'
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'

    input_data_file = r"C:\Users\path\to\test\questions\test_questions.xlsx"

    # Load the data
    input_data = pd.read_excel(input_data_file, sheet_name='input_data')

    generator = Llama.build(
        ckpt_dir=ckpt_dir,
        tokenizer_path=tokenizer_path,
        max_seq_len=max_seq_len,
        max_batch_size=max_batch_size,
        model_parallel_size=1,
    )

    system_responses, time_taken = [], []
    for _, row in tqdm(input_data.iterrows(), desc='iterating text dialogs', total=len(input_data)):

        start = datetime.now()

        system_prompt, input_string, correct_answer = row[['system_prompt', 'input_string', 'correct_answer']]

        dialogs = [
            [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": input_string},
            ]
        ]

        results = generator.chat_completion(
            dialogs,  
            max_gen_len=max_gen_len,
            temperature=temperature,
            top_p=top_p,
        )

        system_response = results[0]['generation']['content']

        system_responses.append(system_response)

        print(f"Response: {system_response}, Correct Answer: {correct_answer}")

        end = datetime.now()
        print(f"Time taken: {end - start}")

        time_taken.append(end - start)

    input_data['system_response'] = system_responses
    input_data['time_taken'] = time_taken

    input_data.to_excel(f'evaluation_run_{ckpt_dir}.xlsx', sheet_name='output_data')

if __name__ == '__main__':
    main_process()
MrAndersen101 commented 1 year ago

I am seeing similar performance issues. Wonder if anyone has any recommendations.

lachlancahill commented 1 year ago

> I am seeing similar performance issues. Wonder if anyone has any recommendations.

I haven't been able to find a workaround, but I suspect this is because the Google Flan-UL2 model uses automatic device mapping, so part of the model is kept in system RAM and run on the CPU while the rest runs on the GPU. I think the 7B Llama model is using the GPU only and can't fit entirely on an RTX 4070, so parts of the model are loaded onto the GPU, computed, then swapped out for the remaining layers, meaning the weights are effectively being reloaded constantly throughout inference. I believe Google may be using the accelerate module to achieve the prioritised load to GPU, with the remainder computed in RAM on the CPU, though I can't say for sure. Hope this is helpful.
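
For illustration, here is a rough sketch of what that kind of accelerate-backed loading looks like through the transformers API, where layers that don't fit on the GPU stay in CPU RAM instead of being swapped in and out. The checkpoint name and memory limits below are assumptions of mine, not taken from this thread:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # assumed Hugging Face checkpoint name

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",                        # let accelerate decide GPU vs CPU placement
    max_memory={0: "11GiB", "cpu": "30GiB"},  # illustrative caps: fill the GPU first, overflow to RAM
)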

muellerzr commented 1 year ago

Correct, that is exactly what is happening: a model this large requires device offloading on such a small GPU. How much RAM do you have available? Ideally you want enough RAM to load the model fully on the CPU and avoid having to use storage offloading, as that is the slowest option of all.
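
If it is useful, here is a quick sketch (assuming accelerate and transformers are installed; the checkpoint name and memory limits are illustrative) of how to check, before loading any weights, whether the model would fit in GPU plus CPU memory or would spill to storage:

from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("meta-llama/Llama-2-7b-chat-hf")  # assumed checkpoint name
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config)  # builds the architecture without allocating weights

device_map = infer_auto_device_map(
    empty_model,
    max_memory={0: "11GiB", "cpu": "30GiB"},  # e.g. a 12 GB RTX 4070 plus 32 GB of system RAM
)
print(set(device_map.values()))  # any "disk" entries mean storage offloading would be needed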

lachlancahill commented 11 months ago

I've looked into this and it does seem to be the accelerate module that was responsible for the faster inference.

To get better performance out of Llama 2 I've switched to using the model through the Huggingface transformers library.

Installed dependencies:

pip install transformers accelerate
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Ran the model in a pipeline using device_map='auto', which activates accelerate's automatic device placement:

from transformers import pipeline
import torch

llm_name = "meta-llama/Llama-2-7b-hf"

# device_map='auto' lets accelerate place layers on the GPU and spill the rest to CPU RAM
pipe = pipeline("text-generation", model=llm_name, device_map='auto', torch_dtype=torch.float16)

prompt = "once upon a time"

# do_sample=True is needed for temperature to have an effect
response = pipe(prompt, max_new_tokens=10, do_sample=True, temperature=0.5)

response_text = response[0]['generated_text']

print(response_text)
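
One extra sanity check (assuming a recent transformers/accelerate version that records it) is to print the device map the pipeline ended up with, to see which layers landed on the GPU and which were offloaded to CPU:

print(pipe.model.hf_device_map)  # e.g. shows which layers sit on GPU 0 vs 'cpu'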