mosaicml / llm-foundry

LLM training code for Databricks foundation models
https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
Apache License 2.0

Unexpected results when using FasterTransformer #341

Closed Louis-y-nlp closed 1 year ago

Louis-y-nlp commented 1 year ago

Thank you for your great work. I converted an MPT-7B-Instruct model to the FT format and successfully ran inference, but I obtained some unexpected results (usually garbled text, though not completely unrelated to the prompt), such as:

Explain to me the difference between nuclear fission and fusion.
'Nuclear Fission involves splitting atoms into smaller pieces, while Nuclear Fusion occurs when two or more atomic nuclei merge together in order for energy release.<br>Fusion happens naturally within stars as well but can also be harnessed by humans through reactors such like tokamak\'s which use magnetic fields instead of lasers<p><strong style="font-weight:bold;"></ strong></ p>&nbsp; <em >  </ em>.&lt;< /li &gt;&amp ;'

I assure you that I used the correct prompt format, and most of the parameters in the inference code were set to their default values. I'm not sure where the issue lies. I would greatly appreciate it if you could provide a correctly converted and tested model, so I can determine whether the problem lies with my code or with the converted model. Additionally, here is my demo script.

import argparse
import configparser
import os
import sys

import torch
from torch.nn.utils.rnn import pad_sequence
from transformers import AutoTokenizer

dir_path = os.path.dirname(os.path.realpath(__file__))
sys.path.append(os.path.join(dir_path, '../../..'))
import examples.pytorch.gpt.utils.gpt_token_encoder as encoder
from examples.pytorch.gpt.utils import comm, gpt_decoder
from examples.pytorch.gpt.utils.parallel_gpt import ParallelGPT
import time

INSTRUCTION_KEY = "### Instruction:"
RESPONSE_KEY = "### Response:"
END_KEY = "### End"
INTRO_BLURB = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
PROMPT_FOR_GENERATION_FORMAT = """{intro}
{instruction_key}
{instruction}
{response_key}
""".format(
    intro=INTRO_BLURB,
    instruction_key=INSTRUCTION_KEY,
    instruction="{instruction}",
    response_key=RESPONSE_KEY,
)
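# For reference, PROMPT_FOR_GENERATION_FORMAT.format(instruction=...) renders as:
#
#   Below is an instruction that describes a task. Write a response that appropriately completes the request.
#   ### Instruction:
#   <instruction text>
#   ### Response:
#
# The inner "{instruction}" placeholder is preserved above so it can be filled per request.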

parser = argparse.ArgumentParser()
parser.add_argument('--layer_num',
                    type=int,
                    default=32,
                    help='number of layers')
parser.add_argument('--head_num', type=int, default=32, help='head number')
parser.add_argument('--size_per_head',
                    type=int,
                    default=128,
                    help='size per head')
parser.add_argument('--vocab_size',
                    type=int,
                    default=50432,
                    help='vocab size')
parser.add_argument('--tensor_para_size',
                    type=int,
                    default=1,
                    help='tensor parallel size')
parser.add_argument('--pipeline_para_size',
                    type=int,
                    default=1,
                    help='pipeline parallel size')
parser.add_argument('--ckpt_path',
                    type=str,
                    default='mpt-ft-7b/1-gpu',
                    help='path to the FT checkpoint file.')
parser.add_argument(
    '--tokenizer_name_or_path',
    type=str,
    default='EleutherAI/gpt-neox-20b',
    help=
    'Name of the tokenizer or the directory where the tokenizer file is located.'
)
parser.add_argument(
    '--lib_path',
    type=str,
    help=
    'path to the libth_transformer dynamic lib file (e.g., build/lib/libth_transformer.so).'
)
parser.add_argument('--start_id',
                    type=int,
                    default=0,
                    help='start token id.')
parser.add_argument('--end_id', type=int, default=0, help='end token id.')
parser.add_argument(
    '--max_seq_len',
    type=int,
    default=2048,
    help='max sequence length for position embedding table.')
parser.add_argument('--inference_data_type',
                    '--data_type',
                    type=str,
                    choices=['fp32', 'fp16', 'bf16'],
                    default='fp16')

parser.add_argument(
    '--disable_random_seed',
    dest='random_seed',
    action='store_false',
    help='Disable the use of random seed for sentences in a batch.')
parser.add_argument('--skip_end_tokens',
                    dest='skip_end_tokens',
                    action='store_false',
                    help='Whether or not to remove end tokens from outputs.')
parser.add_argument('--no_detokenize',
                    dest='detokenize',
                    action='store_false',
                    help='Skip detokenizing output token ids.')
parser.add_argument(
    '--int8_mode',
    type=int,
    default=0,
    choices=[0, 1],
    help='The level of quantization to perform.'
    ' 0: No quantization. All computation in data_type'
    ' 1: Quantize weights to int8, all compute occurs in fp16/bf16. Not supported when data_type is fp32'
)
parser.add_argument(
    '--weights_data_type',
    type=str,
    default='fp16',
    choices=['fp32', 'fp16'],
    help='Data type of FT checkpoint weights',
)
parser.add_argument(
    '--return_cum_log_probs',
    type=int,
    default=0,
    choices=[0, 1, 2],
    help='Whether to compute the cumulative log probability of sentences.'
    ' 0: do not return the cumulative log probs '
    ' 1: return the cumulative log probs of generated sequences'
    ' 2: return the cumulative log probs of sequences')
parser.add_argument('--shared_contexts_ratio',
                    type=float,
                    default=0.0,
                    help='Triggers the shared context optimization when '
                    'compact_size <= shared_contexts_ratio * batch_size. '
                    'A value of 0.0 deactivates the optimization.')
parser.add_argument(
    '--use_gpt_decoder_ops',
    action='store_true',
    help='Use separate decoder FT operators instead of end-to-end model op.'
)
parser.add_argument(
    '--no-alibi',
    dest='alibi',
    action='store_false',
    help='Do not use ALiBi (aka use_attention_linear_bias).')
parser.add_argument(
    '--layernorm_eps',
    type=float,
    default=1e-5,
    help='layernorm eps; the default is 1e-5 in PyTorch and 1e-6 in FT.')
args = parser.parse_args()

ckpt_config = configparser.ConfigParser()
ckpt_config_path = os.path.join(args.ckpt_path, 'config.ini')
if os.path.isfile(ckpt_config_path):
    ckpt_config.read(ckpt_config_path)
if 'gpt' in ckpt_config.keys():
    for args_key, config_key, func in [
        ('layer_num', 'num_layer', ckpt_config.getint),
        ('max_seq_len', 'max_pos_seq_len', ckpt_config.getint),
        ('weights_data_type', 'weight_data_type', ckpt_config.get),
        ('layernorm_eps', 'layernorm_eps', ckpt_config.getfloat),
        ('alibi', 'use_attention_linear_bias', ckpt_config.getboolean),
    ]:
        if config_key in ckpt_config['gpt'].keys():
            prev_val = args.__dict__[args_key]
            args.__dict__[args_key] = func('gpt', config_key)
            print(
                'Loading {} from config.ini,    previous: {},    current: {}'
                .format(args_key, prev_val, args.__dict__[args_key]))
        else:
            print('Not loading {} from config.ini'.format(args_key))
    for key in ['head_num', 'size_per_head', 'tensor_para_size']:
        if key in args.__dict__:
            prev_val = args.__dict__[key]
            args.__dict__[key] = ckpt_config.getint('gpt', key)
            print(
                'Loading {} from config.ini,    previous: {},    current: {}'
                .format(key, prev_val, args.__dict__[key]))
        else:
            print('Not loading {} from config.ini'.format(key))

layer_num = args.layer_num
head_num = args.head_num
size_per_head = args.size_per_head
vocab_size = args.vocab_size
tensor_para_size = args.tensor_para_size
pipeline_para_size = args.pipeline_para_size
start_id = args.start_id
end_id = args.end_id
max_seq_len = args.max_seq_len
weights_data_type = args.weights_data_type
return_cum_log_probs = args.return_cum_log_probs
return_output_length = return_cum_log_probs > 0
shared_contexts_ratio = args.shared_contexts_ratio
layernorm_eps = args.layernorm_eps
use_attention_linear_bias = args.alibi
has_positional_encoding = not args.alibi

print('\n=================== Arguments ===================')
for k, v in vars(args).items():
    print(f'{k.ljust(30, ".")}: {v}')
print('=================================================\n')

tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name_or_path)
torch.manual_seed(0)

comm.initialize_model_parallel(args.tensor_para_size,
                               args.pipeline_para_size)
rank = comm.get_rank()
device = comm.get_device()

# Prepare model.
print("prepare model")
t = time.time()
if not args.use_gpt_decoder_ops:
    gpt = ParallelGPT(head_num,
                      size_per_head,
                      vocab_size,
                      start_id,
                      end_id,
                      layer_num,
                      max_seq_len,
                      tensor_para_size,
                      pipeline_para_size,
                      lib_path=args.lib_path,
                      inference_data_type=args.inference_data_type,
                      int8_mode=args.int8_mode,
                      weights_data_type=weights_data_type,
                      layernorm_eps=layernorm_eps,
                      use_attention_linear_bias=use_attention_linear_bias,
                      has_positional_encoding=has_positional_encoding,
                      shared_contexts_ratio=shared_contexts_ratio)
    if not gpt.load(ckpt_path=args.ckpt_path):
        print(
            '[WARNING] Checkpoint file not found. Model loading is skipped.'
        )
else:
    gpt = gpt_decoder.Gpt(num_heads=head_num,
                          size_per_head=size_per_head,
                          num_layers=layer_num,
                          vocab_size=vocab_size,
                          start_id=start_id,
                          end_id=end_id,
                          tensor_para_size=tensor_para_size,
                          pipeline_para_size=pipeline_para_size,
                          lib_path=args.lib_path,
                          max_seq_len=max_seq_len,
                          int8_mode=args.int8_mode,
                          weights_data_type=args.weights_data_type)
    gpt.load(args.ckpt_path, args.inference_data_type)

print(f"prepare model done, time={time.time()-t}")

def gpt_generate_fn(start_ids, start_lengths, output_len, infer_decode_args):
    if not args.use_gpt_decoder_ops:
        return gpt(start_ids,
                   start_lengths,
                   output_len,
                   return_output_length=return_output_length,
                   return_cum_log_probs=return_cum_log_probs,
                   **infer_decode_args)
    else:
        return gpt.generate(
            input_token_ids=start_ids,
            input_lengths=start_lengths,
            gen_length=output_len,
            eos_token_id=end_id,
            return_output_length=return_output_length,
            return_log_probs=return_cum_log_probs,
            **infer_decode_args)

def generate_text(sents, top_k=0, top_p=0.92, temperature=0.5, output_len=2048,
                  beam_width=1, len_penalty=0., beam_search_diversity_rate=0.,
                  repetition_penalty=5., presence_penalty=0., min_length=0):
    # Inputs
    assert isinstance(sents, list)
    batch_size = len(sents)
    start_ids = [
        torch.tensor(tokenizer.encode(PROMPT_FOR_GENERATION_FORMAT.format(instruction=c)), dtype=torch.int32, device=device)
        for c in sents
    ]
    start_lengths = [len(ids) for ids in start_ids]
    start_ids = pad_sequence(start_ids, batch_first=True, padding_value=end_id)
    start_lengths = torch.IntTensor(start_lengths)

    if args.random_seed:
        random_seed_tensor = torch.randint(0,
                                           10000,
                                           size=[batch_size],
                                           dtype=torch.int64)
    else:
        random_seed_tensor = torch.zeros([batch_size], dtype=torch.int64)

    repetition_penalty_vec = None if repetition_penalty == 1. else repetition_penalty * torch.ones(
        batch_size, dtype=torch.float32)
    presence_penalty_vec = None if presence_penalty == 0. else presence_penalty * torch.ones(
        batch_size, dtype=torch.float32)

    infer_decode_args = {
        'beam_width':
            beam_width,
        'top_k':
            top_k * torch.ones(batch_size, dtype=torch.int32),
        'top_p':
            top_p * torch.ones(batch_size, dtype=torch.float32),
        'temperature':
            temperature * torch.ones(batch_size, dtype=torch.float32),
        'repetition_penalty':
            repetition_penalty_vec,
        'presence_penalty':
            presence_penalty_vec,
        'beam_search_diversity_rate':
            beam_search_diversity_rate *
            torch.ones(batch_size, dtype=torch.float32),
        'len_penalty':
            len_penalty * torch.ones(size=[batch_size], dtype=torch.float32),
        'bad_words_list':
            None,
        'min_length':
            min_length * torch.ones(size=[batch_size], dtype=torch.int32),
        'random_seed':
            random_seed_tensor
    }

    # Generate tokens.
    gen_outputs = gpt_generate_fn(start_ids, start_lengths, output_len, infer_decode_args)

    if rank == 0:
        if not args.use_gpt_decoder_ops:
            if return_cum_log_probs > 0:
                tokens_batch, _, cum_log_probs = gen_outputs
            else:
                tokens_batch, cum_log_probs = gen_outputs, None
        else:
            tokens_batch = gen_outputs['output_token_ids']
            cum_log_probs = gen_outputs[
                'cum_log_probs'] if return_cum_log_probs > 0 else None
        if cum_log_probs is not None:
            print('[INFO] Log probs of sentences:', cum_log_probs)

        outputs = []
        tokens_batch = tokens_batch.cpu().numpy()
        for i, tokens in enumerate(tokens_batch):
            for beam_id in range(beam_width):
                token = tokens[beam_id][
                    start_lengths[i]:]  # exclude context input from the output
                if args.skip_end_tokens:
                    token = token[token != end_id]
                output = tokenizer.decode(
                    token) if args.detokenize else ' '.join(
                        str(t) for t in token.tolist())
                outputs.append(output)
        return outputs

Here are my arguments:

=================== Arguments ===================
layer_num.....................: 32
head_num......................: 32
size_per_head.................: 128
vocab_size....................: 50432
tensor_para_size..............: 1
pipeline_para_size............: 1
ckpt_path.....................: /data/models/generate_models/mosaicml_mpt-7b-instruct/1-gpu
tokenizer_name_or_path........: /data/models/generate_models/mosaicml_mpt-7b-instruct
lib_path......................: /mnt/work/FasterTransformer/build/lib/libth_transformer.so
start_id......................: 0
end_id........................: 0
max_seq_len...................: 2048
inference_data_type...........: fp16
random_seed...................: True
skip_end_tokens...............: True
detokenize....................: True
int8_mode.....................: 0
weights_data_type.............: fp32
return_cum_log_probs..........: 0
shared_contexts_ratio.........: 0.0
use_gpt_decoder_ops...........: False
alibi.........................: True
layernorm_eps.................: 1e-05
=================================================
dskhudia commented 1 year ago

@Louis-y-nlp: MPT was trained with the bf16 datatype, and that's what you should use for inference. I tried it with bf16 and I don't see garbled text.

sents = ['Explain to me the difference between nuclear fission and fusion.']
print(generate_text(sents)[0])
"""
Nuclear reactions are processes by which atomic nuclei undergo changes in their internal structure, resulting from either collisions or through controlled manipulation of energy released during these interactions with other particles such as photons (light) emitted when electrons move away due radioactive decay within atoms' nucleus after being excited into higher orbits around its core
"""
Louis-y-nlp commented 1 year ago

Thank you for your reply. However, it doesn't seem to work for me; there are still garbled outputs, especially when I prompt the model to generate non-English text such as Chinese. Using Hugging Face's inference code works fine.

Louis-y-nlp commented 1 year ago

The default parameter repetition_penalty=5 in my script is unreasonable. After reducing it to around 1.1, I obtain the same results as with torch. I will close this issue; thank you again.
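For anyone hitting the same problem, a minimal sketch using the generate_text helper from my script above (the only change is the repetition penalty):

sents = ['Explain to me the difference between nuclear fission and fusion.']
print(generate_text(sents, repetition_penalty=1.1)[0])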

Louis-y-nlp commented 1 year ago

@dskhudia Nice to see you again. I have a couple of other questions. Firstly, have you tested the speed improvement from using FT? I ran tests on a V100 GPU in the Docker environment recommended by NVIDIA. With 1 GPU and a batch size of 1, the inference speed of FT is similar to FastChat's. However, with 2 GPUs, the program gets stuck after answering a few questions. I am unable to pinpoint the issue, and any help would be greatly appreciated.

Secondly, I observed that when a maximum length is given, the generated outputs (gen_outputs) are always padded to the maximum length with EOS tokens. I'm unsure whether the model actually ran inference for that entire length, generating EOS tokens over and over (which would cost a lot of time), or whether this is just padding. Thanks again for your assistance.
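To see how much of the returned buffer is real output versus EOS padding, I count the non-EOS tokens after the prompt. A rough sketch that matches the tokens_batch / start_lengths handling in my generate_text above; it only measures the contents of the output buffer, not whether the kernel kept decoding after EOS:

def count_generated(tokens_batch, start_lengths, end_id):
    """Count non-EOS tokens after the prompt for each sequence (beam 0)."""
    counts = []
    for i, tokens in enumerate(tokens_batch):
        gen = tokens[0][start_lengths[i]:]  # drop the prompt, keep beam 0
        counts.append(int((gen != end_id).sum()))
    return counts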

dakinggg commented 1 year ago

Hey @dskhudia are you able to help out here?

ghost commented 1 year ago

@Louis-y-nlp, sorry for the late reply; I missed it.

1) I am not familiar with FastChat and haven't run it. However, compared to HF generate, we saw a >2x speedup at batch size 1 for the 7B model. We have run multi-GPU inference with FT successfully without any hangs, so I'm not sure about the root cause.
2) Is this with FT?

dakinggg commented 1 year ago

Closing due to inactivity. Please feel free to open a new issue if you are still encountering problems.