@byshiue Thank you! Now the issue with different outputs does not occur when no seed is provided. But comparing the current build of the main branches against one from a month ago, the inference time of the latest build is slower:
Benchmark:
torch: 25.71 tokens/second
triton+ft (latest version): 17.17 tokens/second
triton+ft (old version): 33.17 tokens/second
Do you know why this might be?
Also, the issue still exists when passing a seed.
With seed 0:
A: ['What about you?']
B: ['Then tell me who owns them. Tell me who you belong to.']
A+B: ['Yeah', "Okay. But first you have to prove to me that you're worth keeping around."]
Without seed:
A: ['Yeah']
B: ["Okay. But first you have to prove to me that you're worth keeping around."]
A+B: ['Yeah', "Okay. But first you have to prove to me that you're worth keeping around."]
Benchmark:
torch: 25.71 tokens/second
triton+ft (latest version): 17.17 tokens/second
triton+ft (old version): 33.17 tokens/second
Do you know why this might be?
Make sure you have run warmup for some iterations, and then run the benchmark for several iterations to get the mean throughput.
If you still encounter a performance regression after checking these issues, you can try
nsys profile -o report-1 tritonserver --model-repository=...
to launch the server. nsys can help profile the performance, and we can help analyze the timeline if you can provide the report.
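In case it is useful, here is a minimal sketch of that measurement pattern; run_inference is a hypothetical callable that sends one request and returns the number of generated tokens.

import time

def measure_throughput(run_inference, warmup_iters=10, bench_iters=10):
    # Warm up first: the earliest requests pay one-time costs (CUDA context,
    # memory pools, lazy allocations), so exclude them from the measurement.
    for _ in range(warmup_iters):
        run_inference()
    # Then time several iterations and report the mean tokens/second.
    rates = []
    for _ in range(bench_iters):
        start = time.time()
        generated_tokens = run_inference()
        rates.append(generated_tokens / (time.time() - start))
    return sum(rates) / len(rates)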
Also, the issue still exists when passing a seed.
Make sure you have fixed the sequence length issue. Besides, I also found another issue in your implementation: FT supports right-hand-side padding, but not left-hand-side. So,
padding_side = "left"
tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side=padding_side)
should be fixed. Also, you only post output_texts_cropped, but not the full output_texts. In FT, we move the padding to the end after inference. For example, the inputs would be like
[id_1, pad]
[id_2, id_3]
and the output would be
[id_1, out_1, pad]
[id_2, id_3, out_2]
but not
[id_1, pad, out_1]
[id_2, id_3, out_2]
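To make the cropping concrete, here is a minimal sketch of how the generated part can be sliced out under that layout. It assumes the variables from the script below (result, tokenizer, and the unpadded per-sample input lengths, called input_sizes there); output_ids has shape [batch, beam, seq] and sequence_length has shape [batch, beam].

# Sketch only: with right-side padding, FT appends the generated tokens
# right after each sample's real input tokens and moves the padding to the
# end, so ids[input_len:seq_len] is exactly the generated part.
output_ids = result.as_numpy("output_ids")
seq_lens = result.as_numpy("sequence_length")
for ids, seq_len, in_len in zip(output_ids, seq_lens, input_sizes):
    generated = ids[0][in_len:seq_len[0]]   # beam 0
    print(tokenizer.decode(generated, skip_special_tokens=True))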
Regarding the issue with the seed and the output difference, here is the code I used to check:
import time
import numpy as np
import pandas as pd
import requests
import tritonclient.http as httpclient
from collections.abc import Mapping
from tritonclient.utils import np_to_triton_dtype
from transformers import AutoTokenizer
import random
import tqdm
from copy import deepcopy
from datasets import load_dataset
dataset = load_dataset("ChaiML/user_model_inputs")
URL = "lit-v2-triton-latest.tenant-chairesearch-test.knative.chi.coreweave.com"
DEFAULT_CONFIG = {
'protocol': 'http',
'url': f'{URL}:80',
'model_name': 'fastertransformer',
'verbose': False,
}
dtype = "uint32"
GENERATION_CONFIG = {
"request": [
{
"name": "input_ids",
"data": [],
"dtype": "int32"
},
{
"name": "input_lengths",
"data": [],
"dtype": "int32"
},
{
"name": "request_output_len",
"data": [[64]],
"dtype": "int32"
},
{
"name": "temperature",
"data": [[0.72]],
"dtype": "float32"
},
{
"name": "repetition_penalty",
"data": [[1.13]],
"dtype": "float32"
},
{
"name": "random_seed",
"data": [[0]],
"dtype": "int32"
},
{
"name": "runtime_top_k",
"data": [[0]],
"dtype": "int32"
},
{
"name": "runtime_top_p",
"data": [[0.725]],
"dtype": "float32"
},
{
"name": "stop_words_list",
"data": [[[198], [1]]],
"dtype": "int32"
},
{
"name": "bad_words_list",
"data": [[[77, 15249, 77], [2, 5, 7]]],
"dtype": "int32"
}
]
}
padding_side = "right"
tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side=padding_side)
pad_token_id = 50256
tokenizer.pad_token_id = pad_token_id
assert tokenizer.pad_token_id == pad_token_id, 'incorrect padding token'
tokenizer.padding_side = padding_side
tokenizer.truncation_side = padding_side
def to_word_list_format(words):
flat_ids = []
offsets = []
item_flat_ids = []
item_offsets = []
for word in words:
ids = tokenizer.encode(word)
if len(ids) == 0:
continue
item_flat_ids += ids
item_offsets.append(len(ids))
flat_ids.append(np.array(item_flat_ids))
offsets.append(np.cumsum(np.array(item_offsets)))
pad_to = max(1, max(len(ids) for ids in flat_ids))
for i, (ids, offs) in enumerate(zip(flat_ids, offsets)):
flat_ids[i] = np.pad(ids, (0, pad_to - len(ids)), constant_values=0)
offsets[i] = np.pad(offs, (0, pad_to - len(offs)), constant_values=-1)
return np.array([flat_ids, offsets], dtype="int32").transpose((1, 0, 2))
def load_bad_word_ids():
forbidden = [
'test'
]
return to_word_list_format(forbidden)
GENERATION_CONFIG["request"][-1]["data"] = load_bad_word_ids()
def generate_parameters_from_texts(texts, random_seed=None):
params = deepcopy(GENERATION_CONFIG["request"])
inputs = tokenizer(texts, return_tensors="np", add_special_tokens=False, padding=True)
input_ids_no_pad = tokenizer(texts, return_tensors="np", add_special_tokens=False, padding=False).input_ids
input_ids = inputs.input_ids
input_sizes = [len(sample_input_ids) for sample_input_ids in input_ids_no_pad]
# print("INPUT")
# print(inputs.input_ids)
# print(input_ids.shape)
# print("#" * 100)
random_seed_index = 0
for index, value in enumerate(params):
if value['name'] == 'input_ids':
data = np.array([np.array(data) for data in input_ids], dtype=value['dtype'])
elif value['name'] == 'input_lengths':
value_data = [[len(sample_input_ids)] for sample_input_ids in input_ids_no_pad]
data = np.array([data for data in value_data], dtype=value['dtype'])
elif value['name'] == 'random_seed':
random_seed_index = index
data = np.array([[random_seed] for _ in range(len(input_ids))], dtype=value['dtype'])
# elif value['name'] == 'beam_width':
# data = np.array([[len(input_ids)] for _ in range(len(input_ids))], dtype=value['dtype'])
else:
data = np.array([data for data in value['data']] * len(input_ids), dtype=value['dtype'])
params[index] = {
'name': value['name'],
'data': data,
}
if random_seed == -1:
params.pop(random_seed_index)
return params, input_sizes
def prepare_tensor(client, name, input):
t = client.InferInput(name, input.shape, np_to_triton_dtype(input.dtype))
t.set_data_from_numpy(input)
return t
def get_last_message(text):
if text == "":
return ""
if text[-1] == "\n":
return ""
last_raw = text.split("\n")[-1]
last_message = last_raw.split(":")[-1]
return last_message.strip()
def triton_inference(inference_client, texts, random_seed=None):
request, input_sizes = generate_parameters_from_texts(texts, random_seed)
# print(request)
payload = [prepare_tensor(httpclient, field['name'], field['data'])
for field in request]
result = inference_client.infer(DEFAULT_CONFIG['model_name'], payload)
output_texts = []
output_texts_cropped = []
for input_size_tokens, output in zip(input_sizes, result.get_response()['outputs']):
if output['name'] == "output_ids":
for output_ids in result.as_numpy(output['name']):
output_ids = [int(output_id) for output_id in list(output_ids[0])]
# print(output_ids)
output_text = tokenizer.decode(output_ids, skip_special_tokens=True).strip()
output_texts.append(output_text)
output_texts_cropped.append(
get_last_message(output_text)
)
# output_texts_cropped.append(
# tokenizer.decode(
# output_ids[input_size_tokens:], skip_special_tokens=True
# ).strip()
# )
print(output_texts_cropped)
return output_texts
def get_stats(texts):
input_ids = tokenizer(texts, return_tensors="np", add_special_tokens=False).input_ids
input_sizes = [len(sample_input_ids) for sample_input_ids in input_ids]
return input_sizes
def main():
client = httpclient.InferenceServerClient(DEFAULT_CONFIG['url'], verbose=DEFAULT_CONFIG['verbose'], concurrency=1)
INPUT_EXAMPLES = dataset["train"]["text"][:2]
example1 = INPUT_EXAMPLES[0]
example2 = INPUT_EXAMPLES[1]
random_seed = 0
correct_output1 = triton_inference(client, [example1], random_seed=random_seed)[0]
example1_response = triton_inference(client, [example1, example1], random_seed)
print(f"Should be TRUE: {example1_response[0] == example1_response[1] == correct_output1}")
correct_output2 = triton_inference(client, [example2], random_seed=random_seed)[0]
example2_response = triton_inference(client, [example2, example2], random_seed)
print(f"Should be TRUE: {example2_response[0] == example2_response[1] == correct_output2}")
output1, output2 = triton_inference(client, [example1, example2], random_seed=random_seed)
print(f"Should be TRUE: {correct_output1 == output1}")
print(f"Should be TRUE: {correct_output2 == output2}")
if __name__ == "__main__":
main()
If I remove the seed entirely (by setting it to -1), I get the following output:
['Yeah']
['Yeah', 'Yeah']
Should be TRUE: True
["Okay. But first you have to prove to me that you're worth keeping around."]
["Okay. But first you have to prove to me that you're worth keeping around.", "Okay. But first you have to prove to me that you're worth keeping around."]
Should be TRUE: True
['Yeah', "Okay. But first you have to prove to me that you're worth keeping around."]
Should be TRUE: True
Should be TRUE: True
If I set the seed to any number (for example 0):
['What about you?']
['Yeah', 'Yeah']
Should be TRUE: False
['Then tell me who owns them. Tell me who you belong to.']
["Okay. But first you have to prove to me that you're worth keeping around.", "Okay. But first you have to prove to me that you're worth keeping around."]
Should be TRUE: False
['Yeah', "Okay. But first you have to prove to me that you're worth keeping around."]
Should be TRUE: False
Should be TRUE: False
About the inference time. Here is the nsys report: https://drive.google.com/file/d/1eWIyPV7AwBHVCcvy57mpVFf13EYy0AH-/view?usp=sharing
Can you provide reports for both the older and the newer version?
Sure, I will do this now.
Here is the report for the old version: https://drive.google.com/file/d/1eDfiRvEPfvCBFIXFRWUz5GsrxXnpaL9w/view?usp=sharing
For the issue of different results, it looks like a bug in the config.pbtxt of your demo. Please try to change the random_seed in config.pbtxt from TYPE_INT32 to TYPE_UINT64.
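For what it's worth, a minimal client-side sketch of the matching change (the values are placeholders): once config.pbtxt declares random_seed as TYPE_UINT64, the request tensor has to be built as uint64 as well.

import numpy as np
import tritonclient.http as httpclient
from tritonclient.utils import np_to_triton_dtype

# random_seed must be sent as uint64 once config.pbtxt uses TYPE_UINT64;
# shape is [batch_size, 1] to match dims: [ 1 ] with batching enabled.
seed = np.array([[0]], dtype=np.uint64)
seed_input = httpclient.InferInput("random_seed", seed.shape,
                                   np_to_triton_dtype(seed.dtype))
seed_input.set_data_from_numpy(seed)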
Here is the code to benchmark endpoints:
import itertools
import numpy as np
import pandas as pd
import tritonclient.http as httpclient
from collections.abc import Mapping
from tritonclient.utils import np_to_triton_dtype
from transformers import AutoTokenizer
import random
import tqdm
from copy import deepcopy
import time
from datasets import load_dataset
MAX_SIZE = 1024
DTYPE = "uint32"
def _prepare_tensor(client, name, input):
t = client.InferInput(name, input.shape, np_to_triton_dtype(input.dtype))
t.set_data_from_numpy(input)
return t
class BenchMarkGPTJ6B(object):
def __init__(self):
self.request_batch = []
self.batch_size = 1
self.benchmark_averages = {}
self.iterations = []
self.total_duration = 0.0
self.total_output_token_size = 0
self.cut_off_to_measure = 10
self.num_exps = 10
self.dataset = None
self.input_examples = None
self.tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
self.tokenizer.pad_token_id = 50256
self.tokenizer.truncation_side = "left"
self.tokenizer.padding_side = "right"
self.prepare_examples()
def prepare_examples(self):
self.dataset = load_dataset("ChaiML/user_model_inputs")
# value = random.randint(0, len(self.dataset["train"]) - self.batch_size - 1)
value = 0
self.input_examples = self.dataset["train"]["text"][value:value + self.batch_size]
def _generate_inputs(self, input_size):
input_ids = self.tokenizer(self.input_examples, return_tensors="np", padding="longest",
truncation=True, max_length=input_size).input_ids
inputs_no_pad = self.tokenizer(self.input_examples, return_tensors="np", padding="longest",
truncation=True, max_length=input_size).input_ids
input_lengths = [len(x) for x in inputs_no_pad]
return input_ids, input_lengths
def construct_request_input(self, request, input_size, context_size, batch_size=1):
input_ids, input_lengths = self._generate_inputs(input_size)
for input_ids_, input_lengths_ in zip(input_ids, input_lengths):
# input_ids
request["request"][0]["data"].append(input_ids_)
# input_lengths
request["request"][1]["data"].append([input_lengths_])
# request_output_len
output_size = context_size - input_size
assert (context_size <= MAX_SIZE)
request["request"][2]["data"] = [[output_size]] * batch_size
# beam_search_diversity_rate
request["request"][3]["data"] = [[0]] * batch_size
# temperature
request["request"][4]["data"] = [[0.72]] * batch_size
# len_penalty
request["request"][5]["data"] = [[1.0]] * batch_size
# repetition_penalty
request["request"][6]["data"] = [[1.13]] * batch_size
# random_seed
request["request"][7]["data"] = [[1]] * batch_size
# is_return_log_probs
request["request"][8]["data"] = [[False]] * batch_size
# beam_width
request["request"][9]["data"] = [[1]] * batch_size
# runtime_top_k
request["request"][10]["data"] = [[0]] * batch_size
# runtime_top_p
request["request"][11]["data"] = [[0.725]] * batch_size
# stop_words_list
request["request"][12]["data"] = [[[198], [1]]] * batch_size
# bad_words_list
request["request"][13]["data"] = [[[0], [-1]]] * batch_size
return request
def construct_data(self, verbose=False):
# Change if you want to iterate over MULTIPLE which
# generates sequences upto the N-th multiple of 2
# self.input_sizes = _iter_token_sizes(multiple=7, default=False, verbose=False)
self.iterations = []
self.request_batch = []
self.input_sizes = [512]
self.context_sizes = [512 + 64]
assert len(self.input_sizes) == len(self.context_sizes)
for r in itertools.product(self.input_sizes, self.context_sizes):
if r[0] < r[1]:
self.iterations.append(r)
print(f"iterations : {self.iterations}")
if verbose:
print(f"self.input_sizes : {self.input_sizes}")
global DTYPE
for (input_size, context_size) in self.iterations:
request_template = {
"request": [
{
"name": "input_ids",
"data": [],
"dtype": DTYPE
},
{
"name": "input_lengths",
"data": [],
"dtype": DTYPE
},
{
"name": "request_output_len",
"data": [],
"dtype": DTYPE
},
{
"name": "beam_search_diversity_rate",
"data": [],
"dtype": "float32"
},
{
"name": "temperature",
"data": [],
"dtype": "float32"
},
{
"name": "len_penalty",
"data": [],
"dtype": "float32"
},
{
"name": "repetition_penalty",
"data": [],
"dtype": "float32"
},
{
"name": "random_seed",
"data": [],
"dtype": "int32"
},
{
"name": "is_return_log_probs",
"data": [],
"dtype": "bool"
},
{
"name": "beam_width",
"data": [],
"dtype": DTYPE
},
{
"name": "runtime_top_k",
"data": [],
"dtype": DTYPE
},
{
"name": "runtime_top_p",
"data": [],
"dtype": "float32"
},
{
"name": "stop_words_list",
"data": [],
"dtype": "int32"
},
{
"name": "bad_words_list",
"data": [],
"dtype": "int32"
}
]
}
if verbose:
print(f"input_size : {input_size}")
print(f"context_size : {context_size}")
self.request_batch.append(
self.construct_request_input(request_template, input_size, context_size, self.batch_size))
for request in self.request_batch:
for index, value in enumerate(request['request']):
if verbose:
print(f"value['name'] : {value['name']}")
print(f"value : {value}")
# print(value)
request['request'][index] = {
'name': value['name'],
'data': np.array(value['data'], dtype=value['dtype']),
}
def get_last_message(self, text):
if text == "":
return ""
if text[-1] == "\n":
return ""
last_raw = text.split("\n")[-1]
last_message = last_raw.split(":")[-1]
return last_message.strip()
def postprocess(self, output_ids):
return [
self.get_last_message(self.tokenizer.decode(output_ids_[0], skip_special_tokens=True).strip()) for
output_ids_ in
output_ids
]
def get_result(self, client, config, request, verbose=False):
payload = [_prepare_tensor(CLIENT_TYPE, field['name'], field['data'])
for field in request['request']]
start_time = time.time()
result = client.infer(config['model_name'], payload)
duration = time.time() - start_time
output_text = None
for output in result.get_response()['outputs']:
if output['name'] == "output_ids":
output_ids = result.as_numpy(output['name'])
output_text = self.postprocess(output_ids)
if verbose:
print(f"output_text : {output_text}")
if verbose:
print("{}:\n{}\n".format(output['name'], result.as_numpy(output['name'])))
return duration, result.as_numpy(result.get_response()['outputs'][1]['name'])[0][0], output_text
def warmup(self, client):
request = self.request_batch[0]
for _ in tqdm.trange(10, desc="Warming up"):
self.get_result(client, DEFAULT_CONFIG, request, verbose=False)
def benchmark(self, client, verbose=False):
self.benchmark_averages = dict()
self.construct_data(verbose=verbose)
self.warmup(client)
output_text = None
for exp_no in tqdm.trange(self.num_exps, desc="Benchmarking"):
for index, request in enumerate(self.request_batch):
if verbose:
print(f"============================================")
print(f"exp_no : {exp_no}")
print(f"Input Token Size : {self.iterations[index][0]}")
print(f"Batch Size : {self.batch_size}\n")
duration, result, output_text = self.get_result(client=client, config=DEFAULT_CONFIG, request=request,
verbose=verbose)
if verbose:
print(f"Output Token Size : {result - self.iterations[index][0]}")
self.total_output_token_size += result - self.iterations[index][0]
self.total_duration += duration
if verbose:
print(f"#############################\n")
print(
f"Inference time for Output Token Size = {result - self.iterations[index][0]} : {duration * 1000} ms \n")
print(f"#############################")
print(f"============================================")
tokens_per_second = (result - self.iterations[index][0]) / duration
if verbose:
print(f"#############################\n")
print(f"tokens_per_second : {tokens_per_second} Tokens/Second \n")
print(f"#############################")
json_result = {"model": "gpt-j-6b", "outputTokens": int(result - self.iterations[index][0]),
"inputTokens": self.iterations[index][0], "contextSize": self.iterations[index][1],
"duration_ms": duration * 1000, "tokensPerSecond": tokens_per_second,
"output_text": output_text}
exp_key = f"{self.iterations[index][0]}-{self.iterations[index][1]}-{int(self.iterations[index][1] - self.iterations[index][0])}-gpt-j-6b-{exp_no}"
self.benchmark_averages[exp_key] = json_result
def get_stats(self):
stats = {
"duration_ms": [],
"tokensPerSecond": [],
"outputTokens": [],
}
for key, value in self.benchmark_averages.items():
for stat_key, stat_value in stats.items():
stats[stat_key].append(value[stat_key])
df = pd.DataFrame(stats)
return df
def print_stats(self):
df = self.get_stats()
print()
print("=" * 21, "STATS", "=" * 21)
print(df.describe())
print("=" * 49)
DTYPE = "uint32"
URL = "lit-v2-triton-old.tenant-chairesearch-test.knative.chi.coreweave.com"
DEFAULT_CONFIG = {
'protocol': 'http',
'url': f'{URL}:80',
'model_name': 'fastertransformer',
'verbose': False,
}
CLIENT_TYPE = httpclient
client = httpclient.InferenceServerClient(DEFAULT_CONFIG['url'], verbose=DEFAULT_CONFIG['verbose'], concurrency=10)
bench = BenchMarkGPTJ6B()
bench.benchmark(client, verbose=False)
bench.print_stats()
DTYPE = "int32"
URL = "lit-v2-triton-latest.tenant-chairesearch-test.knative.chi.coreweave.com"
DEFAULT_CONFIG = {
'protocol': 'http',
'url': f'{URL}:80',
'model_name': 'fastertransformer',
'verbose': False,
}
client = httpclient.InferenceServerClient(DEFAULT_CONFIG['url'], verbose=DEFAULT_CONFIG['verbose'], concurrency=10)
bench.benchmark(client, verbose=False)
bench.print_stats()
Output:
OLD:
===================== STATS =====================
duration_ms tokensPerSecond outputTokens
count 10.000000 10.000000 10.0
mean 469.406748 22.110939 10.0
std 105.130958 4.031355 0.0
min 395.360947 15.170820 10.0
25% 396.981299 20.390768 10.0
50% 421.901941 23.702575 10.0
75% 492.536068 25.190130 10.0
max 659.160137 25.293343 10.0
=================================================
NEW:
===================== STATS =====================
duration_ms tokensPerSecond outputTokens
count 10.000000 10.000000 10.0
mean 1426.600003 4.222481 6.0
std 101.017563 0.262177 0.0
min 1356.654882 3.531660 6.0
25% 1375.118494 4.174454 6.0
50% 1393.091917 4.307011 6.0
75% 1437.314212 4.363261 6.0
max 1698.917866 4.422643 6.0
=================================================
But from your reports, the newer one (nsys-report-4c19) only takes 7 seconds to handle 5 sentences, while the older one takes 10 seconds to handle 5 sentences.
I have tested again on the latest code and the benchmark looks like this:
===================== STATS =====================
duration_ms tokensPerSecond outputTokens
count 10.000000 10.000000 10.0
mean 351.808929 17.054744 6.0
std 0.504247 0.024379 0.0
min 351.335526 16.990481 6.0
25% 351.517081 17.051868 6.0
50% 351.705194 17.059742 6.0
75% 351.867616 17.068872 6.0
max 353.138924 17.077692 6.0
=================================================
on A100 when the input length is 512 and the output length is 32, which is close to our benchmark at https://github.com/NVIDIA/FasterTransformer/blob/main/docs/gpt_guide.md#performance-of-gpt-67b (although that is a benchmark of GPT-6.7B, not GPT-J).
Can you build the code again to make sure you can still reproduce this issue? If you can, can you provide the commit that reproduces it?
I still have a lower speed:
===================== STATS =====================
duration_ms tokensPerSecond outputTokens
count 10.000000 10.000000 10.0
mean 1398.001814 7.166076 10.0
std 65.583614 0.308294 0.0
min 1362.500906 6.369585 10.0
25% 1363.800526 7.220111 10.0
50% 1370.553136 7.296327 10.0
75% 1385.023892 7.332453 10.0
max 1569.961071 7.339445 10.0
=================================================
Can you share the needed config.pbtxt?
The issue with the seed is fixed by setting TYPE_UINT64, thank you! The only thing left to resolve is the speed; maybe my config is still wrong.
This is the config file I am using:
# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of NVIDIA CORPORATION nor the names of its
# contributors may be used to endorse or promote products derived
# from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
name: "fastertransformer"
backend: "fastertransformer"
default_model_filename: "gpt-j-6b"
max_batch_size: 1024
model_transaction_policy {
decoupled: False
}
input [
{
name: "input_ids"
data_type: TYPE_INT32
dims: [ -1 ]
},
{
name: "start_id"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "end_id"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "input_lengths"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
},
{
name: "request_output_len"
data_type: TYPE_INT32
dims: [ -1 ]
},
{
name: "runtime_top_k"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "runtime_top_p"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "beam_search_diversity_rate"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "temperature"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "len_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "repetition_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "random_seed"
data_type: TYPE_UINT64
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "is_return_log_probs"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "beam_width"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "bad_words_list"
data_type: TYPE_INT32
dims: [ 2, -1 ]
optional: true
},
{
name: "stop_words_list"
data_type: TYPE_INT32
dims: [ 2, -1 ]
optional: true
},
{
name: "prompt_learning_task_name_ids"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
}
]
output [
{
name: "output_ids"
data_type: TYPE_INT32
dims: [ -1, -1 ]
},
{
name: "sequence_length"
data_type: TYPE_INT32
dims: [ -1 ]
},
{
name: "cum_log_probs"
data_type: TYPE_FP32
dims: [ -1 ]
},
{
name: "output_log_probs"
data_type: TYPE_FP32
dims: [ -1, -1 ]
}
]
instance_group [
{
count: 1
kind: KIND_CPU
}
]
parameters {
key: "tensor_para_size"
value: {
string_value: "1"
}
}
parameters {
key: "pipeline_para_size"
value: {
string_value: "1"
}
}
parameters {
key: "data_type"
value: {
string_value: "fp16"
}
}
parameters {
key: "model_type"
value: {
string_value: "GPT-J"
}
}
parameters {
key: "model_checkpoint_path"
value: {
string_value: "/mnt/pvc/triton-model-store-lit-latest/fastertransformer/1"
}
}
parameters {
key: "enable_custom_all_reduce"
value: {
string_value: "0"
}
}
Benchmark on A100 40GB (same model as before):
===================== STATS =====================
duration_ms tokensPerSecond outputTokens
count 10.000000 10.000000 10.0
mean 863.273740 11.884291 10.0
std 174.210427 1.668077 0.0
min 774.039030 7.424325 10.0
25% 777.982295 11.719851 10.0
50% 799.238801 12.517188 10.0
75% 853.254139 12.853763 10.0
max 1346.923828 12.919245 10.0
=================================================
I also made the same evaluation with an example model: https://mystic.the-eye.eu/public/AI/GPT-J-6B/step_383500_slim.tar.gz
===================== STATS =====================
duration_ms tokensPerSecond outputTokens
count 10.000000 10.000000 10.0
mean 1430.156088 9.789152 14.0
std 1.561623 0.010689 0.0
min 1427.980185 9.773862 14.0
25% 1428.893328 9.779634 14.0
50% 1430.096388 9.789551 14.0
75% 1431.546509 9.797793 14.0
max 1432.391882 9.804058 14.0
=================================================
I assume this is a sign that the problem is not with the model but with something else.
Can you provide the commits for the slower one and the faster one?
I am not sure about the exact commits, but here is what I have:
- Old/faster image (pushed on Sep 7, 2022 at 1:58 am): https://hub.docker.com/layers/rtalaricw/gptj_ft/v1.2/images/sha256-816f96fdc80c962f0ef4968fe925555453da375c13edc6b0754142cd37dc7628?context=explore
- New/slower image (pushed on Oct 5, 2022 at 11:31 pm): https://hub.docker.com/layers/rtalaricw/gptj_ft/v1.2-22.04-new/images/sha256-0b8fdb0728cc38719fa2685c6a74175ff1d0c99a5a2f9151e805b2c18e6e390d?context=explore
Both images are based on the same Dockerfile provided above; the only difference is the build date.
Thank you for the information. We have found the reason and will fix it ASAP.
@AlekseyKorshuk I have updated FT with the fix. Please re-build the docker image and test again.
@byshiue Thank you, got the following result:
===================== STATS =====================
duration_ms tokensPerSecond outputTokens
count 10.000000 10.000000 10.0
mean 438.109899 22.906456 10.0
std 28.672853 1.377898 0.0
min 421.448946 20.098810 10.0
25% 422.072709 22.975475 10.0
50% 423.495412 23.613012 10.0
75% 435.325563 23.692601 10.0
max 497.541904 23.727666 10.0
=================================================
I will close this issue now and reopen it if something changes after a more detailed evaluation. Again, thank you for the quick fix!
@byshiue After A/B testing with the current latest build, the quality of responses was worse by ~50%: from 7.72 to 4.02 (higher is better). Maybe you know the issue off the top of your head? Otherwise I might need to find a way to reproduce it, with a fast feedback loop and without A/B testing, so I can share it with you.
I compared 2 setups:
The "old" setup shows exactly the same score as default PyTorch. Since the "old" version can be used only with batch size 1, it can show good results in terms of quality without batching.
I am confused by this:
The "old" setup shows exactly the same score as default PyTorch. Since the "old" version can be used only with batch size 1, it can show good results in terms of quality without batching.
Why can the old version only be used with batch size 1? What about running the new version with batch size 1?
In our internal tests, we don't find any difference, so it is hard to check this issue with the current workflow. We need a workflow to reproduce it. If you can find a test (like summarization) where we can compare the scores of FT and HF, or of the new and old versions, we can help take a look.
@byshiue Sorry for the late reply. Sharing with you that everything works fine now. The last problem was with my inference code (just my bug). The quality of responses is the same, and batched inference with padding works. After A/B testing I can say it is a 1.35-1.65x speedup -> cost reduction. Thank you so much for the quick fix, I enjoyed chatting with you 🤗
Description
Inference outputs without batching do not match outputs with batching.
Such a mismatch exists when padding is used →
Example
Case 1
Input:
["My name is"]
Output:
[" Aleksey"]
Case 2
Input:
["I am from"]
Output:
[" Belarus"]
Case 3
Input:
["My name is", "I am from"]
Output:
[" Alex", " UK"]
Expected behaviour
Case 3 should have the same outputs as in Case 1 and Case 2.
The following outputs should match:
Relevant issues
This issue should be resolved in FasterTransformer: https://github.com/NVIDIA/FasterTransformer/issues/312
Thoughts
The idea of batched inference is the use of an attention mask, which is where such behavior should be handled. But there is no way to pass such a parameter as an input, nor to tell the model the pad_token (only bos=start_id and eos=end_id). I assume that a feature for passing attention_masks during inference would help a lot.
I also assume that this is the part that requires changes: https://github.com/triton-inference-server/fastertransformer_backend/blob/5a78164d2449f12dfd23a575cc29d4e8e052f1bf/all_models/gptj/preprocessing/1/model.py#L150
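For comparison, here is roughly the behavior I would expect, sketched with plain HF transformers rather than FT (gpt2 is only a stand-in for the actual GPT-J checkpoint): with left padding and an attention mask, batched greedy generation matches the unbatched results for each prompt.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal sketch (HF transformers, not FT): the attention mask tells the
# model which positions are padding, so batched generation should match
# running each prompt on its own.
tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

batch = tokenizer(["My name is", "I am from"], return_tensors="pt", padding=True)
out = model.generate(
    input_ids=batch.input_ids,
    attention_mask=batch.attention_mask,  # masks the pad tokens
    max_new_tokens=8,
    do_sample=False,                      # greedy, so results are deterministic
)
print(tokenizer.batch_decode(out, skip_special_tokens=True))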
How to reproduce
Docker image
Image available here:
rtalaricw/gptj_ft:v1.2-22.04-new
The image was built on October 5th with the main branches of all needed repos.
Config
Inference code to reproduce the issue
Carbon copy