triton-inference-server / fastertransformer_backend

Unexpected behavior of batched inference of GPT-J #53

Closed AlekseyKorshuk closed 2 years ago

AlekseyKorshuk commented 2 years ago

Description

Inference outputs without batching do not match outputs with batching.

Such a mismatch occurs when padding is used, as in the example below.

Example

Case 1

Input:

["My name is"]

Output:

[" Aleksey"]

Case 2

Input:

["I am from"]

Output:

[" Belarus"]

Case 3

Input:

["My name is", "I am from"]

Output:

[" Alex", " UK"]

Expected behaviour

Case 3 should have the same outputs as in Case 1 and Case 2.

Relevant issues

This issue should be resolved in FasterTransformer: https://github.com/NVIDIA/FasterTransformer/issues/312

Thoughts

Batched inference relies on an attention mask, which is where such behaviour should be handled. But there is no way to pass such a mask as an input, nor to tell the model the pad_token (only bos=start_id and eos=end_id are exposed). I assume that being able to pass attention_masks at inference time would help a lot.

I also assume the part that requires changes is here: https://github.com/triton-inference-server/fastertransformer_backend/blob/5a78164d2449f12dfd23a575cc29d4e8e052f1bf/all_models/gptj/preprocessing/1/model.py#L150
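
For illustration, a minimal, hypothetical sketch of how such a padding mask could be built client-side from per-sample token lengths (the backend does not accept this input today; the helper name is made up):

import numpy as np

def build_padding_mask(input_lengths, max_len):
    # 1 for real tokens, 0 for padding; assumes right-hand-side padding
    mask = np.zeros((len(input_lengths), max_len), dtype=np.int32)
    for row, length in enumerate(input_lengths):
        mask[row, :length] = 1
    return mask

# build_padding_mask([3, 4], 4) -> [[1, 1, 1, 0], [1, 1, 1, 1]]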

How to reproduce

Docker image

Image available here: rtalaricw/gptj_ft:v1.2-22.04-new. The image was built on 5 October from the main branches of all required repos.

# Base Image
ARG TRITON_VERSION=22.04
ARG BASE_IMAGE=nvcr.io/nvidia/tritonserver:${TRITON_VERSION}-py3
FROM ${BASE_IMAGE} as server-builder

# Get NVIDIA keys to authenticate
RUN export this_distro="$(cat /etc/os-release | grep '^ID=' | awk -F'=' '{print $2}')" \
    && export this_version="$(cat /etc/os-release | grep '^VERSION_ID=' | awk -F'=' '{print $2}' | sed 's/[^0-9]*//g')" \
    && apt-key adv --fetch-keys "https://developer.download.nvidia.com/compute/cuda/repos/${this_distro}${this_version}/x86_64/7fa2af80.pub" \
    && apt-key adv --fetch-keys "https://developer.download.nvidia.com/compute/cuda/repos/${this_distro}${this_version}/x86_64/3bf863cc.pub"

# Run updates and install packages for build
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    openssh-server zsh tmux mosh locales-all clangd sudo \
    zip unzip wget build-essential autoconf autogen gdb \
    python3.8 python3-pip python3-dev rapidjson-dev \
    xz-utils zstd libz-dev && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Setup workdir for build
WORKDIR /workspace/build/

# CMake
RUN CMAKE_VERSION=3.18 && \
    CMAKE_BUILD=3.18.4 && \
    wget -nv https://cmake.org/files/v${CMAKE_VERSION}/cmake-${CMAKE_BUILD}.tar.gz && \
    tar -xf cmake-${CMAKE_BUILD}.tar.gz && \
    cd cmake-${CMAKE_BUILD} && \
    ./bootstrap --parallel=$(grep -c ^processor /proc/cpuinfo) -- -DCMAKE_USE_OPENSSL=OFF && \
    make -j"$(grep -c ^processor /proc/cpuinfo)" install && \
    cd /workspace/build/ && \
    rm -rf /workspace/build/cmake-${CMAKE_BUILD}

# backend build
WORKDIR /workspace/build/triton-experiments

RUN echo 2
RUN git clone https://github.com/triton-inference-server/fastertransformer_backend.git
RUN mv /workspace/build/triton-experiments/fastertransformer_backend/cmake /workspace/build/triton-experiments
RUN mv /workspace/build/triton-experiments/fastertransformer_backend/src /workspace/build/triton-experiments
RUN mv /workspace/build/triton-experiments/fastertransformer_backend/CMakeLists.txt /workspace/build/triton-experiments

ARG FORCE_BACKEND_REBUILD=0
RUN mkdir build -p && \
    cd build && \
    cmake \
      -D CMAKE_EXPORT_COMPILE_COMMANDS=1 \
      -D CMAKE_BUILD_TYPE=Release \
      -D CMAKE_INSTALL_PREFIX=/opt/tritonserver \
      -D TRITON_COMMON_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" \
      -D TRITON_CORE_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" \
      -D TRITON_BACKEND_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" \
      .. && \
    make -j"$(grep -c ^processor /proc/cpuinfo)" install

# =================================
#  Runner Image
# =================================

FROM ${BASE_IMAGE} as server

ENV NCCL_LAUNCH_MODE=PARALLEL

COPY --from=server-builder /opt/tritonserver/backends/fastertransformer /opt/tritonserver/backends/fastertransformer

Config

name: "fastertransformer"
backend: "fastertransformer"
default_model_filename: "gpt-j-6b"
max_batch_size: 1024

model_transaction_policy {
  decoupled: False
}

input [
  {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: [ -1 ]
  },
  {
    name: "start_id"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "end_id"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "input_lengths"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
  },
  {
    name: "request_output_len"
    data_type: TYPE_INT32
    dims: [ -1 ]
  },
  {
    name: "runtime_top_k"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "runtime_top_p"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "beam_search_diversity_rate"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "temperature"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "len_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "repetition_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "random_seed"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "is_return_log_probs"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "beam_width"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "bad_words_list"
    data_type: TYPE_INT32
    dims: [ 2, -1 ]
    optional: true
  },
  {
    name: "stop_words_list"
    data_type: TYPE_INT32
    dims: [ 2, -1 ]
    optional: true
  },
  {
    name: "prompt_learning_task_name_ids"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  }
]
output [
  {
    name: "output_ids"
    data_type: TYPE_INT32
    dims: [ -1, -1 ]
  },
  {
    name: "sequence_length"
    data_type: TYPE_INT32
    dims: [ -1 ]
  },
  {
    name: "cum_log_probs"
    data_type: TYPE_FP32
    dims: [ -1 ]
  },
  {
    name: "output_log_probs"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  }
]
instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]
parameters {
  key: "tensor_para_size"
  value: {
    string_value: "1"
  }
}
parameters {
  key: "pipeline_para_size"
  value: {
    string_value: "1"
  }
}
parameters {
  key: "data_type"
  value: {
    string_value: "fp16"
  }
}
parameters {
  key: "model_type"
  value: {
    string_value: "GPT-J"
  }
}
parameters {
  key: "model_checkpoint_path"
  value: {
    string_value: "/mnt/pvc/triton-model-store/fastertransformer/1"
  }
}
parameters {
  key: "enable_custom_all_reduce"
  value: {
    string_value: "0"
  }
}

Inference to reproduce the issue

/opt/tritonserver/bin/tritonserver --model-repository=YOUR_MODEL_PATH
import time
import numpy as np
import requests
import tritonclient.http as httpclient
from collections.abc import Mapping
from tritonclient.utils import np_to_triton_dtype
from transformers import AutoTokenizer
import random
import tqdm
from copy import deepcopy
from datasets import load_dataset

dataset = load_dataset("ChaiML/user_model_inputs")

DEFAULT_CONFIG = {
    'protocol': 'http',
    'url': f'localhost:8000',
    'model_name': 'fastertransformer',
    'verbose': False,
}

dtype = "uint32"

GENERATION_CONFIG = {
    "request": [
        {
            "name": "input_ids",
            "data": [],
            "dtype": dtype
        },
        {
            "name": "input_lengths",
            "data": [],
            "dtype": dtype
        },
        {
            "name": "request_output_len",
            "data": [[64]],
            "dtype": dtype
        },
        {
            "name": "beam_search_diversity_rate",
            "data": [[0]],
            "dtype": "float32"
        },
        {
            "name": "temperature",
            "data": [[0.72]],
            "dtype": "float32"
        },
        {
            "name": "repetition_penalty",
            "data": [[1.13]],
            "dtype": "float32"
        },
        {
            "name": "beam_width",
            "data": [[1]],
            "dtype": dtype
        },
        {
            "name": "random_seed",
            "data": [[0]],
            "dtype": "int32"
        },
        {
            "name": "runtime_top_k",
            "data": [[0]],
            "dtype": dtype
        },
        {
            "name": "runtime_top_p",
            "data": [[0.725]],
            "dtype": "float32"
        },
        {
            "name": "stop_words_list",
            "data": [[[198], [1]]],
            "dtype": "int32"
        },
        {
            "name": "bad_words_list",
            "data": [[[77, 15249, 77], [2, 5, 7]]],
            "dtype": "int32"
        }
    ]
}

padding_side = "left"
tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side=padding_side)
tokenizer.pad_token_id = 50256
assert tokenizer.pad_token_id == 50256, 'incorrect padding token'
tokenizer.padding_side = padding_side
tokenizer.truncation_side = padding_side

def to_word_list_format(words):
    flat_ids = []
    offsets = []
    item_flat_ids = []
    item_offsets = []

    for word in words:
        ids = tokenizer.encode(word)

        if len(ids) == 0:
            continue

        item_flat_ids += ids
        item_offsets.append(len(ids))

    flat_ids.append(np.array(item_flat_ids))
    offsets.append(np.cumsum(np.array(item_offsets)))

    pad_to = max(1, max(len(ids) for ids in flat_ids))

    for i, (ids, offs) in enumerate(zip(flat_ids, offsets)):
        flat_ids[i] = np.pad(ids, (0, pad_to - len(ids)), constant_values=0)
        offsets[i] = np.pad(offs, (0, pad_to - len(offs)), constant_values=-1)

    return np.array([flat_ids, offsets], dtype="int32").transpose((1, 0, 2))

def load_bad_word_ids():
    forbidden = [
        'samplebadword'
        ]

    return to_word_list_format(forbidden)

GENERATION_CONFIG["request"][-1]["data"] = load_bad_word_ids()

def generate_parameters_from_texts(texts, random_seed=None):
    params = deepcopy(GENERATION_CONFIG["request"])
    inputs = tokenizer(texts, return_tensors="np", add_special_tokens=False, padding=True)
    input_ids = inputs.input_ids
    for index, value in enumerate(params):

        if value['name'] == 'input_ids':
            data = np.array([np.array(data) for data in input_ids], dtype=value['dtype'])
        elif value['name'] == 'input_lengths':
            value_data = [[len(sample_input_ids)] for sample_input_ids in input_ids]
            data = np.array([data for data in value_data], dtype=value['dtype'])
        elif value['name'] == 'random_seed':
            if random_seed is None:
                random_seed = random.randint(0, 10000)
            data = np.array([[random_seed] for _ in range(len(input_ids))], dtype=value['dtype'])
        else:
            data = np.array([data for data in value['data']] * len(input_ids), dtype=value['dtype'])

        params[index] = {
            'name': value['name'],
            'data': data,
        }
    return params

def prepare_tensor(client, name, input):
    t = client.InferInput(name, input.shape, np_to_triton_dtype(input.dtype))
    t.set_data_from_numpy(input)
    return t

def triton_inference(inference_client, texts, random_seed=None):
    request = generate_parameters_from_texts(texts, random_seed)
    payload = [prepare_tensor(httpclient, field['name'], field['data'])
               for field in request]
    result = inference_client.infer(DEFAULT_CONFIG['model_name'], payload)
    output_texts = []
    output_texts_cropped = []

    for i, output in enumerate(result.get_response()['outputs']):
        if output['name'] == "output_ids":
            for output_ids in result.as_numpy(output['name']):
                output_ids = [int(output_id) for output_id in list(output_ids[0])]
                output_texts.append(tokenizer.decode(output_ids, skip_special_tokens=True).strip())
                output_texts_cropped.append(
                    tokenizer.decode(
                        output_ids[len(request[0]["data"][i]):], skip_special_tokens=True
                    ).strip()
                )
    return output_texts_cropped

def main():
    client = httpclient.InferenceServerClient(DEFAULT_CONFIG['url'], verbose=DEFAULT_CONFIG['verbose'], concurrency=10)

    INPUT_EXAMPLES = dataset["train"]["text"][:2]
    example1 = INPUT_EXAMPLES[0]
    example2 = INPUT_EXAMPLES[1]

    print(
        triton_inference(client, [example1], random_seed=0)
    )

    print(
        triton_inference(client, [example2], random_seed=0)
    )

    print(
        triton_inference(client, [example1, example2], random_seed=0)
    )

if __name__ == "__main__":
    main()

byshiue commented 2 years ago

There are two reasons.

  1. Your input lengths are set wrong when you batch these two requests. From my test, the input lengths of the two requests are 629 and 529 respectively, but for the batched request both input lengths are set to 629 (see the sketch after this list).
  2. When the batch size differs (1 vs. 2), different kernels may be used to compute the GEMMs, which can lead to slightly different results. If you want to compare, try comparing example 1 of [example 1, example 1] against [example 1, example 2]. After fixing bug 1, you should be able to compare example 2 of [example 2, example 2] against [example 1, example 2].
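
A minimal sketch of the fix for point 1 (the corrected script further down in this thread does the same thing): compute input_lengths from an un-padded tokenization rather than from the padded input_ids.

# Sketch, assuming the tokenizer, texts, and numpy import from the repro script above.
padded = tokenizer(texts, return_tensors="np", add_special_tokens=False, padding=True)
unpadded = tokenizer(texts, add_special_tokens=False, padding=False)

input_ids = padded.input_ids  # shape [batch, max_len], padded to the longest sample
input_lengths = np.array([[len(ids)] for ids in unpadded.input_ids], dtype="uint32")  # true per-sample lengths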
AlekseyKorshuk commented 2 years ago

@byshiue Thank you! The issue with different outputs no longer occurs when no seed is provided. But comparing the current build of the main branches with a build from one month ago, inference with the latest build is slower:

Benchmark:

torch: 25.71 tokens/second
triton+ft (latest version): 17.17 tokens/second
triton+ft (old version): 33.17 tokens/second

Do you know why this could happen?

AlekseyKorshuk commented 2 years ago

Also, the issue still exists when passing a seed.

With seed 0:

A:   ['What about you?']
B:   ['Then tell me who owns them. Tell me who you belong to.']
A+B: ['Yeah', "Okay. But first you have to prove to me that you're worth keeping around."]

Without seed:

A:   ['Yeah']
B:   ["Okay. But first you have to prove to me that you're worth keeping around."]
A+B: ['Yeah', "Okay. But first you have to prove to me that you're worth keeping around."]
byshiue commented 2 years ago

Benchmark: torch: 25.71 tokens/second, triton+ft (latest version): 17.17 tokens/second, triton+ft (old version): 33.17 tokens/second. Do you know why this could happen?

Make sure you run a warmup for some iterations, and then run the benchmark for several iterations to get the mean throughput.
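
For example, a minimal warmup-then-measure loop, reusing triton_inference, client, and example1 from the repro script above (those names are assumptions taken from that script):

# Warm up first, then time several iterations and report the mean latency.
for _ in range(10):
    triton_inference(client, [example1], random_seed=0)

durations = []
for _ in range(20):
    start = time.time()
    triton_inference(client, [example1], random_seed=0)
    durations.append(time.time() - start)
print(f"mean latency: {np.mean(durations) * 1000:.1f} ms")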

If you still encounter a performance regression after checking these points, you can try

nsys profile -o report-1 tritonserver --model-repository=...

to launch the server. nsys helps profile the performance, and we can help analyze the timeline if you provide the report.

Also, the issue still exists when passing a seed.

Make sure you have fixed the sequence-length issue. Besides, I found another issue in your implementation: FT supports right-hand-side padding, but not left-hand-side. So,

padding_side = "left"
tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side=padding_side)

should be fixed. Also, you only post output_texts_cropped, not the full output_texts. Note that in FT we move the padding to the end after inference. For example, the inputs would be like

[id_1, pad]
[id_2, id_3]

and the output would be

[id_1, out_1, pad]
[id_2, id_3, out_2]

but not

[id_1, pad, out_1]
[id_2, id_3, out_2]
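
Given that layout, a minimal sketch of cropping the generated continuation per sample (assuming right-hand padding, the output shapes from the config above, and input_lengths as the list of true, un-padded token counts):

# output_ids has shape [batch, beam, seq_len]; sequence_length has shape [batch, beam].
output_ids = result.as_numpy("output_ids")
seq_lens = result.as_numpy("sequence_length")

for b, input_len in enumerate(input_lengths):
    generated = output_ids[b, 0, input_len:seq_lens[b, 0]]  # beam 0, generated tokens only
    print(tokenizer.decode(generated, skip_special_tokens=True).strip())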
AlekseyKorshuk commented 2 years ago

About the issue with the seed and output differences, here is the code check:

import time

import numpy as np
import pandas as pd
import requests
import tritonclient.http as httpclient
from collections.abc import Mapping
from tritonclient.utils import np_to_triton_dtype
from transformers import AutoTokenizer
import random
import tqdm
from copy import deepcopy
from datasets import load_dataset

dataset = load_dataset("ChaiML/user_model_inputs")

URL = "lit-v2-triton-latest.tenant-chairesearch-test.knative.chi.coreweave.com"

DEFAULT_CONFIG = {
    'protocol': 'http',
    'url': f'{URL}:80',
    'model_name': 'fastertransformer',
    'verbose': False,
}

dtype = "uint32"

GENERATION_CONFIG = {
    "request": [
        {
            "name": "input_ids",
            "data": [],
            "dtype": "int32"
        },
        {
            "name": "input_lengths",
            "data": [],
            "dtype": "int32"
        },
        {
            "name": "request_output_len",
            "data": [[64]],
            "dtype": "int32"
        },
        {
            "name": "temperature",
            "data": [[0.72]],
            "dtype": "float32"
        },
        {
            "name": "repetition_penalty",
            "data": [[1.13]],
            "dtype": "float32"
        },
        {
            "name": "random_seed",
            "data": [[0]],
            "dtype": "int32"
        },
        {
            "name": "runtime_top_k",
            "data": [[0]],
            "dtype": "int32"
        },
        {
            "name": "runtime_top_p",
            "data": [[0.725]],
            "dtype": "float32"
        },
        {
            "name": "stop_words_list",
            "data": [[[198], [1]]],
            "dtype": "int32"
        },
        {
            "name": "bad_words_list",
            "data": [[[77, 15249, 77], [2, 5, 7]]],
            "dtype": "int32"
        }
    ]
}

padding_side = "right"
tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side=padding_side)
pad_token_id = 50256
tokenizer.pad_token_id = pad_token_id
assert tokenizer.pad_token_id == pad_token_id, 'incorrect padding token'
tokenizer.padding_side = padding_side
tokenizer.truncation_side = padding_side

def to_word_list_format(words):
    flat_ids = []
    offsets = []
    item_flat_ids = []
    item_offsets = []

    for word in words:
        ids = tokenizer.encode(word)

        if len(ids) == 0:
            continue

        item_flat_ids += ids
        item_offsets.append(len(ids))

    flat_ids.append(np.array(item_flat_ids))
    offsets.append(np.cumsum(np.array(item_offsets)))

    pad_to = max(1, max(len(ids) for ids in flat_ids))

    for i, (ids, offs) in enumerate(zip(flat_ids, offsets)):
        flat_ids[i] = np.pad(ids, (0, pad_to - len(ids)), constant_values=0)
        offsets[i] = np.pad(offs, (0, pad_to - len(offs)), constant_values=-1)

    return np.array([flat_ids, offsets], dtype="int32").transpose((1, 0, 2))

def load_bad_word_ids():
    forbidden = [
        'test'
    ]
    return to_word_list_format(forbidden)

GENERATION_CONFIG["request"][-1]["data"] = load_bad_word_ids()

def generate_parameters_from_texts(texts, random_seed=None):
    params = deepcopy(GENERATION_CONFIG["request"])
    inputs = tokenizer(texts, return_tensors="np", add_special_tokens=False, padding=True)
    input_ids_no_pad = tokenizer(texts, return_tensors="np", add_special_tokens=False, padding=False).input_ids
    input_ids = inputs.input_ids
    input_sizes = [len(sample_input_ids) for sample_input_ids in input_ids_no_pad]
    # print("INPUT")
    # print(inputs.input_ids)
    # print(input_ids.shape)
    # print("#" * 100)
    random_seed_index = 0
    for index, value in enumerate(params):

        if value['name'] == 'input_ids':
            data = np.array([np.array(data) for data in input_ids], dtype=value['dtype'])
        elif value['name'] == 'input_lengths':
            value_data = [[len(sample_input_ids)] for sample_input_ids in input_ids_no_pad]
            data = np.array([data for data in value_data], dtype=value['dtype'])
        elif value['name'] == 'random_seed':
            random_seed_index = index
            data = np.array([[random_seed] for _ in range(len(input_ids))], dtype=value['dtype'])
        # elif value['name'] == 'beam_width':
        #     data = np.array([[len(input_ids)] for _ in range(len(input_ids))], dtype=value['dtype'])
        else:
            data = np.array([data for data in value['data']] * len(input_ids), dtype=value['dtype'])

        params[index] = {
            'name': value['name'],
            'data': data,
        }

    if random_seed == -1:
        params.pop(random_seed_index)

    return params, input_sizes

def prepare_tensor(client, name, input):
    t = client.InferInput(name, input.shape, np_to_triton_dtype(input.dtype))
    t.set_data_from_numpy(input)
    return t

def get_last_message(text):
    if text == "":
        return ""
    if text[-1] == "\n":
        return ""
    last_raw = text.split("\n")[-1]
    last_message = last_raw.split(":")[-1]
    return last_message.strip()

def triton_inference(inference_client, texts, random_seed=None):
    request, input_sizes = generate_parameters_from_texts(texts, random_seed)
    # print(request)
    payload = [prepare_tensor(httpclient, field['name'], field['data'])
               for field in request]
    result = inference_client.infer(DEFAULT_CONFIG['model_name'], payload)
    output_texts = []
    output_texts_cropped = []

    for input_size_tokens, output in zip(input_sizes, result.get_response()['outputs']):
        if output['name'] == "output_ids":
            for output_ids in result.as_numpy(output['name']):
                output_ids = [int(output_id) for output_id in list(output_ids[0])]
                # print(output_ids)
                output_text = tokenizer.decode(output_ids, skip_special_tokens=True).strip()
                output_texts.append(output_text)

                output_texts_cropped.append(
                    get_last_message(output_text)
                )
                # output_texts_cropped.append(
                #     tokenizer.decode(
                #         output_ids[input_size_tokens:], skip_special_tokens=True
                #     ).strip()
                # )
    print(output_texts_cropped)
    return output_texts

def get_stats(texts):
    input_ids = tokenizer(texts, return_tensors="np", add_special_tokens=False).input_ids
    input_sizes = [len(sample_input_ids) for sample_input_ids in input_ids]
    return input_sizes

def main():
    client = httpclient.InferenceServerClient(DEFAULT_CONFIG['url'], verbose=DEFAULT_CONFIG['verbose'], concurrency=1)

    INPUT_EXAMPLES = dataset["train"]["text"][:2]
    example1 = INPUT_EXAMPLES[0]
    example2 = INPUT_EXAMPLES[1]

    random_seed = 0

    correct_output1 = triton_inference(client, [example1], random_seed=random_seed)[0]
    example1_response = triton_inference(client, [example1, example1], random_seed)
    print(f"Should be TRUE: {example1_response[0] == example1_response[1] == correct_output1}")

    correct_output2 = triton_inference(client, [example2], random_seed=random_seed)[0]
    example2_response = triton_inference(client, [example2, example2], random_seed)
    print(f"Should be TRUE: {example2_response[0] == example2_response[1] == correct_output2}")

    output1, output2 = triton_inference(client, [example1, example2], random_seed=random_seed)

    print(f"Should be TRUE: {correct_output1 == output1}")
    print(f"Should be TRUE: {correct_output2 == output2}")

if __name__ == "__main__":
    main()

If I remove the seed entirely by setting it to -1, I get the following output:

['Yeah']
['Yeah', 'Yeah']
Should be TRUE: True
["Okay. But first you have to prove to me that you're worth keeping around."]
["Okay. But first you have to prove to me that you're worth keeping around.", "Okay. But first you have to prove to me that you're worth keeping around."]
Should be TRUE: True
['Yeah', "Okay. But first you have to prove to me that you're worth keeping around."]
Should be TRUE: True
Should be TRUE: True

If I set the seed to any number (for example 0):

['What about you?']
['Yeah', 'Yeah']
Should be TRUE: False
['Then tell me who owns them. Tell me who you belong to.']
["Okay. But first you have to prove to me that you're worth keeping around.", "Okay. But first you have to prove to me that you're worth keeping around."]
Should be TRUE: False
['Yeah', "Okay. But first you have to prove to me that you're worth keeping around."]
Should be TRUE: False
Should be TRUE: False
AlekseyKorshuk commented 2 years ago

About the inference time. Here is the nsys report: https://drive.google.com/file/d/1eWIyPV7AwBHVCcvy57mpVFf13EYy0AH-/view?usp=sharing

byshiue commented 2 years ago

About the inference time. Here is the nsys report: https://drive.google.com/file/d/1eWIyPV7AwBHVCcvy57mpVFf13EYy0AH-/view?usp=sharing

Can you provide reports for both the older and the newer version?

AlekseyKorshuk commented 2 years ago

Sure, I will do this now.

AlekseyKorshuk commented 2 years ago

Here is the report for the old one: https://drive.google.com/file/d/1eDfiRvEPfvCBFIXFRWUz5GsrxXnpaL9w/view?usp=sharing

byshiue commented 2 years ago

For the issue of different results, it looks like a bug in the config.pbtxt of your demo. Please try changing random_seed in config.pbtxt from TYPE_INT32 to TYPE_UINT64.
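
On the client side, the corresponding request entry then needs to be sent with a matching dtype; a sketch, assuming the request format used in the scripts above:

# random_seed must be sent as uint64 to match TYPE_UINT64 in config.pbtxt.
seed_entry = {
    "name": "random_seed",
    "data": [[0]],
    "dtype": "uint64",  # was "int32"
}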

AlekseyKorshuk commented 2 years ago

Here is the code to benchmark endpoints:

import itertools
import numpy as np
import pandas as pd
import tritonclient.http as httpclient
from collections.abc import Mapping
from tritonclient.utils import np_to_triton_dtype
from transformers import AutoTokenizer
import random
import tqdm
from copy import deepcopy
import time
from datasets import load_dataset

MAX_SIZE = 1024

DTYPE = "uint32"

def _prepare_tensor(client, name, input):
    t = client.InferInput(name, input.shape, np_to_triton_dtype(input.dtype))
    t.set_data_from_numpy(input)
    return t

class BenchMarkGPTJ6B(object):
    def __init__(self):
        self.request_batch = []
        self.batch_size = 1
        self.benchmark_averages = {}
        self.iterations = []
        self.total_duration = 0.0
        self.total_output_token_size = 0
        self.cut_off_to_measure = 10
        self.num_exps = 10
        self.dataset = None
        self.input_examples = None
        self.tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
        self.tokenizer.pad_token_id = 50256
        self.tokenizer.truncation_side = "left"
        self.tokenizer.padding_side = "right"
        self.prepare_examples()

    def prepare_examples(self):
        self.dataset = load_dataset("ChaiML/user_model_inputs")
        # value = random.randint(0, len(self.dataset["train"]) - self.batch_size - 1)
        value = 0
        self.input_examples = self.dataset["train"]["text"][value:value + self.batch_size]

    def _generate_inputs(self, input_size):
        input_ids = self.tokenizer(self.input_examples, return_tensors="np", padding="longest",
                                   truncation=True, max_length=input_size).input_ids
        inputs_no_pad = self.tokenizer(self.input_examples, return_tensors="np", padding="longest",
                                       truncation=True, max_length=input_size).input_ids
        input_lengths = [len(x) for x in inputs_no_pad]
        return input_ids, input_lengths

    def construct_request_input(self, request, input_size, context_size, batch_size=1):
        input_ids, input_lengths = self._generate_inputs(input_size)

        for input_ids_, input_lengths_ in zip(input_ids, input_lengths):
            # input_ids
            request["request"][0]["data"].append(input_ids_)
            # input_lengths
            request["request"][1]["data"].append([input_lengths_])

        # request_output_len
        output_size = context_size - input_size
        assert (context_size <= MAX_SIZE)

        request["request"][2]["data"] = [[output_size]] * batch_size

        # beam_search_diversity_rate
        request["request"][3]["data"] = [[0]] * batch_size

        # temperature
        request["request"][4]["data"] = [[0.72]] * batch_size

        # len_penalty
        request["request"][5]["data"] = [[1.0]] * batch_size

        # repetition_penalty
        request["request"][6]["data"] = [[1.13]] * batch_size

        # random_seed
        request["request"][7]["data"] = [[1]] * batch_size

        # is_return_log_probs
        request["request"][8]["data"] = [[False]] * batch_size

        # beam_width
        request["request"][9]["data"] = [[1]] * batch_size

        # runtime_top_k
        request["request"][10]["data"] = [[0]] * batch_size

        # runtime_top_p
        request["request"][11]["data"] = [[0.725]] * batch_size

        # stop_words_list
        request["request"][12]["data"] = [[[198], [1]]] * batch_size

        # bad_words_list
        request["request"][13]["data"] = [[[0], [-1]]] * batch_size

        return request

    def construct_data(self, verbose=False):
        # Change if you want to iterate over MULTIPLE which
        # generates sequences up to the N-th multiple of 2

        # self.input_sizes = _iter_token_sizes(multiple=7, default=False, verbose=False)
        self.iterations = []
        self.request_batch = []
        self.input_sizes = [512]
        self.context_sizes = [512 + 64]

        assert len(self.input_sizes) == len(self.context_sizes)

        for r in itertools.product(self.input_sizes, self.context_sizes):
            if r[0] < r[1]:
                self.iterations.append(r)

        print(f"iterations : {self.iterations}")

        if verbose:
            print(f"self.input_sizes : {self.input_sizes}")
        global DTYPE
        for (input_size, context_size) in self.iterations:
            request_template = {
                "request": [
                    {
                        "name": "input_ids",
                        "data": [],
                        "dtype": DTYPE
                    },
                    {
                        "name": "input_lengths",
                        "data": [],
                        "dtype": DTYPE
                    },
                    {
                        "name": "request_output_len",
                        "data": [],
                        "dtype": DTYPE
                    },
                    {
                        "name": "beam_search_diversity_rate",
                        "data": [],
                        "dtype": "float32"
                    },
                    {
                        "name": "temperature",
                        "data": [],
                        "dtype": "float32"
                    },
                    {
                        "name": "len_penalty",
                        "data": [],
                        "dtype": "float32"
                    },
                    {
                        "name": "repetition_penalty",
                        "data": [],
                        "dtype": "float32"
                    },
                    {
                        "name": "random_seed",
                        "data": [],
                        "dtype": "int32"
                    },
                    {
                        "name": "is_return_log_probs",
                        "data": [],
                        "dtype": "bool"
                    },
                    {
                        "name": "beam_width",
                        "data": [],
                        "dtype": DTYPE
                    },
                    {
                        "name": "runtime_top_k",
                        "data": [],
                        "dtype": DTYPE
                    },
                    {
                        "name": "runtime_top_p",
                        "data": [],
                        "dtype": "float32"
                    },
                    {
                        "name": "stop_words_list",
                        "data": [],
                        "dtype": "int32"
                    },
                    {
                        "name": "bad_words_list",
                        "data": [],
                        "dtype": "int32"
                    }
                ]
            }

            if verbose:
                print(f"input_size : {input_size}")
                print(f"context_size : {context_size}")

            self.request_batch.append(
                self.construct_request_input(request_template, input_size, context_size, self.batch_size))

        for request in self.request_batch:
            for index, value in enumerate(request['request']):
                if verbose:
                    print(f"value['name'] : {value['name']}")
                    print(f"value : {value}")
                # print(value)
                request['request'][index] = {
                    'name': value['name'],
                    'data': np.array(value['data'], dtype=value['dtype']),
                }

    def get_last_message(self, text):
        if text == "":
            return ""
        if text[-1] == "\n":
            return ""
        last_raw = text.split("\n")[-1]
        last_message = last_raw.split(":")[-1]
        return last_message.strip()

    def postprocess(self, output_ids):
        return [
            self.get_last_message(self.tokenizer.decode(output_ids_[0], skip_special_tokens=True).strip()) for
            output_ids_ in
            output_ids
        ]

    def get_result(self, client, config, request, verbose=False):
        payload = [_prepare_tensor(CLIENT_TYPE, field['name'], field['data'])
                   for field in request['request']]

        start_time = time.time()
        result = client.infer(config['model_name'], payload)
        duration = time.time() - start_time

        output_text = None
        for output in result.get_response()['outputs']:
            if output['name'] == "output_ids":
                output_ids = result.as_numpy(output['name'])
                output_text = self.postprocess(output_ids)
                if verbose:
                    print(f"output_text : {output_text}")
            if verbose:
                print("{}:\n{}\n".format(output['name'], result.as_numpy(output['name'])))

        return duration, result.as_numpy(result.get_response()['outputs'][1]['name'])[0][0], output_text

    def warmup(self, client):
        request = self.request_batch[0]
        for _ in tqdm.trange(10, desc="Warming up"):
            self.get_result(client, DEFAULT_CONFIG, request, verbose=False)

    def benchmark(self, client, verbose=False):
        self.benchmark_averages = dict()
        self.construct_data(verbose=verbose)
        self.warmup(client)

        output_text = None
        for exp_no in tqdm.trange(self.num_exps, desc="Benchmarking"):
            for index, request in enumerate(self.request_batch):
                if verbose:
                    print(f"============================================")
                    print(f"exp_no : {exp_no}")
                    print(f"Input Token Size : {self.iterations[index][0]}")
                    print(f"Batch Size : {self.batch_size}\n")
                duration, result, output_text = self.get_result(client=client, config=DEFAULT_CONFIG, request=request,
                                                                verbose=verbose)

                if verbose:
                    print(f"Output Token Size : {result - self.iterations[index][0]}")

                self.total_output_token_size += result - self.iterations[index][0]
                self.total_duration += duration

                if verbose:
                    print(f"#############################\n")
                    print(
                        f"Inference time for Output Token Size = {result - self.iterations[index][0]} : {duration * 1000} ms \n")
                    print(f"#############################")
                    print(f"============================================")

                tokens_per_second = (result - self.iterations[index][0]) / duration

                if verbose:
                    print(f"#############################\n")
                    print(f"tokens_per_second : {tokens_per_second} Tokens/Second \n")
                    print(f"#############################")

                json_result = {"model": "gpt-j-6b", "outputTokens": int(result - self.iterations[index][0]),
                               "inputTokens": self.iterations[index][0], "contextSize": self.iterations[index][1],
                               "duration_ms": duration * 1000, "tokensPerSecond": tokens_per_second,
                               "output_text": output_text}

                exp_key = f"{self.iterations[index][0]}-{self.iterations[index][1]}-{int(self.iterations[index][1] - self.iterations[index][0])}-gpt-j-6b-{exp_no}"

                self.benchmark_averages[exp_key] = json_result

    def get_stats(self):
        stats = {
            "duration_ms": [],
            "tokensPerSecond": [],
            "outputTokens": [],
        }
        for key, value in self.benchmark_averages.items():
            for stat_key, stat_value in stats.items():
                stats[stat_key].append(value[stat_key])

        df = pd.DataFrame(stats)
        return df

    def print_stats(self):
        df = self.get_stats()
        print()
        print("=" * 21, "STATS", "=" * 21)
        print(df.describe())
        print("=" * 49)

DTYPE = "uint32"
URL = "lit-v2-triton-old.tenant-chairesearch-test.knative.chi.coreweave.com"
DEFAULT_CONFIG = {
    'protocol': 'http',
    'url': f'{URL}:80',
    'model_name': 'fastertransformer',
    'verbose': False,
}
CLIENT_TYPE = httpclient

client = httpclient.InferenceServerClient(DEFAULT_CONFIG['url'], verbose=DEFAULT_CONFIG['verbose'], concurrency=10)
bench = BenchMarkGPTJ6B()
bench.benchmark(client, verbose=False)
bench.print_stats()

DTYPE = "int32"
URL = "lit-v2-triton-latest.tenant-chairesearch-test.knative.chi.coreweave.com"
DEFAULT_CONFIG = {
    'protocol': 'http',
    'url': f'{URL}:80',
    'model_name': 'fastertransformer',
    'verbose': False,
}
client = httpclient.InferenceServerClient(DEFAULT_CONFIG['url'], verbose=DEFAULT_CONFIG['verbose'], concurrency=10)
bench.benchmark(client, verbose=False)
bench.print_stats()

Output:

OLD: 
===================== STATS =====================
       duration_ms  tokensPerSecond  outputTokens
count    10.000000        10.000000          10.0
mean    469.406748        22.110939          10.0
std     105.130958         4.031355           0.0
min     395.360947        15.170820          10.0
25%     396.981299        20.390768          10.0
50%     421.901941        23.702575          10.0
75%     492.536068        25.190130          10.0
max     659.160137        25.293343          10.0
=================================================

NEW:
===================== STATS =====================
       duration_ms  tokensPerSecond  outputTokens
count    10.000000        10.000000          10.0
mean   1426.600003         4.222481           6.0
std     101.017563         0.262177           0.0
min    1356.654882         3.531660           6.0
25%    1375.118494         4.174454           6.0
50%    1393.091917         4.307011           6.0
75%    1437.314212         4.363261           6.0
max    1698.917866         4.422643           6.0
=================================================
byshiue commented 2 years ago

But from your reports, the newer one (nsys-report-4c19) takes only 7 seconds to handle 5 sentences, while the older one takes 10 seconds to handle 5 sentences.

I have tested again on the latest code and the benchmark looks like:

===================== STATS =====================
       duration_ms  tokensPerSecond  outputTokens
count    10.000000        10.000000          10.0
mean    351.808929        17.054744           6.0
std       0.504247         0.024379           0.0
min     351.335526        16.990481           6.0
25%     351.517081        17.051868           6.0
50%     351.705194        17.059742           6.0
75%     351.867616        17.068872           6.0
max     353.138924        17.077692           6.0
=================================================

on A100 when the input length is 512 and the output length is 32, which is close to our benchmark at https://github.com/NVIDIA/FasterTransformer/blob/main/docs/gpt_guide.md#performance-of-gpt-67b (although that is a benchmark of GPT-6.7B, not GPT-J).

Can you build the code again to make sure you can still reproduce this issue? If you can, can you provide the commit that reproduces it?

AlekseyKorshuk commented 2 years ago

I still have a lower speed:

===================== STATS =====================
       duration_ms  tokensPerSecond  outputTokens
count    10.000000        10.000000          10.0
mean   1398.001814         7.166076          10.0
std      65.583614         0.308294           0.0
min    1362.500906         6.369585          10.0
25%    1363.800526         7.220111          10.0
50%    1370.553136         7.296327          10.0
75%    1385.023892         7.332453          10.0
max    1569.961071         7.339445          10.0
=================================================

Can you share the needed config.pbtxt?

AlekseyKorshuk commented 2 years ago

The issue with the seed is fixed by setting TYPE_UINT64, thank you! The only thing left to resolve is the speed; maybe my config is still wrong.

AlekseyKorshuk commented 2 years ago

This is the config file I am using:

# Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
    # Redistribution and use in source and binary forms, with or without
    # modification, are permitted provided that the following conditions
    # are met:
    #  * Redistributions of source code must retain the above copyright
    #    notice, this list of conditions and the following disclaimer.
    #  * Redistributions in binary form must reproduce the above copyright
    #    notice, this list of conditions and the following disclaimer in the
    #    documentation and/or other materials provided with the distribution.
    #  * Neither the name of NVIDIA CORPORATION nor the names of its
    #    contributors may be used to endorse or promote products derived
    #    from this software without specific prior written permission.
    #
    # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS AND ANY
    # EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
    # IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
    # PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
    # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
    # EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
    # PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
    # PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
    # OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
    # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
    # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

    name: "fastertransformer"
    backend: "fastertransformer"
    default_model_filename: "gpt-j-6b"
    max_batch_size: 1024

    model_transaction_policy {
      decoupled: False
    }

    input [
      {
        name: "input_ids"
        data_type: TYPE_INT32
        dims: [ -1 ]
      },
      {
        name: "start_id"
        data_type: TYPE_INT32
        dims: [ 1 ]
        reshape: { shape: [ ] }
        optional: true
      },
      {
        name: "end_id"
        data_type: TYPE_INT32
        dims: [ 1 ]
        reshape: { shape: [ ] }
        optional: true
      },
      {
        name: "input_lengths"
        data_type: TYPE_INT32
        dims: [ 1 ]
        reshape: { shape: [ ] }
      },
      {
        name: "request_output_len"
        data_type: TYPE_INT32
        dims: [ -1 ]
      },
      {
        name: "runtime_top_k"
        data_type: TYPE_INT32
        dims: [ 1 ]
        reshape: { shape: [ ] }
        optional: true
      },
      {
        name: "runtime_top_p"
        data_type: TYPE_FP32
        dims: [ 1 ]
        reshape: { shape: [ ] }
        optional: true
      },
      {
        name: "beam_search_diversity_rate"
        data_type: TYPE_FP32
        dims: [ 1 ]
        reshape: { shape: [ ] }
        optional: true
      },
      {
        name: "temperature"
        data_type: TYPE_FP32
        dims: [ 1 ]
        reshape: { shape: [ ] }
        optional: true
      },
      {
        name: "len_penalty"
        data_type: TYPE_FP32
        dims: [ 1 ]
        reshape: { shape: [ ] }
        optional: true
      },
      {
        name: "repetition_penalty"
        data_type: TYPE_FP32
        dims: [ 1 ]
        reshape: { shape: [ ] }
        optional: true
      },
      {
        name: "random_seed"
        data_type: TYPE_UINT64
        dims: [ 1 ]
        reshape: { shape: [ ] }
        optional: true
      },
      {
        name: "is_return_log_probs"
        data_type: TYPE_BOOL
        dims: [ 1 ]
        reshape: { shape: [ ] }
        optional: true
      },
      {
        name: "beam_width"
        data_type: TYPE_INT32
        dims: [ 1 ]
        reshape: { shape: [ ] }
        optional: true
      },
      {
        name: "bad_words_list"
        data_type: TYPE_INT32
        dims: [ 2, -1 ]
        optional: true
      },
      {
        name: "stop_words_list"
        data_type: TYPE_INT32
        dims: [ 2, -1 ]
        optional: true
      },
      {
        name: "prompt_learning_task_name_ids"
        data_type: TYPE_INT32
        dims: [ 1 ]
        reshape: { shape: [ ] }
        optional: true
      }
    ]
    output [
      {
        name: "output_ids"
        data_type: TYPE_INT32
        dims: [ -1, -1 ]
      },
      {
        name: "sequence_length"
        data_type: TYPE_INT32
        dims: [ -1 ]
      },
      {
        name: "cum_log_probs"
        data_type: TYPE_FP32
        dims: [ -1 ]
      },
      {
        name: "output_log_probs"
        data_type: TYPE_FP32
        dims: [ -1, -1 ]
      }
    ]
    instance_group [
      {
        count: 1
        kind: KIND_CPU
      }
    ]
    parameters {
      key: "tensor_para_size"
      value: {
        string_value: "1"
      }
    }
    parameters {
      key: "pipeline_para_size"
      value: {
        string_value: "1"
      }
    }
    parameters {
      key: "data_type"
      value: {
        string_value: "fp16"
      }
    }
    parameters {
      key: "model_type"
      value: {
        string_value: "GPT-J"
      }
    }
    parameters {
      key: "model_checkpoint_path"
      value: {
        string_value: "/mnt/pvc/triton-model-store-lit-latest/fastertransformer/1"
      }
    }
    parameters {
      key: "enable_custom_all_reduce"
      value: {
        string_value: "0"
      }
    }
AlekseyKorshuk commented 2 years ago

Benchmark on A100 40GB (same model as before):

===================== STATS =====================
       duration_ms  tokensPerSecond  outputTokens
count    10.000000        10.000000          10.0
mean    863.273740        11.884291          10.0
std     174.210427         1.668077           0.0
min     774.039030         7.424325          10.0
25%     777.982295        11.719851          10.0
50%     799.238801        12.517188          10.0
75%     853.254139        12.853763          10.0
max    1346.923828        12.919245          10.0
=================================================

I also ran the same evaluation with an example model: https://mystic.the-eye.eu/public/AI/GPT-J-6B/step_383500_slim.tar.gz

===================== STATS =====================
       duration_ms  tokensPerSecond  outputTokens
count    10.000000        10.000000          10.0
mean   1430.156088         9.789152          14.0
std       1.561623         0.010689           0.0
min    1427.980185         9.773862          14.0
25%    1428.893328         9.779634          14.0
50%    1430.096388         9.789551          14.0
75%    1431.546509         9.797793          14.0
max    1432.391882         9.804058          14.0
=================================================

I assume this is a sign that the problem is not with the model, but with something else.

byshiue commented 2 years ago

Can you provide the commits for the slower and the faster builds?

AlekseyKorshuk commented 2 years ago

I am not sure about the exact commits, but here is what I have:

Both images are based on the same Dockerfile provided above; the only difference is the build date.

byshiue commented 2 years ago

I am not sure about the exact commits, but here is what I have:

Both images are based on the same Dockerfile provided above; the only difference is the build date.

Thank you for the information. We have found the reason and will fix it ASAP.

byshiue commented 2 years ago

@AlekseyKorshuk I have pushed the fix to FT. Please try to rebuild the Docker image and test again.

AlekseyKorshuk commented 2 years ago

@byshiue Thank you, got the following result:

===================== STATS =====================
       duration_ms  tokensPerSecond  outputTokens
count    10.000000        10.000000          10.0
mean    438.109899        22.906456          10.0
std      28.672853         1.377898           0.0
min     421.448946        20.098810          10.0
25%     422.072709        22.975475          10.0
50%     423.495412        23.613012          10.0
75%     435.325563        23.692601          10.0
max     497.541904        23.727666          10.0
=================================================

I will close this issue now and reopen it if anything changes after a detailed evaluation. Again, thank you for the quick fix!

AlekseyKorshuk commented 2 years ago

@byshiue After A/B testing with the current latest build, the quality of responses dropped by ~50%: from 7.72 to 4.02 (higher is better). Maybe you know the issue off the top of your head? Otherwise I might need to find a way to reproduce it, without A/B testing and with a fast feedback loop, to share with you.

I compared 2 setups:

  1. OLD (same "old" as before): https://hub.docker.com/layers/rtalaricw/gptj_ft/v1.2/images/sha256-816f96fdc80c962f0ef4968fe925555453da375c13edc6b0754142cd37dc7628?context=explore
  2. The latest build, with the speed results posted in the message above before closing the issue.

"Old" setup shows exactly the same score as default pytorch. Since "old" version can be used only with batch size 1, it can show good result in terms of quality without batching.

byshiue commented 2 years ago

I am confused by

"Old" setup shows exactly the same score as default pytorch. Since "old" version can be used only with batch size 1, it can show good result in terms of quality without batching.

Why can the old version only be used with batch size 1? What about running the new version with batch size 1?

From our internal tests, we don't find any difference, so it is hard to check this issue with the current workflow. We need a workflow that reproduces it. If you can find a test (like summarization) where we can compare the scores of FT and HF, or of the new and old versions, we can take a look.
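
A minimal sketch of such a comparison, printing the FT-served output next to a local Hugging Face baseline on the same prompts (the model name, greedy settings, and the dataset, client, and triton_inference names from the scripts above are assumptions; for an exact match the Triton request would also need greedy sampling settings):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

hf_tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
hf_model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B", torch_dtype=torch.float16
).cuda()

def hf_generate(text, max_new_tokens=64):
    # Greedy decoding keeps the baseline deterministic.
    ids = hf_tok(text, return_tensors="pt").input_ids.cuda()
    out = hf_model.generate(ids, max_new_tokens=max_new_tokens, do_sample=False)
    return hf_tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True).strip()

for text in dataset["train"]["text"][:5]:
    print("HF:", hf_generate(text))
    print("FT:", triton_inference(client, [text], random_seed=0)[0])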

AlekseyKorshuk commented 2 years ago

@byshiue Sorry for the late reply. Sharing with you that everything now works fine. The last problem was in my inference code (just my bug). The quality of responses is the same, and batched inference with padding works. After A/B testing I can say it gives a 1.35-1.65x speedup -> cost reduction. Thank you so much for the quick fix, I enjoyed chatting with you 🤗