triton-inference-server / fastertransformer_backend


After using the Triton fastertransformer backend, inference speed is severely reduced #78

Closed. PAOPAO6 closed this issue 1 year ago.

PAOPAO6 commented 1 year ago

Description

After switching to the Triton fastertransformer backend, inference with the same model and the same data is much slower than with the PyTorch op code.

model: mt5

Reproduced Steps

Results (cps = source characters per second):

torch op (FasterTransformer PyTorch op):
ft: 100%|█████████████████████████████████████| 976/976 [00:46<00:00, 20.85it/s]
engine: ft, batch: 1, cps:2513.5870329060313, elapsed: 46.808405s, bleu: 25.534268557480004
ft: 100%|█████████████████████████████████████| 488/488 [00:26<00:00, 18.49it/s]
engine: ft, batch: 2, cps:4457.269456353881, elapsed: 26.396654s, bleu: 25.542048594569284
ft: 100%|█████████████████████████████████████| 244/244 [00:14<00:00, 17.09it/s]
engine: ft, batch: 4, cps:8239.068868524504, elapsed: 14.280376s, bleu: 25.502996135441286
ft: 100%|█████████████████████████████████████| 122/122 [00:07<00:00, 15.83it/s]
engine: ft, batch: 8, cps:15264.420564754077, elapsed: 7.707924s, bleu: 25.52744772193587

triton backend:
100%|█████████████████████████████████████████| 976/976 [00:55<00:00, 17.52it/s]
batch: 1, cps:2112.1205511740964, elapsed: 55.705627s bleu: 25.503309091716304
100%|█████████████████████████████████████████| 488/488 [00:41<00:00, 11.89it/s]
batch: 2, cps:2867.621256393294, elapsed: 41.029477s bleu: 25.531731907295494
100%|█████████████████████████████████████████| 244/244 [00:24<00:00,  9.89it/s]
batch: 4, cps:4770.055153228809, elapsed: 24.665753s bleu: 25.514632321649643
100%|█████████████████████████████████████████| 122/122 [00:13<00:00,  9.22it/s]
batch: 8, cps:8888.158091941497, elapsed: 13.237501s bleu: 25.525217557230196
byshiue commented 1 year ago

Please share the steps to reproduce.

You can also use nsys to make sure you are not spending too much time on data copies.

PAOPAO6 commented 1 year ago

> Please share the steps to reproduce. You can also use nsys to make sure you are not spending too much time on data copies.

The steps to reproduce:

git clone https://github.com/triton-inference-server/fastertransformer_backend.git
cd fastertransformer_backend && git checkout -b t5_gptj_blog remotes/origin/dev/t5_gptj_blog

docker build --rm --build-arg TRITON_VERSION=22.08 -t triton_with_ft:22.08 -f docker/Dockerfile .

docker run -e NVIDIA_VISIBLE_DEVICES=0 --name triton_ft --shm-size=4G --entrypoint "bash" -p 5722:22 -p 5780:8080 -v /data/:/data/ -itd triton_with_ft:22.08

docker exec -it triton_ft bash
cd /data/mt/hbl/models/average1050000  # model path

CUDA_VISIBLE_DEVICES=0 mpirun -n 1 --allow-run-as-root /opt/tritonserver/bin/tritonserver --model-repository=/data/mt/hbl/models/average1050000/all_models/t5 --http-port 8080

python3 client.py

client.py

from hbl.triton_client_mt import client_init, MT5Req, pad_multi_lines, partition
import gevent.ssl
import numpy as np
import tritonclient.http as httpclient
import sentencepiece as sp
from hbl.utils import elapsed_timer, read_paral_data
from tqdm import tqdm
import sys
sys.path.append('/opt/app')
from mttool.metric import metric

lang2short = {'English': 'en', 'Korean': 'ko', 'Japanese': 'ja', 'Chinese': 'zh', 'Bengali': 'bn',
              'Filipino': 'fil', 'Hindi': 'hi', 'Indonesian': 'id', 'Lao': 'lo', 'Malay': 'ms',
              'Thai': 'th', 'Urdu': 'ur', 'Vietnamese': 'vi', 'French': 'fr', 'Spanish': 'es',
              'Italian': 'it', 'German': 'de'}

# one BLEU scorer per target language
bleu_dict = {}
for val in lang2short.values():
    bleu_dict[val] = metric.Bleu(val)


def bleu_(preds, labels, lang):
    total = 0.0
    d = {}
    for p, g in zip(preds, labels):
        try:
            if lang not in d:
                d[lang] = ([p], [g])
            else:
                d[lang][0].append(p)
                d[lang][1].append(g)
        except:
            print("===============================")

    for lang in d:
        total += bleu_dict[lang].multi_evaluate_with_preprocess(d[lang][0], d[lang][1])
    bleu = total / len(d)
    return bleu


class Mt5Ft(MT5Req):

    def __init__(self, host):
        super().__init__(host)

    def preprocess(self, inputs, model, fr, to):
        lang1, lang2 = self.lang_dict[fr], self.lang_dict[to]
        prefix = "translate {} to {}: ".format(lang1, lang2)
        arr = [prefix + text for text in inputs]

        ids = [self.spms[model].encode(text) + [self.eos] for text in arr]
        return ids

    def infer(self, model, ids,
              inputs='input_ids', sequence_length='sequence_length', max_output_len='max_output_len',
              output0='output_ids', output1='sequence_length',
              request_compression_algorithm=None,
              response_compression_algorithm=None):
        input_arr = []
        outputs = []
        bz = len(ids)
        seq_lens = [[len(arr)] for arr in ids]
        ids = pad_multi_lines(ids, self.pad)
        sl = len(ids[0])
        # cap the requested output length relative to the padded input length
        max_len = min(sl * 1.6 + 8, 256)
        outputs.append(httpclient.InferRequestedOutput(output0, binary_data=True))
        outputs.append(httpclient.InferRequestedOutput(output1, binary_data=True))
        input_arr.append(httpclient.InferInput("input_ids", [bz, sl], "UINT32"))
        input_arr.append(httpclient.InferInput("sequence_length", [bz, 1], "UINT32"))
        input_arr.append(httpclient.InferInput("max_output_len", [bz, 1], "UINT32"))

        input_arr[0].set_data_from_numpy(np.array(ids).astype(np.uint32), binary_data=False)
        # input_arr[0].set_data_from_numpy(np.array([[1, 1, 1, 1]]).astype(np.uint32), binary_data=False)
        input_arr[1].set_data_from_numpy(np.array(seq_lens).astype(np.uint32), binary_data=False)
        input_arr[2].set_data_from_numpy(np.array([[max_len]] * len(seq_lens)).astype(np.uint32), binary_data=False)

        results = client.infer(
            model_name=model,
            inputs=input_arr,
            outputs=outputs,
            # query_params=query_params,
            request_compression_algorithm=False,
            response_compression_algorithm=False, timeout=1000)

        output_ids = results.as_numpy(output0)
        decoder_lens = results.as_numpy(output1)
        return output_ids


if __name__ == '__main__':

    client = client_init("10.5.210.91:5780")

    cli = Mt5Ft(client)

    # warmup request
    print(cli.inference(['สวัสดีครับ', 'ที่ตั้งโรงแรมดี'], 'th', 'zh'))
    dataset = read_paral_data('/data/bak/evaluation/bn_en_fil_hi_id_lo_ms_th_ur_vi_zh/eval/data/TestEval/TripAdvisor/comment/en_vi/origin',
                              'ta_comment', ['en', 'vi'])
    fr, to = 'vi', 'en'
    # total number of source-side characters, used for the cps metric
    char_count = 0
    for line in dataset['vi']:
        char_count += len(line)

    for batch in [1, 2, 4, 8]:
        if batch == 1:
            data = [[text] for text in dataset[fr]]
        else:
            data = partition(dataset[fr], batch)

        rets = []
        with elapsed_timer() as elapsed:
            for par in tqdm(data):
                rets += cli.inference(par, 'vi', 'en')
                elap = elapsed()
            score = bleu_(rets, dataset[to], to)
            print('batch: {}, cps:{}, elapsed: {}s bleu: {}'.format(batch, char_count / elap,
                                                                    '%.6f' % elap, score))
byshiue commented 1 year ago

Sorry, the client.py is not clear; can you organize it again? Also, please provide the script you use to run the PyTorch side and the config.pbtxt for Triton.

PAOPAO6 commented 1 year ago

> Please share the steps to reproduce.
>
> You can also use nsys to make sure you are not spending too much time on data copies.

How do I profile the Triton service with nsys?
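For reference, one way is to launch tritonserver under nsys and capture a window while the client is running; a minimal sketch, where the output name, trace selection, and delay/duration values are only illustrative and nsys is assumed to be available inside the container:

    CUDA_VISIBLE_DEVICES=0 nsys profile -o ft_backend_trace -t cuda,nvtx,osrt \
        --delay 30 --duration 120 \
        /opt/tritonserver/bin/tritonserver \
        --model-repository=/data/mt/hbl/models/average1050000/all_models/t5 --http-port 8080

The resulting report can then be opened in the Nsight Systems GUI to see how much time goes to kernels versus host/device copies.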

PAOPAO6 commented 1 year ago

> Sorry, the client.py is not clear; can you organize it again? Also, please provide the script you use to run the PyTorch side and the config.pbtxt for Triton.

OK, let me reorganize it.

PAOPAO6 commented 1 year ago

> Sorry, the client.py is not clear; can you organize it again? Also, please provide the script you use to run the PyTorch side and the config.pbtxt for Triton.

I have put the process and code on GitHub: https://github.com/PAOPAO6/mt5_src. Thank you very much.

PAOPAO6 commented 1 year ago

> Sorry, the client.py is not clear; can you organize it again? Also, please provide the script you use to run the PyTorch side and the config.pbtxt for Triton.

Use branch main instead of t5_gptj_blog; there seems to be a problem with mt5 in t5_gptj_blog.

byshiue commented 1 year ago

Did you test both cases on the 22.10 docker image?

PAOPAO6 commented 1 year ago

> I don't find a warmup before measuring time.

The warmup is at triton_ft_client.py line 103: print(cli.inference(['สวัสดีครับ', 'ที่ตั้งโรงแรมดี'], 'th', 'zh')). This model supports en, zh, vi, th, ...

PAOPAO6 commented 1 year ago

> Did you test both cases on the 22.10 docker image?

OK, I'll try.

byshiue commented 1 year ago

> I don't find a warmup before measuring time.
>
> The warmup is at triton_ft_client.py line 103: print(cli.inference(['สวัสดีครับ', 'ที่ตั้งโรงแรมดี'], 'th', 'zh')). This model supports en, zh, vi, th, ...

This warmup is not helpful. You should:

  1. run more iterations, and
  2. warm up with inputs of similar length to your real inputs. I suggest you run your evaluation over the dataset twice and only measure the time of the second pass (see the sketch below).
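
For illustration, a minimal sketch of that measurement pattern, reusing the helpers from the client code above; run_epoch and timed_eval are hypothetical names, not part of the original script:

    from tqdm import tqdm
    from hbl.utils import elapsed_timer   # same timer helper used in client.py

    def run_epoch(cli, data, fr, to):
        # one full pass over the batched dataset
        rets = []
        for par in tqdm(data):
            rets += cli.inference(par, fr, to)
        return rets

    def timed_eval(cli, data, fr, to):
        # pass 1: warmup on the real inputs (same lengths, same batching)
        run_epoch(cli, data, fr, to)
        # pass 2: the measured run
        with elapsed_timer() as elapsed:
            rets = run_epoch(cli, data, fr, to)
            elap = elapsed()
        return rets, elap
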
PAOPAO6 commented 1 year ago

> I don't find a warmup before measuring time.
>
> The warmup is at triton_ft_client.py line 103: print(cli.inference(['สวัสดีครับ', 'ที่ตั้งโรงแรมดี'], 'th', 'zh')). This model supports en, zh, vi, th, ...
>
> This warmup is not helpful. You should:
>
>   1. run more iterations, and
>   2. warm up with inputs of similar length to your real inputs. I suggest you run your evaluation over the dataset twice and only measure the time of the second pass.

I ran the evaluation twice:

=================0 inference
100%|█████████████████████████████████████████| 976/976 [00:56<00:00, 17.14it/s]
batch: 1, cps:2066.4064948907007, elapsed: 56.937974s bleu: 25.503309091716304
100%|█████████████████████████████████████████| 488/488 [00:41<00:00, 11.78it/s]
batch: 2, cps:2839.7621163237663, elapsed: 41.431992s bleu: 25.531731907295494
100%|█████████████████████████████████████████| 244/244 [00:24<00:00,  9.79it/s]
batch: 4, cps:4720.113977416375, elapsed: 24.926729s bleu: 25.514632321649643
100%|█████████████████████████████████████████| 122/122 [00:13<00:00,  9.14it/s]
batch: 8, cps:8811.200837772447, elapsed: 13.353117s bleu: 25.525217557230196

=================1 inference
100%|█████████████████████████████████████████| 976/976 [00:57<00:00, 16.97it/s]
batch: 1, cps:2045.819471090782, elapsed: 57.510940s bleu: 25.503309091716304
100%|█████████████████████████████████████████| 488/488 [00:41<00:00, 11.77it/s]
batch: 2, cps:2837.588935681065, elapsed: 41.463722s bleu: 25.531731907295494
100%|█████████████████████████████████████████| 244/244 [00:24<00:00,  9.78it/s]
batch: 4, cps:4715.9284737286835, elapsed: 24.948852s bleu: 25.514632321649643
100%|█████████████████████████████████████████| 122/122 [00:13<00:00,  9.13it/s]
batch: 8, cps:8800.519316483293, elapsed: 13.369325s bleu: 25.525217557230196

PAOPAO6 commented 1 year ago

> I don't find a warmup before measuring time.
>
> The warmup is at triton_ft_client.py line 103: print(cli.inference(['สวัสดีครับ', 'ที่ตั้งโรงแรมดี'], 'th', 'zh')). This model supports en, zh, vi, th, ...
>
> This warmup is not helpful. You should:
>
>   1. run more iterations, and
>   2. warm up with inputs of similar length to your real inputs. I suggest you run your evaluation over the dataset twice and only measure the time of the second pass.

The GPU utilization with Triton is relatively low, only about 50%, while with the PyTorch op it is greater than 70%.

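As a side note, one way to log utilization over the whole run instead of reading it off a screenshot (standard nvidia-smi query options; the 1-second interval is arbitrary):

    nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory --format=csv -l 1 > gpu_util.csv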

PAOPAO6 commented 1 year ago

I printed out the detailed log during inference:

  1. The data is in CPU.

  2. http_server.cc:1092] HTTP: unable to provide 'sequence_length' in GPU, will use CPU

Will this affect the speed?

byshiue commented 1 year ago

> I printed out the detailed log during inference:
>
>   1. The data is in CPU.
>
>   2. http_server.cc:1092] HTTP: unable to provide 'sequence_length' in GPU, will use CPU
>
> Will this affect the speed?

"The data is in CPU" is expected behavior, and copying data from GPU to CPU because of HTTP does not add much overhead.

Have you tested the latest main branch of both the PyTorch op and Triton on the 22.10 docker image?

PAOPAO6 commented 1 year ago

> Have you tested the latest main branch of both the PyTorch op and Triton on the 22.10 docker image?

I tried it today. 22.10 is improved compared to 22.08, but it is still slower than the PyTorch op.

22.10:
=================1 inference
100%|█████████████████████████████████████████| 976/976 [00:54<00:00, 17.98it/s]
batch: 1, cps:2167.4590032286687, elapsed: 54.283380s bleu: 25.502901557757863
100%|█████████████████████████████████████████| 488/488 [00:32<00:00, 14.85it/s]
batch: 2, cps:3581.030120456587, elapsed: 32.855630s bleu: 25.531731907295494
100%|█████████████████████████████████████████| 244/244 [00:17<00:00, 13.79it/s]
batch: 4, cps:6650.707768340559, elapsed: 17.690899s bleu: 25.514632321649643
100%|█████████████████████████████████████████| 122/122 [00:10<00:00, 12.15it/s]
batch: 8, cps:11718.952584477438, elapsed: 10.039890s bleu: 25.525217557230196

22.08:
=================1 inference
100%|█████████████████████████████████████████| 976/976 [00:57<00:00, 16.97it/s]
batch: 1, cps:2045.819471090782, elapsed: 57.510940s bleu: 25.503309091716304
100%|█████████████████████████████████████████| 488/488 [00:41<00:00, 11.77it/s]
batch: 2, cps:2837.588935681065, elapsed: 41.463722s bleu: 25.531731907295494
100%|█████████████████████████████████████████| 244/244 [00:24<00:00,  9.78it/s]
batch: 4, cps:4715.9284737286835, elapsed: 24.948852s bleu: 25.514632321649643
100%|█████████████████████████████████████████| 122/122 [00:13<00:00,  9.13it/s]
batch: 8, cps:8800.519316483293, elapsed: 13.369325s bleu: 25.525217557230196

pytorch op:
ft: 100%|█████████████████████████████████████| 976/976 [00:46<00:00, 20.85it/s]
engine: ft, batch: 1, cps:2513.5870329060313, elapsed: 46.808405s, bleu: 25.534268557480004
ft: 100%|█████████████████████████████████████| 488/488 [00:26<00:00, 18.49it/s]
engine: ft, batch: 2, cps:4457.269456353881, elapsed: 26.396654s, bleu: 25.542048594569284
ft: 100%|█████████████████████████████████████| 244/244 [00:14<00:00, 17.09it/s]
engine: ft, batch: 4, cps:8239.068868524504, elapsed: 14.280376s, bleu: 25.502996135441286
ft: 100%|█████████████████████████████████████| 122/122 [00:07<00:00, 15.83it/s]
engine: ft, batch: 8, cps:15264.420564754077, elapsed: 7.707924s, bleu: 25.52744772193587

PAOPAO6 commented 1 year ago

I found that gemm_config.in doesn't seem to take effect in the Triton backend. Is there a problem?

with gemm_config.in:
100%|█████████████████████████████████████████| 976/976 [00:53<00:00, 18.08it/s]
batch: 1, cps:2179.450617536946, elapsed: 53.984706s bleu: 25.527662305065324
100%|█████████████████████████████████████████| 488/488 [00:32<00:00, 15.05it/s]
batch: 2, cps:3629.244101552386, elapsed: 32.419148s bleu: 25.53190962260741
100%|█████████████████████████████████████████| 244/244 [00:17<00:00, 13.74it/s]
batch: 4, cps:6622.822094543999, elapsed: 17.765387s bleu: 25.514632321649643
100%|█████████████████████████████████████████| 122/122 [00:10<00:00, 12.19it/s]
batch: 8, cps:11751.884603630127, elapsed: 10.011756s bleu: 25.525217557230196

without gemm_config.in:
100%|█████████████████████████████████████████| 976/976 [00:54<00:00, 17.87it/s]
batch: 1, cps:2154.4245057765884, elapsed: 54.611800s bleu: 25.502901557757863
100%|█████████████████████████████████████████| 488/488 [00:33<00:00, 14.70it/s]
batch: 2, cps:3544.604114248747, elapsed: 33.193270s bleu: 25.531731907295494
100%|█████████████████████████████████████████| 244/244 [00:17<00:00, 13.62it/s]
batch: 4, cps:6567.3575424495375, elapsed: 17.915425s bleu: 25.514632321649643
100%|█████████████████████████████████████████| 122/122 [00:10<00:00, 12.05it/s]
batch: 8, cps:11618.282080653826, elapsed: 10.126884s bleu: 25.525217557230196

byshiue commented 1 year ago

> I tried it today. 22.10 is improved compared to 22.08, but it is still slower than the PyTorch op.

Can you reproduce the issue with the examples of this repo? Besides, it is strange that the BLEU scores of Triton and PyTorch are different; they are exactly the same in our examples.

As for the gemm test, it only brings a benefit when the default gemm algorithm and the best gemm algorithm perform differently. In most cases, running the gemm test or not does not affect performance much.
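
As a sketch of how the gemm test is usually wired up, under the assumption that the FT backend looks for gemm_config.in in the working directory of the tritonserver process, with gemm arguments matching the serving workload (the specific values are the ones suggested further down in this thread):

    # run the gemm test with shapes that match the real requests, then start
    # the server from the same directory so gemm_config.in is picked up
    cd ${WORKSPACE}
    /workspace/build/fastertransformer_backend/build/bin/t5_gemm 32 1 128 512 8 64 2048 512 8 64 2048 32128 1 1 0
    CUDA_VISIBLE_DEVICES=0 mpirun -n 1 --allow-run-as-root /opt/tritonserver/bin/tritonserver \
        --model-repository=${WORKSPACE}/all_models/t5/ &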

PAOPAO6 commented 1 year ago

> Can you reproduce the issue with the examples of this repo? Besides, it is strange that the BLEU scores of Triton and PyTorch are different; they are exactly the same in our examples.
>
> As for the gemm test, it only brings a benefit when the default gemm algorithm and the best gemm algorithm perform differently. In most cases, running the gemm test or not does not affect performance much.

@byshiue I can reproduce the issue, and if necessary I can provide you with test samples, test code, and the model. Please let me know your email address.

byshiue commented 1 year ago

> @byshiue I can reproduce the issue, and if necessary I can provide you with test samples, test code, and the model. Please let me know your email address.

You can share the reproduction steps using the examples we provide here.

PAOPAO6 commented 1 year ago

> You can share the reproduction steps using the examples we provide here.

OK, I'll try.

PAOPAO6 commented 1 year ago

> You can share the reproduction steps using the examples we provide here.

Which examples do you mean by "the examples we provide here"? I do not quite understand.

byshiue commented 1 year ago

> Which examples do you mean by "the examples we provide here"? I do not quite understand.

You can use translate_example.py of FasterTransformer and t5_end_to_end_test.py of fastertransformer_backend.

PAOPAO6 commented 1 year ago

@byshiue

> You can use translate_example.py of FasterTransformer and t5_end_to_end_test.py of fastertransformer_backend.

The precision is fp16.

Output of translate_example.py of FasterTransformer:

[INFO] MPI is not available in this PyTorch build.
[INFO] MPI is not available in this PyTorch build.
[INFO] MPI is not available in this PyTorch build.
[INFO] MPI is not available in this PyTorch build.
2023-01-10 12:35:33,508 __main__ [INFO] bleu score: 26.21
2023-01-10 12:35:33,508 __main__ [INFO] bleu counts: [36344, 19103, 11208, 6847]
2023-01-10 12:35:33,508 __main__ [INFO] bleu totals: [62577, 59573, 56569, 53565]
2023-01-10 12:35:33,508 __main__ [INFO] bleu precisions: [58.07884686066766, 32.06654021116949, 19.81297176899008, 12.782600578736115]
2023-01-10 12:35:33,508 __main__ [INFO] bleu sys_len: 62577; ref_len: 61287
2023-01-10 12:35:50,571 __main__ [INFO] bleu score: 26.41
2023-01-10 12:35:50,571 __main__ [INFO] bleu counts: [36146, 18978, 11135, 6790]
2023-01-10 12:35:50,571 __main__ [INFO] bleu totals: [61753, 58749, 55745, 52741]
2023-01-10 12:35:50,571 __main__ [INFO] bleu precisions: [58.53318867099574, 32.30352857069908, 19.97488563996771, 12.874234466544054]
2023-01-10 12:35:50,571 __main__ [INFO] bleu sys_len: 61753; ref_len: 61287
2023-01-10 12:39:39,900 __main__ [INFO] bleu score: 17.93
2023-01-10 12:39:39,900 __main__ [INFO] bleu counts: [31162, 13693, 7037, 3761]
2023-01-10 12:39:39,900 __main__ [INFO] bleu totals: [62104, 59100, 56096, 53092]
2023-01-10 12:39:39,900 __main__ [INFO] bleu precisions: [50.17712224655417, 23.169204737732656, 12.544566457501427, 7.083929782264748]
2023-01-10 12:39:39,900 __main__ [INFO] bleu sys_len: 62104; ref_len: 61287
2023-01-10 12:39:55,028 __main__ [INFO] bleu score: 17.64
2023-01-10 12:39:55,028 __main__ [INFO] bleu counts: [30650, 13361, 6897, 3743]
2023-01-10 12:39:55,028 __main__ [INFO] bleu totals: [62098, 59094, 56090, 53086]
2023-01-10 12:39:55,028 __main__ [INFO] bleu precisions: [49.35746722921833, 22.60974041357837, 12.296309502585132, 7.050823192555476]
2023-01-10 12:39:55,028 __main__ [INFO] bleu sys_len: 62098; ref_len: 61287
2023-01-10 12:39:55,031 __main__ [INFO] hf-beamsearch translates 94 batches taking 372.09 sec to translate 101007 tokens, BLEU score: 26.21, 271 tokens/sec. (62577 words, 168 words/sec)
2023-01-10 12:39:55,032 __main__ [INFO] ft-beamsearch translates 94 batches taking 13.74 sec to translate 98938 tokens, BLEU score: 26.41, 7203 tokens/sec. (61753 words, 4496 words/sec)
2023-01-10 12:39:55,032 __main__ [INFO] hf-sampling translates 94 batches taking 199.64 sec to translate 100897 tokens, BLEU score: 17.93, 505 tokens/sec. (62104 words, 311 words/sec)
2023-01-10 12:39:55,032 __main__ [INFO] ft-sampling translates 94 batches taking 12.00 sec to translate 101637 tokens, BLEU score: 17.64, 8473 tokens/sec. (62098 words, 5177 words/sec)

t5_end_to_end_test.py result:

bleu score: 25.36
bleu counts: [35704, 18414, 10664, 6414]
bleu totals: [62034, 59030, 56026, 53022]
bleu precisions: [57.55553406196602, 31.19430797899373, 19.03401991932317, 12.096865452076496]
bleu sys_len: 62034; ref_len: 61287
[INFO] ft_triton translates 94 batches taking 18.65 sec to translate 62034 tokens, BLEU score: 25.36, 3327 tokens/sec.

byshiue commented 1 year ago

> The precision is fp16.
>
> 2023-01-10 12:39:55,032 __main__ [INFO] ft-beamsearch translates 94 batches taking 13.74 sec to translate 98938 tokens, BLEU score: 26.41, 7203 tokens/sec. (61753 words, 4496 words/sec)
>
> [INFO] ft_triton translates 94 batches taking 18.65 sec to translate 62034 tokens, BLEU score: 25.36, 3327 tokens/sec.

Please provide the end-to-end steps to reproduce, including the docker image and the arguments used to run the examples. Thank you. The BLEU scores on Triton and PyTorch are different, which means your two tests are not set up the same way.

PAOPAO6 commented 1 year ago

@byshiue

> Please provide the end-to-end steps to reproduce, including the docker image and the arguments used to run the examples. Thank you. The BLEU scores on Triton and PyTorch are different, which means your two tests are not set up the same way.

OK. Steps for translate_example.py of FasterTransformer:

  1. image: pytorch:21.11-py3
  2. build:
     cmake -DSM=xx -DCMAKE_BUILD_TYPE=Release -DBUILD_TRT=ON -DBUILD_MULTI_GPU=ON ..
     make -j64
     pip3 install -r ../examples/pytorch/t5/requirement.txt
  3. run:
     python3 ../examples/pytorch/t5/translate_example.py \
         --batch_size 32 \
         --beam_width 4 \
         --max_seq_len 128 \
         --data_type fp16 \
         --test_time 0123 \
         --sampling_topk 4 \
         --model t5-small

Steps for t5_end_to_end_test.py of fastertransformer_backend:

  1. image: tritonserver:22.12-py3
  2. convert the checkpoint:
     python3 ../examples/pytorch/t5/utils/huggingface_t5_ckpt_convert.py \
         -saved_dir all_models/t5/ \
         -in_file t5-small/ \
         -inference_tensor_para_size 1 \
         -weight_data_type fp16
  3. run the gemm test and launch the server:
     /workspace/build/fastertransformer_backend/build/bin/t5_gemm 8 4 32 512 8 64 2048 512 8 64 2048 32128 1 1 1
     CUDA_VISIBLE_DEVICES=0 mpirun -n 1 --allow-run-as-root /opt/tritonserver/bin/tritonserver \
         --model-repository=${WORKSPACE}/all_models/t5/ &
  4. run the test client:
     cd /workspace/build/fastertransformer_backend/
     python3 tools/t5_utils/t5_end_to_end_test.py --batch_size 32

config.pbtxt.txt

byshiue commented 1 year ago

Please try both cases on the 22.12 docker image to make sure they use the same version of CUDA. Also, the gemm test you ran for Triton does not match your case. It should be:

./bin/t5_gemm 32 1 128 512 8 64 2048 512 8 64 2048 32128 1 1 0
PAOPAO6 commented 1 year ago
  1. CUDA version:
     nvcc: NVIDIA (R) Cuda compiler driver
     Copyright (c) 2005-2022 NVIDIA Corporation
     Built on Wed_Sep_21_10:33:58_PDT_2022
     Cuda compilation tools, release 11.8, V11.8.89
     Build cuda_11.8.r11.8/compiler.31833905_0

  2. steps:
     /workspace/build/fastertransformer_backend/build/bin/t5_gemm 32 1 128 512 8 64 2048 512 8 64 2048 32128 1 1 0
     CUDA_VISIBLE_DEVICES=0 mpirun -n 1 --allow-run-as-root /opt/tritonserver/bin/tritonserver \
         --model-repository=${WORKSPACE}/all_models/t5/ &
     cd /workspace/build/fastertransformer_backend/
     python3 tools/t5_utils/t5_end_to_end_test.py --batch_size 32

result:

   bleu score:  25.37
   bleu counts: [35703, 18416, 10672, 6419]
   bleu totals: [62018, 59014, 56010, 53006]
   bleu precisions: [57.568770356993134, 31.206154471820245, 19.053740403499376, 12.10994981700185]
   bleu sys_len: 62018; ref_len: 61287

[INFO] ft_triton translates 94 batches taking 18.21 sec to translate 62018 tokens, BLEU score: 25.37, 3406 tokens/sec.

byshiue commented 1 year ago

Please try both cases on the 22.12 docker image to make sure they use the same version of CUDA. Also, please tell us which GPU you use, and try to reproduce the issue on different machines and different GPUs to confirm that it is a common issue.

PAOPAO6 commented 1 year ago

@byshiue Is it possible that this has something to do with the GPU driver version? My driver version is relatively old: 460.73.01.

byshiue commented 1 year ago

> @byshiue Is it possible that this has something to do with the GPU driver version? My driver version is relatively old: 460.73.01.

I am not really sure. The environment, driver, and CUDA version all have some impact, so it is better to reproduce the issue in several environments.

PAOPAO6 commented 1 year ago

> @byshiue Is it possible that this has something to do with the GPU driver version? My driver version is relatively old: 460.73.01.
>
> I am not really sure. The environment, driver, and CUDA version all have some impact, so it is better to reproduce the issue in several environments.

OK, I will test again using the same environment.

PAOPAO6 commented 1 year ago

@byshiue I ran it with GPU driver 470.82 and the speed is normal, consistent with the PyTorch op. Thank you very much.