Closed PAOPAO6 closed 1 year ago
Please share the reproduction steps.
You can also use nsys to make sure you don't spend too much time on data copy.
The reproduction steps:
git clone https://github.com/triton-inference-server/fastertransformer_backend.git
cd fastertransformer_backend && git checkout -b t5_gptj_blog remotes/origin/dev/t5_gptj_blog
docker build --rm --build-arg TRITON_VERSION=22.08 -t triton_with_ft:22.08 -f docker/Dockerfile .
docker run -e NVIDIA_VISIBLE_DEVICES=0 --name triton_ft --shm-size=4G --entrypoint "bash" -p 5722:22 -p 5780:8080 -v /data/:/data/ -itd triton_with_ft:22.08
docker exec -it triton_ft bash
cd /data/mt/hbl/models/average1050000  # model path
CUDA_VISIBLE_DEVICES=0 mpirun -n 1 --allow-run-as-root /opt/tritonserver/bin/tritonserver --model-repository=/data/mt/hbl/models/average1050000/all_models/t5 --http-port 8080
python3 client.py
client.py
from hbl.triton_client_mt import client_init, MT5Req, pad_multi_lines, partition
import gevent.ssl
import numpy as np
import tritonclient.http as httpclient
import sentencepiece as sp
from hbl.utils import elapsed_timer, read_paral_data
from tqdm import tqdm
import sys
sys.path.append('/opt/app')
from mttool.metric import metric

lang2short = {'English': 'en','Korean': 'ko', 'Japanese': 'ja', 'Chinese': 'zh', 'Bengali': 'bn', 'Filipino': 'fil', 'Hindi': 'hi', 'Indonesian': 'id', 'Lao': 'lo', 'Malay': 'ms', 'Thai': 'th', 'Urdu': 'ur', 'Vietnamese': 'vi', 'French': 'fr', 'Spanish': 'es', 'Italian': 'it', 'German': 'de'}

bleu_dict = {}
for val in lang2short.values():
    bleu_dict[val] = metric.Bleu(val)

def bleu_(preds, labels, lang):
    total = 0.0
    d = {}
    for p, g in zip(preds, labels):
        try:
            if lang not in d:
                d[lang] = ([p], [g])
            else:
                d[lang][0].append(p)
                d[lang][1].append(g)
        except:
            print("===============================")
    for lang in d:
        total += bleu_dict[lang].multi_evaluate_with_preprocess(d[lang][0], d[lang][1])
    bleu = total / len(d)
    return bleu
class Mt5Ft(MT5Req):
    def __init__(self, host):
        super().__init__(host)

    def preprocess(self, inputs, model, fr, to):
        lang1, lang2 = self.lang_dict[fr], self.lang_dict[to]
        prefix = "translate {} to {}: ".format(lang1, lang2)
        arr = [prefix + text for text in inputs]
        ids = [self.spms[model].encode(text) + [self.eos] for text in arr]
        return ids

    def infer(self, model, ids,
              inputs='input_ids', sequence_length='sequence_length', max_output_len='max_output_len',
              output0='output_ids', output1='sequence_length',
              request_compression_algorithm=None,
              response_compression_algorithm=None):
        input_arr = []
        outputs = []
        bz = len(ids)
        seq_lens = [[len(arr)] for arr in ids]
        ids = pad_multi_lines(ids, self.pad)
        sl = len(ids[0])
        max_len = min(sl * 1.6 + 8, 256)
        outputs.append(httpclient.InferRequestedOutput(output0, binary_data=True))
        outputs.append(httpclient.InferRequestedOutput(output1, binary_data=True))
        input_arr.append(httpclient.InferInput("input_ids", [bz, sl], "UINT32"))
        input_arr.append(httpclient.InferInput("sequence_length", [bz, 1], "UINT32"))
        input_arr.append(httpclient.InferInput("max_output_len", [bz, 1], "UINT32"))
        input_arr[0].set_data_from_numpy(np.array(ids).astype(np.uint32), binary_data=False)
        # input_arr[0].set_data_from_numpy(np.array([[1, 1, 1, 1]]).astype(np.uint32), binary_data=False)
        input_arr[1].set_data_from_numpy(np.array(seq_lens).astype(np.uint32), binary_data=False)
        input_arr[2].set_data_from_numpy(np.array([[max_len]] * len(seq_lens)).astype(np.uint32), binary_data=False)
        results = client.infer(
            model_name=model,
            inputs=input_arr,
            outputs=outputs,
            # query_params=query_params,
            request_compression_algorithm=False,
            response_compression_algorithm=False, timeout=1000)
        output_ids = results.as_numpy(output0)
        decoder_lens = results.as_numpy(output1)
        return output_ids
if __name__ == '__main__':
    client = client_init("10.5.210.91:5780")
    cli = Mt5Ft(client)
    print(cli.inference(['สวัสดีครับ', 'ที่ตั้งโรงแรมดี'], 'th', 'zh'))
    dataset = read_paral_data('/data/bak/evaluation/bn_en_fil_hi_id_lo_ms_th_ur_vi_zh/eval/data/TestEval/TripAdvisor/comment/en_vi/origin',
                              'ta_comment', ['en', 'vi'])
    fr, to = 'vi', 'en'
    char_count = 0
    for line in dataset['vi']:
        char_count += len(line)
    for batch in [1, 2, 4, 8]:
        if batch == 1:
            data = [[text] for text in dataset[fr]]
        else:
            data = partition(dataset[fr], batch)
        rets = []
        with elapsed_timer() as elapsed:
            for par in tqdm(data):
                rets += cli.inference(par, 'vi', 'en')
        elap = elapsed()
        score = bleu_(rets, dataset[to], to)
        print('batch: {}, cps:{}, elapsed: {}s bleu: {}'.format(batch, char_count / elap,
                                                                '%.6f' % elap, score))
Sorry, the client.py is not clear, can you organize it again?
Also, please provide the scripts to run pytorch and config.pbtxt for triton.
How do I test the triton service with nsys?
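One possible approach (a sketch, not from this thread): assuming Nsight Systems is available inside the container, launch the server under nsys, run client.py once, stop the server, and open the generated report in the Nsight Systems GUI to see how much time is spent in H2D/D2H memcpy. The output path here is arbitrary:
CUDA_VISIBLE_DEVICES=0 mpirun -n 1 --allow-run-as-root nsys profile -o /data/triton_ft_profile -t cuda,nvtx,osrt --force-overwrite true /opt/tritonserver/bin/tritonserver --model-repository=/data/mt/hbl/models/average1050000/all_models/t5 --http-port 8080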
ok, let me reorganize
I put the process and code on GitHub: https://github.com/PAOPAO6/mt5_src. Thank you very much.
Use branch main instead of t5_gptj_blog. There seems to be a problem with mt5 in t5_gptj_blog.
Did you test both cases on the 22.10 docker image?
I don't see a warmup before measuring time.
The warmup is at triton_ft_client.py line 103: print(cli.inference(['สวัสดีครับ', 'ที่ตั้งโรงแรมดี'], 'th', 'zh')). This model supports en, zh, vi, th, ...
OK, I'll try.
This warmup is not helpful. You should:
- run more iterations
- run inputs of a similar length to your real inputs. I suggest you run your evaluation on the dataset two times, and only measure the time of the second run.
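A minimal sketch of the suggested measurement scheme, reusing names from the client code above (cli, dataset, fr, to, partition, elapsed_timer); the two-pass loop itself is illustrative and not code from this thread:
for batch in [1, 2, 4, 8]:
    data = partition(dataset[fr], batch)
    for run in range(2):  # pass 0 warms up the server; only pass 1 is reported
        rets = []
        with elapsed_timer() as elapsed:
            for par in data:
                rets += cli.inference(par, fr, to)
        elap = elapsed()
        if run == 1:
            print('batch: {}, elapsed: {:.6f}s'.format(batch, elap))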
I ran it two times:
=================0 inference
976/976 [00:56<00:00, 17.14it/s]  batch: 1, cps: 2066.4064948907007, elapsed: 56.937974s, bleu: 25.503309091716304
488/488 [00:41<00:00, 11.78it/s]  batch: 2, cps: 2839.7621163237663, elapsed: 41.431992s, bleu: 25.531731907295494
244/244 [00:24<00:00, 9.79it/s]   batch: 4, cps: 4720.113977416375, elapsed: 24.926729s, bleu: 25.514632321649643
122/122 [00:13<00:00, 9.14it/s]   batch: 8, cps: 8811.200837772447, elapsed: 13.353117s, bleu: 25.525217557230196
=================1 inference
976/976 [00:57<00:00, 16.97it/s]  batch: 1, cps: 2045.819471090782, elapsed: 57.510940s, bleu: 25.503309091716304
488/488 [00:41<00:00, 11.77it/s]  batch: 2, cps: 2837.588935681065, elapsed: 41.463722s, bleu: 25.531731907295494
244/244 [00:24<00:00, 9.78it/s]   batch: 4, cps: 4715.9284737286835, elapsed: 24.948852s, bleu: 25.514632321649643
122/122 [00:13<00:00, 9.13it/s]   batch: 8, cps: 8800.519316483293, elapsed: 13.369325s, bleu: 25.525217557230196
The GPU utilization with triton is relatively low, only about 50%, while with pytorch it is greater than 70%.
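For reference, a standard way to watch utilization while the benchmark runs (not a command from this thread) is:
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1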
I printed out the detailed log during inference:
the data is in CPU
http_server.cc:1092] HTTP: unable to provide 'sequence_length' in GPU, will use CPU
Will it affect the speed?
As shown below:
"The data is in CPU" is expected behavior. And copying data from GPU to CPU due to HTTP does not bring too many overhead.
Have you tested the latest main branch of both pytorch and triton on the 22.10 docker image?
I tried it today. 22.10 has improved compared to 22.08, but it is still slower than pytorch.
22.10:
=================1 inference
976/976 [00:54<00:00, 17.98it/s]  batch: 1, cps: 2167.4590032286687, elapsed: 54.283380s, bleu: 25.502901557757863
488/488 [00:32<00:00, 14.85it/s]  batch: 2, cps: 3581.030120456587, elapsed: 32.855630s, bleu: 25.531731907295494
244/244 [00:17<00:00, 13.79it/s]  batch: 4, cps: 6650.707768340559, elapsed: 17.690899s, bleu: 25.514632321649643
122/122 [00:10<00:00, 12.15it/s]  batch: 8, cps: 11718.952584477438, elapsed: 10.039890s, bleu: 25.525217557230196
22.08:
=================1 inference
976/976 [00:57<00:00, 16.97it/s]  batch: 1, cps: 2045.819471090782, elapsed: 57.510940s, bleu: 25.503309091716304
488/488 [00:41<00:00, 11.77it/s]  batch: 2, cps: 2837.588935681065, elapsed: 41.463722s, bleu: 25.531731907295494
244/244 [00:24<00:00, 9.78it/s]   batch: 4, cps: 4715.9284737286835, elapsed: 24.948852s, bleu: 25.514632321649643
122/122 [00:13<00:00, 9.13it/s]   batch: 8, cps: 8800.519316483293, elapsed: 13.369325s, bleu: 25.525217557230196
pytorch op:
ft: 976/976 [00:46<00:00, 20.85it/s]  engine: ft, batch: 1, cps: 2513.5870329060313, elapsed: 46.808405s, bleu: 25.534268557480004
ft: 488/488 [00:26<00:00, 18.49it/s]  engine: ft, batch: 2, cps: 4457.269456353881, elapsed: 26.396654s, bleu: 25.542048594569284
ft: 244/244 [00:14<00:00, 17.09it/s]  engine: ft, batch: 4, cps: 8239.068868524504, elapsed: 14.280376s, bleu: 25.502996135441286
ft: 122/122 [00:07<00:00, 15.83it/s]  engine: ft, batch: 8, cps: 15264.420564754077, elapsed: 7.707924s, bleu: 25.52744772193587
I found that gemm_config.in didn't take effect in the triton backend. Is there a problem?
with gemm_config.in:
976/976 [00:53<00:00, 18.08it/s]  batch: 1, cps: 2179.450617536946, elapsed: 53.984706s, bleu: 25.527662305065324
488/488 [00:32<00:00, 15.05it/s]  batch: 2, cps: 3629.244101552386, elapsed: 32.419148s, bleu: 25.53190962260741
244/244 [00:17<00:00, 13.74it/s]  batch: 4, cps: 6622.822094543999, elapsed: 17.765387s, bleu: 25.514632321649643
122/122 [00:10<00:00, 12.19it/s]  batch: 8, cps: 11751.884603630127, elapsed: 10.011756s, bleu: 25.525217557230196
without gemm_config.in:
976/976 [00:54<00:00, 17.87it/s]  batch: 1, cps: 2154.4245057765884, elapsed: 54.611800s, bleu: 25.502901557757863
488/488 [00:33<00:00, 14.70it/s]  batch: 2, cps: 3544.604114248747, elapsed: 33.193270s, bleu: 25.531731907295494
244/244 [00:17<00:00, 13.62it/s]  batch: 4, cps: 6567.3575424495375, elapsed: 17.915425s, bleu: 25.514632321649643
122/122 [00:10<00:00, 12.05it/s]  batch: 8, cps: 11618.282080653826, elapsed: 10.126884s, bleu: 25.525217557230196
Can you reproduce the issue with the examples in the repo? Besides, it is strange that the scores of triton and pytorch are different. They are exactly the same in our examples.
For the gemm test, it only brings a benefit when the performance of the default gemm algo and the best gemm algo are different. In most cases, using the gemm test or not may not affect the performance a lot.
@byshiue I can reproduce the issue. If necessary, I can provide you with test samples, test code and the model; please let me know your email address.
You can share the reproduction steps using the examples we provide here.
OK, I'll try.
[the examples we provide here] Which examples are referred to? I do not quite understand.
You can use translate_example.py of FasterTransformer, and t5_end_to_end_test.py of fastertransformer_backend.
@byshiue
The precision is fp16
Output of translate_example.py of FasterTransformer:
[INFO] MPI is not available in this PyTorch build.
[INFO] MPI is not available in this PyTorch build.
[INFO] MPI is not available in this PyTorch build.
[INFO] MPI is not available in this PyTorch build.
2023-01-10 12:35:33,508 main [INFO] bleu score: 26.21
2023-01-10 12:35:33,508 main [INFO] bleu counts: [36344, 19103, 11208, 6847]
2023-01-10 12:35:33,508 main [INFO] bleu totals: [62577, 59573, 56569, 53565]
2023-01-10 12:35:33,508 main [INFO] bleu precisions: [58.07884686066766, 32.06654021116949, 19.81297176899008, 12.782600578736115]
2023-01-10 12:35:33,508 main [INFO] bleu sys_len: 62577; ref_len: 61287
2023-01-10 12:35:50,571 main [INFO] bleu score: 26.41
2023-01-10 12:35:50,571 main [INFO] bleu counts: [36146, 18978, 11135, 6790]
2023-01-10 12:35:50,571 main [INFO] bleu totals: [61753, 58749, 55745, 52741]
2023-01-10 12:35:50,571 main [INFO] bleu precisions: [58.53318867099574, 32.30352857069908, 19.97488563996771, 12.874234466544054]
2023-01-10 12:35:50,571 main [INFO] bleu sys_len: 61753; ref_len: 61287
2023-01-10 12:39:39,900 main [INFO] bleu score: 17.93
2023-01-10 12:39:39,900 main [INFO] bleu counts: [31162, 13693, 7037, 3761]
2023-01-10 12:39:39,900 main [INFO] bleu totals: [62104, 59100, 56096, 53092]
2023-01-10 12:39:39,900 main [INFO] bleu precisions: [50.17712224655417, 23.169204737732656, 12.544566457501427, 7.083929782264748]
2023-01-10 12:39:39,900 main [INFO] bleu sys_len: 62104; ref_len: 61287
2023-01-10 12:39:55,028 main [INFO] bleu score: 17.64
2023-01-10 12:39:55,028 main [INFO] bleu counts: [30650, 13361, 6897, 3743]
2023-01-10 12:39:55,028 main [INFO] bleu totals: [62098, 59094, 56090, 53086]
2023-01-10 12:39:55,028 main [INFO] bleu precisions: [49.35746722921833, 22.60974041357837, 12.296309502585132, 7.050823192555476]
2023-01-10 12:39:55,028 main [INFO] bleu sys_len: 62098; ref_len: 61287
2023-01-10 12:39:55,031 main [INFO] hf-beamsearch translates 94 batches taking 372.09 sec to translate 101007 tokens, BLEU score: 26.21, 271 tokens/sec. (62577 words, 168 words/sec)
2023-01-10 12:39:55,032 main [INFO] ft-beamsearch translates 94 batches taking 13.74 sec to translate 98938 tokens, BLEU score: 26.41, 7203 tokens/sec. (61753 words, 4496 words/sec)
2023-01-10 12:39:55,032 main [INFO] hf-sampling translates 94 batches taking 199.64 sec to translate 100897 tokens, BLEU score: 17.93, 505 tokens/sec. (62104 words, 311 words/sec)
2023-01-10 12:39:55,032 main [INFO] ft-sampling translates 94 batches taking 12.00 sec to translate 101637 tokens, BLEU score: 17.64, 8473 tokens/sec. (62098 words, 5177 words/sec)
t5_end_to_end_test.py result:
bleu score: 25.36
bleu counts: [35704, 18414, 10664, 6414]
bleu totals: [62034, 59030, 56026, 53022]
bleu precisions: [57.55553406196602, 31.19430797899373, 19.03401991932317, 12.096865452076496]
bleu sys_len: 62034; ref_len: 61287
[INFO] ft_triton translates 94 batches taking 18.65 sec to translate 62034 tokens, BLEU score: 25.36, 3327 tokens/sec.
Please provide the end-to-end reproduction steps, including the docker image and the arguments to run the examples. Thank you. The BLEU scores on triton and pytorch are different. That means your tests are not on the same page.
@byshiue
ok ok, translate_example.py of FasterTransformer steps:
t5_end_to_end_test.py result:
cd /workspace/build/fastertransformer_backend/
python3 tools/t5_utils/t5_end_to_end_test.py --batch_size 32
Please try both cases on the 22.12 docker image to make sure they use the same version of CUDA. And the gemm test you ran in triton does not match your case. It should be:
./bin/t5_gemm 32 1 128 512 8 64 2048 512 8 64 2048 32128 1 1 0
cuda version:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
step:
/workspace/build/fastertransformer_backend/build/bin/t5_gemm 32 1 128 512 8 64 2048 512 8 64 2048 32128 1 1 0
CUDA_VISIBLE_DEVICES=0 mpirun -n 1 --allow-run-as-root /opt/tritonserver/bin/tritonserver \
  --model-repository=${WORKSPACE}/all_models/t5/ &
cd /workspace/build/fastertransformer_backend/
python3 tools/t5_utils/t5_end_to_end_test.py --batch_size 32
result:
bleu score: 25.37
bleu counts: [35703, 18416, 10672, 6419]
bleu totals: [62018, 59014, 56010, 53006]
bleu precisions: [57.568770356993134, 31.206154471820245, 19.053740403499376, 12.10994981700185]
bleu sys_len: 62018; ref_len: 61287
[INFO] ft_triton translates 94 batches taking 18.21 sec to translate 62018 tokens, BLEU score: 25.37, 3406 tokens/sec.
Please try both cases on the 22.12 docker image to make sure they use the same version of CUDA. Also, please provide the GPU you use, and try to reproduce your issue on different machines and different GPUs to confirm that it is a common issue.
@byshiue Is it possible that it has something to do with the GPU driver version? My driver version is relatively low: 460.73.01.
I am not really sure. Environment, driver and CUDA version all have some impact. So, it is better to reproduce on several environments.
ok I will test again using the same environment
@byshiue I ran it with GPU driver 470.82. The speed is normal and consistent with pytorch. Thank you very much.