carter54 opened this issue 3 years ago
@carter54,
I suggest using convert_to_onnx to export the optimized fp16 GPT-2 model directly, like:
python convert_to_onnx.py -m gpt2 --model_class GPT2LMHeadModel --output gpt2.onnx -p fp16 -o --use_gpu
I tried benchmark in a V100 machine (I changed sequence_length=1 to sequence_length=200 in benchmark_gpt2.py in the master branch):
python benchmark_gpt2.py -m gpt2 --model_class GPT2LMHeadModel --test_times 100 -o --use_gpu -p fp16 -b 5 -s 0
The output is like the following:
Arguments:Namespace(batch_sizes=[5], cache_dir='./cache_models', include_copy_output_latency=False, model_class='GPT2LMHeadModel', model_name_or_path='gpt2', onnx_dir='./onnx_models', optimize_onnx=True, past_sequence_lengths=[0], precision=<Precision.FLOAT16: 'fp16'>, result_csv=None, test_times=100, thread_num=-1, torchscript=False, use_gpu=True, validate_onnx=False, verbose=False)
ATen/Parallel:
at::get_num_threads() : 24
at::get_num_interop_threads() : 12
OpenMP 201511 (a.k.a. OpenMP 4.5)
omp_get_max_threads() : 24
Intel(R) Math Kernel Library Version 2020.0.1 Product Build 20200208 for Intel(R) 64 architecture applications
mkl_get_max_threads() : 24
Intel(R) MKL-DNN v1.5.0 (Git Hash e2ac1fac44c5078ca927cb9b90e1b3066a0b2ed0)
std::thread::hardware_concurrency() : 24
Environment variables:
OMP_NUM_THREADS : 16
MKL_NUM_THREADS : [not set]
ATen parallel backend: OpenMP
PyTorch Version:1.6.0
Transformers Version:3.1.0
Onnxruntime Version:1.5.2
Shapes: input_ids=torch.Size([1, 1]) past=torch.Size([2, 1, 12, 1, 64]) output=torch.Size([1, 1, 50257]) present=torch.Size([2, 1, 12, 2, 64])
/bert_ort/tlwu/py36/lib/python3.6/site-packages/transformers/modeling_gpt2.py:558: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert batch_size > 0, "batch_size has to be defined and > 0"
/bert_ort/tlwu/py36/lib/python3.6/site-packages/transformers/modeling_gpt2.py:165: TracerWarning: Converting a tensor to a Python float might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
w = w / (float(v.size(-1)) ** 0.5)
/bert_ort/tlwu/py36/lib/python3.6/site-packages/transformers/modeling_gpt2.py:170: TracerWarning: Converting a tensor to a Python index might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
mask = self.bias[:, :, ns - nd : ns, :ns]
Fused LayerNormalization count: 25
Fused FastGelu count: 12
Removed Reshape and Expand count: 0
Fused Attention(with past) count: 12
Graph pruned: 0 inputs, 0 outputs and 741 nodes are removed
Graph pruned: 0 inputs, 0 outputs and 312 nodes are removed
postprocess: remove Reshape count:48
Fused FastGelu(add bias) count: 12
opset verion: 11
Output model to ./onnx_models/gpt2_past_fp16.onnx
batch_size=5, past_sequence_length=0, torch_latency=64.92, ort_latency=92.17, ort_io_latency=17.29
In my test, the latency with IO Binding is 17ms, and without IO Binding is 92ms.
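The gap between the two numbers is largely spent copying inputs and outputs between host and device on every run (the logits tensor alone is batch 5 × vocab 50257); IO binding keeps those buffers on the GPU. A back-of-envelope estimate from the figures reported above (illustrative arithmetic only, not a new measurement):

```python
# Latencies reported in this thread (batch 5, fp16, V100, past_sequence_length=0).
ort_latency_ms = 92.17     # ONNX Runtime without IO binding: buffers copied each run
ort_io_latency_ms = 17.29  # same model with IO binding: buffers stay on the GPU

# The difference approximates the per-run host<->device copy overhead.
copy_overhead_ms = round(ort_latency_ms - ort_io_latency_ms, 2)
print(copy_overhead_ms)  # ~74.88 ms per run spent on copies
```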
@tianleiwu Thanks for the reply. I can now reproduce results similar to yours with:
python benchmark_gpt2.py -m gpt2 --model_class GPT2LMHeadModel --test_times 100 -o --use_gpu -p fp16 -b 5 -s 0
But the following error appeared when I ran
python benchmark_gpt2.py -m gpt2 --model_class GPT2LMHeadModel --test_times 100 -o --use_gpu -p fp16 -b 5 -s 200
after changing '-s 0' to '-s 200':
Arguments:Namespace(batch_sizes=[5], cache_dir='./cache_models', include_copy_output_latency=False, model_class='GPT2LMHeadModel', model_name_or_path='/home/hr/PycharmProjects/test/', onnx_dir='./onnx_models', optimize_onnx=True, past_sequence_lengths=[200], precision=<Precision.FLOAT16: 'fp16'>, result_csv=None, test_times=100, thread_num=-1, torchscript=False, use_gpu=True, validate_onnx=False, verbose=False)
ATen/Parallel:
at::get_num_threads() : 24
at::get_num_interop_threads() : 12
OpenMP 201511 (a.k.a. OpenMP 4.5)
omp_get_max_threads() : 24
Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
mkl_get_max_threads() : 24
Intel(R) MKL-DNN v1.6.0 (Git Hash 5ef631a030a6f73131c77892041042805a06064f)
std::thread::hardware_concurrency() : 24
Environment variables:
OMP_NUM_THREADS : [not set]
MKL_NUM_THREADS : [not set]
ATen parallel backend: OpenMP
PyTorch Version:1.7.1
Transformers Version:3.1.0
Onnxruntime Version:1.5.2
/home/hr/anaconda3/lib/python3.8/site-packages/transformers/modeling_gpt2.py:710: FutureWarning: The `past` argument is deprecated and will be removed in a future version, use `past_key_values` instead.
warnings.warn(
Shapes: input_ids=torch.Size([1, 1]) past=torch.Size([2, 1, 12, 1, 64]) output=torch.Size([1, 1, 30000]) present=torch.Size([2, 1, 12, 2, 64])
/home/hr/anaconda3/lib/python3.8/site-packages/transformers/modeling_gpt2.py:558: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert batch_size > 0, "batch_size has to be defined and > 0"
/home/hr/anaconda3/lib/python3.8/site-packages/transformers/modeling_gpt2.py:165: TracerWarning: Converting a tensor to a Python float might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
w = w / (float(v.size(-1)) ** 0.5)
/home/hr/anaconda3/lib/python3.8/site-packages/transformers/modeling_gpt2.py:170: TracerWarning: Converting a tensor to a Python index might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
mask = self.bias[:, :, ns - nd : ns, :ns]
Fused LayerNormalization count: 25
Fused FastGelu count: 12
Fused Attention(with past) count: 12
Graph pruned: 0 inputs, 0 outputs and 741 nodes are removed
Graph pruned: 0 inputs, 0 outputs and 312 nodes are removed
postprocess: remove Reshape count:48
Fused FastGelu(add bias) count: 12
opset verion: 11
Output model to ./onnx_models/model_past_fp16.onnx
Exception
Traceback (most recent call last):
File "benchmark_gpt2.py", line 214, in main
ort_io_outputs, ort_io_latency = Gpt2Helper.onnxruntime_inference_with_binded_io(
File "/home/hr/anaconda3/lib/python3.8/site-packages/onnxruntime/transformers/gpt2_helper.py", line 453, in onnxruntime_inference_with_binded_io
io_binding = Gpt2Helper.prepare_io_binding(ort_session, inputs.input_ids, inputs.position_ids,
File "/home/hr/anaconda3/lib/python3.8/site-packages/onnxruntime/transformers/gpt2_helper.py", line 410, in prepare_io_binding
assert position_ids.is_contiguous()
AssertionError
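The assertion fails because position_ids reaches prepare_io_binding as a non-contiguous tensor (e.g. produced by slicing or a transposed view), and IO binding hands ONNX Runtime a raw data pointer that is only valid for contiguous memory. As an illustration of what the check means, here is a minimal stdlib sketch of C-contiguity (row-major strides), not the actual PyTorch implementation:

```python
def is_c_contiguous(shape, strides):
    """True if `strides` are the row-major strides implied by `shape`."""
    expected = []
    acc = 1
    for dim in reversed(shape):
        expected.append(acc)
        acc *= dim
    return tuple(strides) == tuple(reversed(expected))

# A (5, 201) buffer laid out row-major is contiguous...
assert is_c_contiguous((5, 201), (201, 1))
# ...but the same buffer viewed transposed is not, even though no data moved.
assert not is_c_contiguous((201, 5), (1, 201))
```

A safer workaround than deleting the assert may be to make the tensor contiguous before binding, e.g. `position_ids = position_ids.contiguous()` in PyTorch, which copies the data into a fresh row-major buffer.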
If I remove line 410 in gpt2_helper.py
assert position_ids.is_contiguous()
this script works fine, and I get the following result:
Arguments:Namespace(batch_sizes=[5], cache_dir='./cache_models', include_copy_output_latency=False, model_class='GPT2LMHeadModel', model_name_or_path='/home/hr/PycharmProjects/test/', onnx_dir='./onnx_models', optimize_onnx=True, past_sequence_lengths=[200], precision=<Precision.FLOAT16: 'fp16'>, result_csv=None, test_times=100, thread_num=-1, torchscript=False, use_gpu=True, validate_onnx=False, verbose=False)
ATen/Parallel:
at::get_num_threads() : 24
at::get_num_interop_threads() : 12
OpenMP 201511 (a.k.a. OpenMP 4.5)
omp_get_max_threads() : 24
Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
mkl_get_max_threads() : 24
Intel(R) MKL-DNN v1.6.0 (Git Hash 5ef631a030a6f73131c77892041042805a06064f)
std::thread::hardware_concurrency() : 24
Environment variables:
OMP_NUM_THREADS : [not set]
MKL_NUM_THREADS : [not set]
ATen parallel backend: OpenMP
PyTorch Version:1.7.1
Transformers Version:3.1.0
Onnxruntime Version:1.5.2
/home/hr/anaconda3/lib/python3.8/site-packages/transformers/modeling_gpt2.py:710: FutureWarning: The `past` argument is deprecated and will be removed in a future version, use `past_key_values` instead.
warnings.warn(
Shapes: input_ids=torch.Size([1, 1]) past=torch.Size([2, 1, 12, 1, 64]) output=torch.Size([1, 1, 30000]) present=torch.Size([2, 1, 12, 2, 64])
/home/hr/anaconda3/lib/python3.8/site-packages/transformers/modeling_gpt2.py:558: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert batch_size > 0, "batch_size has to be defined and > 0"
/home/hr/anaconda3/lib/python3.8/site-packages/transformers/modeling_gpt2.py:165: TracerWarning: Converting a tensor to a Python float might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
w = w / (float(v.size(-1)) ** 0.5)
/home/hr/anaconda3/lib/python3.8/site-packages/transformers/modeling_gpt2.py:170: TracerWarning: Converting a tensor to a Python index might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
mask = self.bias[:, :, ns - nd : ns, :ns]
Fused LayerNormalization count: 25
Fused FastGelu count: 12
Fused Attention(with past) count: 12
Graph pruned: 0 inputs, 0 outputs and 741 nodes are removed
Graph pruned: 0 inputs, 0 outputs and 312 nodes are removed
postprocess: remove Reshape count:48
Fused FastGelu(add bias) count: 12
opset verion: 11
Output model to ./onnx_models/test_past_fp16.onnx
batch_size=5, past_sequence_length=200, torch_latency=25.55, ort_latency=33.30, ort_io_latency=4.46
Results are saved to file benchmark_result_20210108-150937.csv
Describe the bug: When I load a GPT-2 model with onnxruntime-gpu, a lot of warnings appear. They show that some nodes will be calculated on the CPU. Is this expected, or did I make a mistake when converting the GPT-2 model?
Urgency: not specified.
System information
To Reproduce: I followed these steps to convert the GPT-2 model from transformers (PyTorch version) to ONNX:
Inference speed: I set input sequence length to 200, batch size to 5, and output token number to 1, with no concurrency and no IO binding. No past state is used, which means the inference time includes computing the initial state for the 200 input tokens and predicting the next token. The average latency is about 300 ms over 100 test runs, longer than I expected...
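The ~300 ms covers a full pass over the 200-token prompt. Once the past state is available, each additional token is far cheaper (4.46 ms with IO binding in the run above), so total generation cost is roughly one prompt pass plus one cached step per extra token. A back-of-envelope sketch using the numbers reported in this thread (illustrative, not measured):

```python
# Figures taken from this thread (batch 5, fp16); purely illustrative.
prompt_pass_ms = 300.0  # first call: no past state, 200 input tokens
per_token_ms = 4.46     # later calls: past state reused, IO binding enabled

def generation_latency_ms(new_tokens: int) -> float:
    """Rough total latency: one full prompt pass, then one cached step per extra token."""
    return prompt_pass_ms + (new_tokens - 1) * per_token_ms

# e.g. generating 50 tokens ~ 300 + 49 * 4.46 ms
assert round(generation_latency_ms(50), 2) == 518.54
```

This is why reusing past_key_values (with IO binding) matters much more for generation throughput than shaving the one-off prompt pass.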