huggingface / optimum

🚀 Accelerate training and inference of 🤗 Transformers and 🤗 Diffusers with easy to use hardware optimization tools
https://huggingface.co/docs/optimum/main/
Apache License 2.0

ORTModelForVision2Seq runs much slower with ONNX Runtime than the PyTorch version #1337

Open aftnix opened 11 months ago

aftnix commented 11 months ago

System Info

- `optimum` version: 1.12.0
- `transformers` version: 4.32.1
- Platform: Linux-4.14.318-167.530.amzn1.x86_64-x86_64-with-glibc2.10
- Python version: 3.8.13
- Huggingface_hub version: 0.16.4
- PyTorch version (GPU?): 1.13.0a0+d321be6 (cuda available: True)

Who can help?

@JingyaHuang @echarlaix

Information

Tasks

Reproduction (minimal, reproducible, runnable)

from transformers import AutoTokenizer
import torch
import re
from transformers import DonutProcessor, VisionEncoderDecoderModel
from datasets import load_dataset
import huggingface_hub
import time
from optimum.onnxruntime import ORTModelForVision2Seq
from PIL import Image
import onnxruntime as onnxrt

#model_info = huggingface_hub.model_info("naver-clova-ix/donut-base-finetuned-cord-v2")

hf_model = VisionEncoderDecoderModel.from_pretrained(
    "eng_donut"
)

hf_model.to('cuda')
onnx_dir = "eng_donut_onnx"
tokenizer = AutoTokenizer.from_pretrained(onnx_dir)
session_options = onnxrt.SessionOptions()
session_options.log_severity_level = 1
session_options.enable_profiling = True
session_options.graph_optimization_level = onnxrt.GraphOptimizationLevel.ORT_ENABLE_EXTENDED
onnx_model = ORTModelForVision2Seq.from_pretrained(
    onnx_dir,
    session_options=session_options,  # must be passed as a keyword argument
    use_io_binding=False,
    from_transformers=False,
    provider="CUDAExecutionProvider",
    provider_options={"cudnn_conv_algo_search": "DEFAULT"},
)
onnx_model.to('cuda')
onnx_model.name_or_path = onnx_dir
processor = DonutProcessor.from_pretrained(onnx_dir)

dataset = load_dataset("hf-internal-testing/example-documents", split="test")
#image = dataset[2]["image"]
image = Image.open('0114.jpg')
task_prompt = "<s_cord-v2>"
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

pixel_values = processor(image, return_tensors="pt").pixel_values

start = time.time()
outputs = hf_model.generate(
    pixel_values.to("cuda"),
    decoder_input_ids=decoder_input_ids.to("cuda"),
    max_length=hf_model.decoder.config.max_position_embeddings,
    early_stopping=True,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
    use_cache=True,
    num_beams=1,
    bad_words_ids=[[processor.tokenizer.unk_token_id]],
    return_dict_in_generate=True,
)
print(f'Time HF: {time.time() - start}')
hf_sequence = processor.batch_decode(outputs.sequences)[0]
hf_sequence = hf_sequence.replace(processor.tokenizer.eos_token, "").replace(
    processor.tokenizer.pad_token, ""
)
hf_sequence = re.sub(
    r"<.*?>", "", hf_sequence, count=1
).strip()  # remove first task start token
print(processor.token2json(hf_sequence))

start = time.time()
generated_ids = onnx_model.generate(
    pixel_values.to("cuda"),
    decoder_input_ids=decoder_input_ids.to("cuda"),
    max_length=hf_model.decoder.config.max_position_embeddings,
    early_stopping=True,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
    use_cache=True,
    num_beams=2,
    bad_words_ids=[[processor.tokenizer.unk_token_id]],
    return_dict_in_generate=True,
)
print(f'Time ONNX: {time.time() - start}')
onnx_sequence = processor.batch_decode(generated_ids.sequences)[0]
onnx_sequence = onnx_sequence.replace(processor.tokenizer.eos_token, "").replace(
    processor.tokenizer.pad_token, ""
)
onnx_sequence = re.sub(
    r"<.*?>", "", onnx_sequence, count=1
).strip()  # remove first task start token
print(processor.token2json(onnx_sequence))
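
For context, the script above assumes that eng_donut_onnx already contains an exported model. A minimal sketch of one way such a directory can be produced with Optimum (assuming the eng_donut checkpoint also holds the processor files; not necessarily how it was exported here):

from optimum.onnxruntime import ORTModelForVision2Seq
from transformers import DonutProcessor

# Export the fine-tuned PyTorch checkpoint to ONNX and save it next to the processor files.
model = ORTModelForVision2Seq.from_pretrained("eng_donut", export=True)
processor = DonutProcessor.from_pretrained("eng_donut")
model.save_pretrained("eng_donut_onnx")
processor.save_pretrained("eng_donut_onnx")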

Expected behavior

Time HF: 1.4525291919708252
Time ONNX: 15.614277839660645

I tried using use_io_binding=True, but it resulted in the following error:

Traceback (most recent call last):
  File "donut_onnx_test2.py", line 71, in <module>
    generated_ids = onnx_model.generate(
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/generation/utils.py", line 1486, in generate
    model_kwargs = self._prepare_encoder_decoder_kwargs_for_generation(
  File "/opt/conda/lib/python3.8/site-packages/transformers/generation/utils.py", line 655, in _prepare_encoder_decoder_kwargs_for_generation
    model_kwargs["encoder_outputs"]: ModelOutput = encoder(**encoder_kwargs)
  File "/opt/conda/lib/python3.8/site-packages/optimum/onnxruntime/base.py", line 68, in __call__
    return self.forward(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/optimum/onnxruntime/modeling_seq2seq.py", line 364, in forward
    io_binding, output_shapes, output_buffers = self.parent_model._prepare_io_binding(
  File "/opt/conda/lib/python3.8/site-packages/optimum/onnxruntime/modeling_ort.py", line 798, in _prepare_io_binding
    output_buffer = self._prepare_output_buffer(model, output_shape, output_name)
  File "/opt/conda/lib/python3.8/site-packages/optimum/onnxruntime/modeling_ort.py", line 673, in _prepare_output_buffer
    output_buffer = torch.empty(np.prod(output_shape), dtype=torch_type, device=self.device).contiguous()
  File "<__array_function__ internals>", line 180, in prod
  File "/opt/conda/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 3088, in prod
    return _wrapreduction(a, np.multiply, 'prod', axis, dtype, out,
  File "/opt/conda/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 86, in _wrapreduction
    return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
numpy.core._exceptions.UFuncTypeError: ufunc 'multiply' did not contain a loop with signature matching types (dtype('<U23'), dtype('<U23')) -> None
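
The UFuncTypeError suggests that the output shape handed to np.prod still contains symbolic (string) dimension names rather than concrete integers. A minimal sketch reproducing the same numpy failure, with a made-up dimension name:

import numpy as np

# A dynamic axis that was never resolved to an integer stays a string, so the whole
# shape array gets a string dtype and multiply.reduce has no matching loop.
output_shape = ["encoder_sequence_length", 1, 768]  # hypothetical shape
np.prod(output_shape)  # UFuncTypeError: ufunc 'multiply' did not contain a loop ...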

Also, two recurring warnings show up in the log:

2023-09-03 12:26:27.769412851 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer 'onnx::ConstantOfShape_533'. It is not used by any node and should be removed from the model.
2023-09-03 12:26:27.769473537 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer 'onnx::Mul_535'. It is not used by any node and should be removed from the model.
2023-09-03 12:26:27.769796840 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer 'onnx::ConstantOfShape_566'. It is not used by any node and should be removed from the model.
2023-09-03 12:26:27.769833388 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer 'onnx::ConstantOfShape_541'. It is not used by any node and should be removed from the model.
2023-09-03 12:26:27.769855381 [W:onnxruntime:, graph.cc:3543 CleanUnusedInitializersAndNodeArgs] Removing initializer 'onnx::Mul_543'. It is not used by any node and should be removed from the model.
........

And

2023-09-03 12:26:37.454112714 [W:onnxruntime:, session_state.cc:1169 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
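
For reference, the per-node placement lines quoted in the next comment (VerifyEachNodeIsAssignedToAnEp) only appear at verbose log level; a minimal sketch of how to enable them, reusing the directory from the reproduction above:

import onnxruntime as onnxrt
from optimum.onnxruntime import ORTModelForVision2Seq

session_options = onnxrt.SessionOptions()
session_options.log_severity_level = 0  # 0 = verbose; prints node-to-execution-provider assignments

onnx_model = ORTModelForVision2Seq.from_pretrained(
    "eng_donut_onnx",
    session_options=session_options,
    provider="CUDAExecutionProvider",
)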

aftnix commented 11 months ago

Tagging @mht-sharma

My onnxruntime logs show that the nodes are placed on the CUDA execution provider, but we are still getting abysmal performance.

root@940c820c2670:/workspace# cat trace.log | grep CUDA
2023-09-04 08:40:50.219068761 [V:onnxruntime:, session_state.cc:1155 VerifyEachNodeIsAssignedToAnEp]  Node(s) placed on [CUDAExecutionProvider]. Number of nodes: 3921
2023-09-04 08:40:53.954745246 [V:onnxruntime:, session_state.cc:1155 VerifyEachNodeIsAssignedToAnEp]  Node(s) placed on [CUDAExecutionProvider]. Number of nodes: 516
2023-09-04 08:41:00.121448966 [V:onnxruntime:, session_state.cc:1155 VerifyEachNodeIsAssignedToAnEp]  Node(s) placed on [CUDAExecutionProvider]. Number of nodes: 3921
2023-09-04 08:41:03.378257766 [V:onnxruntime:, session_state.cc:1155 VerifyEachNodeIsAssignedToAnEp]  Node(s) placed on [CUDAExecutionProvider]. Number of nodes: 516
fxmarty commented 11 months ago

@aftnix Thank you for the report. IO Binding support will come with https://github.com/huggingface/optimum/pull/1209.

I cannot reproduce the 11x slowdown you see. In my test, our ORT implementation is indeed slower than PyTorch with the CUDA EP (but faster with the CPU EP).

There are several issues in your script, notably:

- the PyTorch run uses num_beams=1 while the ONNX run uses num_beams=2, so the two timings are not comparable;
- there is no warmup before timing, so the first call includes session and kernel initialization;
- max_length is set to the decoder's max_position_embeddings, so generation length (and therefore latency) is uncontrolled; the script below uses max_new_tokens=40 for both runs.

Script:

from transformers import AutoTokenizer
import torch
from transformers import DonutProcessor, VisionEncoderDecoderModel
import time
from optimum.onnxruntime import ORTModelForVision2Seq
from PIL import Image
import onnxruntime as onnxrt

device = "cuda"
hf_model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2").to(device)

onnx_dir = "/path/to/donut_onnx"
tokenizer = AutoTokenizer.from_pretrained(onnx_dir)
#session_options = onnxrt.SessionOptions()
#session_options.log_severity_level = 1
#session_options.enable_profiling = True
#session_options.graph_optimization_level = onnxrt.GraphOptimizationLevel.ORT_ENABLE_EXTENDED
onnx_model = ORTModelForVision2Seq.from_pretrained(onnx_dir, use_io_binding=False, provider="CPUExecutionProvider")
processor = DonutProcessor.from_pretrained(onnx_dir)

image = Image.open('/path/to/example_donut.jpg')
task_prompt = "<s_cord-v2>"
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

decoder_input_ids = decoder_input_ids.to(device)

pixel_values = processor(image, return_tensors="pt").pixel_values.to(device)

# warmup
outputs = hf_model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_new_tokens=40,
    use_cache=True,
    num_beams=1,
)

start = time.time()
for i in range(5):
    outputs = hf_model.generate(
        pixel_values,
        decoder_input_ids=decoder_input_ids,
        max_new_tokens=40,
        use_cache=True,
        num_beams=1,
    )
end = time.time()
print(processor.batch_decode(outputs))
print(f"Time torch: {end - start} s")

# warmup
generated_ids = onnx_model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_new_tokens=40,
    use_cache=True,
    num_beams=1,
)

start = time.time()
for i in range(5):
    generated_ids = onnx_model.generate(
        pixel_values,
        decoder_input_ids=decoder_input_ids,
        max_new_tokens=40,
        use_cache=True,
        num_beams=1,
    )
end = time.time()
print(processor.batch_decode(generated_ids))
print(f'Time ONNX: {end - start} s')

With this script I get

['<s_cord-v2><s_menu><s_nm> Market Share Infographics</s_nm><s_unitprice> Rupiter</s_unitprice><s_cnt> 1</s_cnt><s_price> Despite being red,</s_nm><s_unitprice> a gas giant and</s_unitprice><s_cnt> 1</s_cnt><s_price> the biggest planet in</s_price></s_menu></s>']
Time torch: 1.565211296081543 s
['<s_cord-v2><s_menu><s_nm> Market Share Infographics</s_nm><s_unitprice> Rupiter</s_unitprice><s_cnt> 1</s_cnt><s_price> Despite being red,</s_nm><s_unitprice> a gas giant and</s_unitprice><s_cnt> 1</s_cnt><s_price> the biggest planet in</s_price></s_menu></s>']
Time ONNX: 4.796400308609009 s

which is obviously not good. I am pretty sure we need IO Binding @mht-sharma @IlyasMoutawwakil

mht-sharma commented 11 months ago

Currently, IO Binding is not supported for this class. Hence, ORT inference latency can end up higher than PyTorch's.

fxmarty commented 11 months ago

There should be a meaningful error if IO Binding is not supported.

IlyasMoutawwakil commented 11 months ago

With IO Binding it's as fast as PyTorch. We might need to investigate further optimizations.
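
For reference, a minimal sketch of how IO Binding is typically enabled on the ORTModel classes that already support it (for ORTModelForVision2Seq this only becomes available with the PR linked above); the directory name is the one from the reproduction:

from optimum.onnxruntime import ORTModelForVision2Seq

# With IO Binding, inputs and outputs stay on the GPU instead of round-tripping through CPU numpy arrays.
onnx_model = ORTModelForVision2Seq.from_pretrained(
    "eng_donut_onnx",
    use_io_binding=True,
    provider="CUDAExecutionProvider",
)
onnx_model.to("cuda")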

aftnix commented 11 months ago
[Screenshot: ONNX Runtime profiling trace, 2023-09-05 12:16 PM]

The trace seems to suggest the GPU is mostly busy running model_run and SequentialExecutor::Execute.

Some samples from the trace:

{"cat" : "Session","pid" :808,"tid" :808,"dur" :221612,"ts" :10789290,"ph" : "X","name" :"model_run","args" : {}}, {"cat" : "Session","pid" :808,"tid" :808,"dur" :191881,"ts" :39598709,"ph" : "X","name" :"SequentialExecutor::Execute","args" : {}},

The top 100 ops by "dur" are all model_run or SequentialExecutor::Execute entries. I left out the model_load/session_initialize ones, as they are one-time costs.
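
For anyone repeating this analysis, a minimal sketch of how an ONNX Runtime profiling trace (written when session_options.enable_profiling = True is set, as in the reproduction) can be aggregated by event name; the file path is hypothetical:

import json
from collections import defaultdict

# Sum the "dur" field (microseconds) per event name from the profiling trace.
with open("onnxruntime_profile.json") as f:  # hypothetical path to the generated trace
    events = json.load(f)

totals = defaultdict(int)
for event in events:
    if "dur" in event:
        totals[event["name"]] += event["dur"]

for name, dur in sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{dur:>12} us  {name}")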

Calvinnncy97 commented 10 months ago

Hi, may I know if there are any updates on this? I tried using ORTModelForVision2Seq with Nougat. However, it runs at half the speed of PyTorch.