huggingface / optimum-intel

🤗 Optimum Intel: Accelerate inference with Intel optimization tools
https://huggingface.co/docs/optimum/main/en/intel/index
Apache License 2.0

OVModelForSeq2SeqLM with Helsinki-NLP/opus-mt-es-en has slow inference times when exported to OpenVino #339

Open tsmith023 opened 1 year ago

tsmith023 commented 1 year ago

I'm having trouble exporting the Helsinki-NLP/opus-mt-es-en translation model into the optimised OpenVINO IR format. Reading through the other issues in this repository turned up https://github.com/huggingface/optimum-intel/issues/188, which seems to describe similar symptoms.

In that case, the problem seemed to stem from the BigBird architecture not being supported by Hugging Face Optimum. The Helsinki-NLP/opus-mt-es-en model, however, is a MarianMT model, which is documented as supported.

Am I missing something fundamental here? Is conversion of MarianMT models to OpenVINO IR currently unsupported by this library, as it is for the BigBird models in the issue above? Or am I not specifying some aspect of the conversion correctly, so that the export ends up sub-optimal? From the documentation, it seems this should be possible.

In case it helps, I see the following in the build logs: Asked a sequence length of 16, but a sequence length of 1 will be used with use_past == True for 'decoder_input_ids'.

An MRE looks like:

from optimum.intel.openvino import OVModelForSeq2SeqLM
from transformers import AutoTokenizer, pipeline

# export the PyTorch checkpoint to OpenVINO IR on the fly
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-es-en")
ov_model = OVModelForSeq2SeqLM.from_pretrained(
    "Helsinki-NLP/opus-mt-es-en",
    export=True,
    use_cache=True,
)

def run(text: str):
    pipe = pipeline("translation_es_to_en", model=ov_model, tokenizer=tokenizer)
    return pipe(text)

def export_to_ov(save_dir: str):
    # write the exported IR (xml/bin) and config files to save_dir
    ov_model.save_pretrained(save_dir)

if __name__ == "__main__":
    export_to_ov("./")

Running run("Hola, como estas?") yields an inference time of roughly 0.63s, while using the exported OpenVINO IR binaries in an OVMS model pipeline yields an inference time of roughly 45s.
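
(As a sanity check on whether the slowdown comes from the export itself or from the serving pipeline, the exported encoder IR could also be timed directly with the plain OpenVINO Python runtime; a rough sketch, assuming the default openvino_encoder_model.xml file name written by save_pretrained and the usual input_ids / attention_mask input names:)

import time
import numpy as np
from openvino.runtime import Core  # plain OpenVINO runtime, no optimum-intel wrapper

core = Core()
# NOTE: the file name below is an assumption; adjust to match the files in your export directory
compiled = core.compile_model("./openvino_encoder_model.xml", "CPU")
request = compiled.create_infer_request()

# dummy encoder inputs (batch of 1, sequence length 16); input names are assumed
input_ids = np.ones((1, 16), dtype=np.int64)
attention_mask = np.ones((1, 16), dtype=np.int64)

start = time.time()
for _ in range(20):
    request.infer({"input_ids": input_ids, "attention_mask": attention_mask})
print("mean encoder latency:", (time.time() - start) / 20)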

Any help on this one would be greatly appreciated, cheers!

P.S. I can post the config.json file being passed to the OVMS instance, but it's very long so I'll leave it until it's required!

echarlaix commented 1 year ago

Hi @tsmith023,

Apologies for the late reply. Yes, MarianMT models are supported. Regarding the slow inference you're reporting: are you comparing the resulting OpenVINO model with the original PyTorch model and finding that the OpenVINO model's latency is higher?

I'm not able to reproduce this; could you confirm that you're still observing it with:

import time
import torch
from optimum.intel import OVModelForSeq2SeqLM
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "Helsinki-NLP/opus-mt-es-en"
ov_model = OVModelForSeq2SeqLM.from_pretrained(model_id, export=True, use_cache=True)
torch_model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokens = tokenizer("This is a sample input", return_tensors="pt")
decoder_inputs = {"decoder_input_ids": torch.ones((1, 1), dtype=torch.long) * torch_model.config.decoder_start_token_id }

def elapsed_time(model, nb_pass=20):
    start = time.time()
    for _ in range(nb_pass):
        model(**tokens, **decoder_inputs)
    end = time.time()
    return (end - start) / nb_pass

# warmup
elapsed_time(ov_model, nb_pass=5)

time_ov = elapsed_time(ov_model)
time_torch = elapsed_time(torch_model)
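
(For reference, the two averaged timings can then be compared directly; a trivial follow-up to the snippet above:)

print(f"OpenVINO: {time_ov:.4f}s / pass, PyTorch: {time_torch:.4f}s / pass")
print(f"relative speedup: {time_torch / time_ov:.2f}x")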

tsmith023 commented 1 year ago

Hi @echarlaix, the problem didn't surface when executing within the Python runtime, but when running the exported OpenVINO IR binaries within OpenVINO itself, i.e. the C++ runtime. I was comparing the performance of the exported model within the Python runtime against its performance within the C++ runtime.
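
(To double-check that the exported IR files themselves are not the bottleneck, they could also be reloaded from the save directory and timed through generate() in the Python runtime; a rough sketch, assuming the ./ save path from the MRE above:)

import time
from optimum.intel import OVModelForSeq2SeqLM
from transformers import AutoTokenizer

# load the previously exported IR from disk (no export=True, so no re-conversion)
local_model = OVModelForSeq2SeqLM.from_pretrained("./", use_cache=True)
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-es-en")

inputs = tokenizer("Hola, como estas?", return_tensors="pt")
start = time.time()
outputs = local_model.generate(**inputs)
print("generate() latency:", time.time() - start, "s")
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))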

Do you feel that this issue is better suited to the OpenVINO repository? I originally raised it here because I judged it to be a problem with the model export logic. Let me know whether I should relocate it there or whether you feel there is an implementation issue here 😁

@pbebbo