Open oborchers opened 3 years ago
Hello, thanks for your work on this, @oborchers! I also saw the notebook on SentenceTransformers and it helped a lot!
Any status on this feature? I also need to run models in ONNX, but most of them need to call a .generate function, which is not supported for now... (I could replicate all the generate code in Node.js, but I'm sure there is a nicer solution.)
Is there any fix, status update, or hack?
Thanks a lot in advance, have a great day.
Hi @Ierezell! Welcome! Glad the notebook helped.
Not from my side, unfortunately. I had postponed the issue on our side because we had more pressing things to work on. But lately, the need for this feature has been growing at our company as well. Does your solution work by porting the .generate function and placing it on top of the ONNX version?
Edit:
Just out of curiosity, I went through the .generate code, and it should be possible to place the existing .generate code on top of a CausalLMOutput model, very similar to what is done in the notebook. This requires an extension of the forward method.
In an initial implementation, it should be perfectly sufficient to port just the sample section and see if it works. However, this does not necessarily apply to beam_search, which I haven't figured out yet. The raw implementation shouldn't be too complex, because one could strip away a number of the convenience functions/arguments.
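As a rough illustration of that idea (all names below are stand-ins, not the transformers API), porting the core loop of .generate on top of an ONNX forward pass could look like this, using greedy decoding in place of sampling for determinism:

```python
# Sketch: the .generate loop re-implemented on top of an ONNX forward pass.
# `onnx_forward` stands in for a call to an ONNX Runtime session
# (e.g. session.run(None, {"input_ids": [ids]})); here it is a stub so the
# loop is runnable.

EOS_TOKEN_ID = 2

def onnx_forward(input_ids):
    # Stub model: always "predicts" (last id + 1) mod vocab_size.
    vocab_size = 5
    next_id = (input_ids[-1] + 1) % vocab_size
    return [1.0 if i == next_id else 0.0 for i in range(vocab_size)]

def greedy_generate(input_ids, max_length=10, eos_token_id=EOS_TOKEN_ID):
    """Greedy decoding: append the argmax token until EOS or max_length."""
    ids = list(input_ids)
    while len(ids) < max_length:
        logits = onnx_forward(ids)
        next_token = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_token)
        if next_token == eos_token_id:
            break
    return ids

print(greedy_generate([0]))  # [0, 1, 2]: stops at the EOS token 2
```

A real port would replace onnx_forward with the exported session and add sampling, length penalties, etc. on top; note that max_length and eos_token_id are plain call-time arguments here rather than export-time constants.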
One downside is that there needs to be some way of defining the arguments of .generate at runtime for inference. For example, the min_length, max_length, and eos_token_id parameters should be included in the forward method arguments, because otherwise they would be static, defined via configuration at export time. That may be sensible for some applications, but it requires re-exporting the model every time those values change, which isn't really a nice way of doing this. Or at least that's my understanding, unless I've missed something completely.
Best regards and have a nice Easter
Hi @oborchers,
I still haven't implemented "my" solution, as I wanted to know whether there was any solution other than writing all the logic again. I would rather not, and exporting the logic in the forward method (and thus into the ONNX model) seems to be the best solution.
For the x_length arguments, that's a downside; would passing them as optional arguments in the forward method work?
I need to focus on other things right now, but I'll definitely keep an eye open for this!
Have a great day
Hi, any update on how to export full pipelines to ONNX?
For now, we're still obliged to keep custom / Hugging Face library code to handle the "post output embeddings" logic...
Thanks in advance, have a great day
Hi @Ierezell!
Sorry for not coming back to this issue. To be honest, for our use case there are quite a few problems we've encountered in exporting full pipelines to ONNX:
- Handling past_key_values: the exported graph needs to accept and return past_key_values for efficient decoding, which the plain export does not cover.
This blog post from Microsoft may help though:
Hi @oborchers, thanks a lot for the feedback!
ONNX is nice for me to be able to change stacks (JavaScript, etc.), but in light of what you're saying, it will be better to keep my GPU inference server.
Thanks a lot, Have a great day !
Hi,
Is there an alternative to ONNX that you'd recommend? The ability to keep and manipulate past_key_values is the most crucial part, and I cannot find it in many inference optimizations.
Thank you!
🚀 Feature request
I am requesting an option to specify the output format when exporting the translation_xx_to_yy pipeline to ONNX. Currently, the output of convert_graph_to_onnx.convert provides the raw tensors (working prototype code under #9722).
Motivation
When putting the models into production, it would be great if one could choose whether the actual tensors or the output tokens are returned when exporting a translation pipeline to ONNX. That way, one is not forced into a custom re-implementation of the model.generate function which uses the ONNX model instead of the torch one.
As of now, the part which could be replaced by an ONNX inference session lives under the model.generate function. Using this in production would mean keeping a TranslationPipeline object with all the corresponding model information and config, plus an ONNX inference session.
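To make the two output formats concrete: today the exported graph returns raw logits, and turning them into token ids is left to the caller. A minimal, illustrative post-processing step (not part of convert_graph_to_onnx; names are assumptions):

```python
# Turning a (seq_len x vocab_size) logits matrix, as the exported ONNX
# graph returns it today, into token ids via a per-position argmax.

def logits_to_token_ids(logits):
    """Argmax over the vocab axis for each sequence position."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

logits = [
    [0.1, 2.0, 0.3],  # position 0 -> token 1
    [1.5, 0.2, 0.1],  # position 1 -> token 0
]
print(logits_to_token_ids(logits))  # [1, 0]
```

The feature request is essentially to have the exported pipeline do this step (and the surrounding generate loop) inside the graph, instead of in user code.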
Your contribution
There may be multiple solutions to this problem:
User-specific re-implementation of model.generate (this is what I'll try to accomplish in the future).
Is it possible to rewrite the code under model.generate in pure torch? Then it should be possible to create a custom model for all translation models that just places this "generate layer" on top. I have provided an example here which adds a simple pooling layer on top of an existing transformers model. (That would require more study on my side to develop a prototype, and follows step 1.)
Provide support for the ort-customops library by Microsoft. Essentially, this enables ONNX to handle strings (but introduces a dependency on a very experimental extension). For example, that way one can export the Universal Sentence Encoder (including the tokenizer) to ONNX. Example here. I cannot provide anything useful here.