Open oborchers opened 3 years ago
Hello, thanks for your work on this, @oborchers! I also saw the notebook on SentenceTransformers and it helped a lot!
Any status on this feature? I also need to run models in ONNX, but most of them need to call a .generate function, which is not supported for now... (I could replicate all the generate code in Node.js, but I'm sure there is a nicer solution.)
Is there any fix, status update, or hack?
Thanks a lot in advance, have a great day.
Hi @Ierezell! Welcome! Glad the notebook helped.
Not from my side, unfortunately. I had postponed the issue on our side because we had more pressing things to work on. But lately, the need for this feature has been growing at our company as well. Does your solution work by porting the .generate function and placing it on top of the ONNX version?
Edit:
Just out of curiosity, I went through the .generate code, and it should be possible to place the existing .generate code on top of a CausalLMOutput model, very similar to what is done in the notebook. This requires an extension of the forward method.
In an initial implementation, it should be perfectly sufficient to port just the sample section and see if it works. However, this does not necessarily apply to beam_search, which I haven't figured out yet. The raw implementation shouldn't be too complex, because one could strip away a number of the convenience functions/arguments.
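As a rough illustration of that idea (all names below are stand-ins, not the transformers API), porting the core loop of .generate on top of an ONNX forward pass could look like this, using greedy decoding in place of sampling for determinism:

```python
# Sketch: the .generate loop re-implemented on top of an ONNX forward pass.
# `onnx_forward` stands in for a call to an ONNX Runtime session
# (e.g. session.run(None, {"input_ids": [ids]})); here it is a stub so the
# loop is runnable.

EOS_TOKEN_ID = 2

def onnx_forward(input_ids):
    # Stub model: always "predicts" (last id + 1) mod vocab_size.
    vocab_size = 5
    next_id = (input_ids[-1] + 1) % vocab_size
    return [1.0 if i == next_id else 0.0 for i in range(vocab_size)]

def greedy_generate(input_ids, max_length=10, eos_token_id=EOS_TOKEN_ID):
    """Greedy decoding: append the argmax token until EOS or max_length."""
    ids = list(input_ids)
    while len(ids) < max_length:
        logits = onnx_forward(ids)
        next_token = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_token)
        if next_token == eos_token_id:
            break
    return ids

print(greedy_generate([0]))  # [0, 1, 2]: stops at the EOS token 2
```

A real port would replace onnx_forward with the exported session and add sampling, length penalties, etc. on top; note that max_length and eos_token_id are plain call-time arguments here rather than export-time constants.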
One downside is that there needs to be some way of defining the arguments of .generate at runtime for inference. For example, the min_length, max_length, and eos_token_id parameters should be included in the forward method arguments, because otherwise they would be static, defined via configuration at export time. That may be sensible for some applications, but it requires re-exporting the model every time those values change, which isn't really a nice way of doing this. Or at least that's my understanding, unless I've missed something completely.
Best regards and have a nice Easter
Hi @oborchers,
I still haven't implemented "my" solution, as I wanted to know whether there was any solution other than writing all the logic again. I would rather not, and exporting the logic in the forward method (and thus into the ONNX model) seems to be the best solution.
For the x_length arguments, that's a downside; would passing them as optional arguments in the forward method work?
I need to focus on other things right now, but I'll definitely keep an eye open for this!
Have a great day
Hi, any update on how to export full pipelines to ONNX?
For now, we're still obliged to keep custom / Hugging Face library code to handle the "post output embeddings" logic...
Thanks in advance, have a great day
Hi @Ierezell!
Sorry for not coming back to this issue. To be honest, for our use case there are quite a few problems we've encountered in exporting full pipelines to ONNX:
- Handling past_key_values: the exported graph needs to accept and return past_key_values for efficient decoding, which the plain export does not cover.
This blog post from Microsoft may help though:
Hi @oborchers, thanks a lot for the feedback!
ONNX is nice for me to be able to change stacks (JavaScript, etc.), but in light of what you're saying, it will be better to keep my GPU inference server.
Thanks a lot, Have a great day !
Hi,
Is there an alternative to ONNX that you'd recommend? The ability to keep and manipulate past_key_values is the most crucial part, and I cannot find it in many inference optimizations.
Thank you!
🚀 Feature request
I am requesting an option to specify the output format when exporting the translation_xx_to_yy pipeline to ONNX. Currently, the output of convert_graph_to_onnx.convert provides the raw tensors (working prototype code under #9722).
Motivation
When putting the models into production, it would be great if one could choose whether the actual tensors or the output tokens are returned when exporting a translation pipeline to ONNX. That way, one is not forced into a custom re-implementation of the model.generate function which uses the ONNX model instead of the torch one.
As of now, the part which could be replaced by an ONNX inference session lives under the model.generate function. Using this in production would mean keeping a TranslationPipeline object with all the corresponding model information and config, plus an ONNX inference session.
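To make the two output formats concrete: today the exported graph returns raw logits, and turning them into token ids is left to the caller. A minimal, illustrative post-processing step (not part of convert_graph_to_onnx; names are assumptions):

```python
# Turning a (seq_len x vocab_size) logits matrix, as the exported ONNX
# graph returns it today, into token ids via a per-position argmax.

def logits_to_token_ids(logits):
    """Argmax over the vocab axis for each sequence position."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

logits = [
    [0.1, 2.0, 0.3],  # position 0 -> token 1
    [1.5, 0.2, 0.1],  # position 1 -> token 0
]
print(logits_to_token_ids(logits))  # [1, 0]
```

The feature request is essentially to have the exported pipeline do this step (and the surrounding generate loop) inside the graph, instead of in user code.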
Your contribution
There may be multiple solutions to this problem:
User-specific re-implementation of model.generate (this is what I'll try to accomplish in the future).
Is it possible to rewrite the code under model.generate in pure torch? Then it should be possible to create a custom model for all translation models that just places this "generate layer" on top. I have provided an example here which adds a simple pooling layer on top of an existing transformers model. (That would require more study on my side to develop a prototype, and follows step 1.)
Provide support for the ort-customops library by Microsoft. Essentially, this enables ONNX to handle strings (but introduces a dependency on a very experimental extension). For example, that way one can export the Universal Sentence Encoder (including the tokenizer) to ONNX. Example here. I cannot provide anything useful here.