microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

Support for GPT-Neo and GPT-J models #10196

Open mrseeker opened 2 years ago

mrseeker commented 2 years ago

Is your feature request related to a problem? Please describe. I see quite a few GPT-2 implementations, but I am missing a GPT-Neo/GPT-J implementation. GPT-Neo is a variant of GPT-2, quite similar in concept but with much better results. Documentation about GPT-Neo can be found here: https://huggingface.co/docs/transformers/model_doc/gpt_neo


Describe the solution you'd like A demo that shows that GPT-Neo and GPT-J models can be used with ONNX Runtime to speed up Hugging Face models.

Describe alternatives you've considered Since GPT-2 with an LM head (GPT2LMHeadModel) produces the same kind of output as GPTNeoForCausalLM, it should be possible to clone the existing GPT-2 demo (a rough export sketch follows below). However, the graph optimization will need to be done differently.

Additional context I am one of the contributors behind KoboldAI, an OSS application that uses GPT-Neo and GPT-J models for writing short novels and other text. I am looking for ways to optimize and speed up the models we distribute.
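
Here is a minimal export sketch along the lines of the alternative described above. It is an illustration, not an official demo: the checkpoint name (`EleutherAI/gpt-neo-125M`), the output file name, the opset, and the choice to disable the KV cache for a simpler graph are all assumptions on my part.

```python
# Minimal sketch: export GPTNeoForCausalLM to ONNX the way the GPT-2 demos do.
# Assumptions: checkpoint, file name, and opset are illustrative; the KV cache
# is disabled, so this graph recomputes the full context on every call.
import torch
from transformers import AutoTokenizer, GPTNeoForCausalLM

model_name = "EleutherAI/gpt-neo-125M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = GPTNeoForCausalLM.from_pretrained(model_name)
model.config.use_cache = False    # no past key/values: simpler, slower graph
model.config.return_dict = False  # export a plain tuple output
model.eval()

input_ids = tokenizer("Hello, my dog is cute", return_tensors="pt")["input_ids"]

torch.onnx.export(
    model,
    (input_ids,),
    "gpt_neo.onnx",
    input_names=["input_ids"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "logits": {0: "batch", 1: "sequence"},
    },
    opset_version=13,
)
```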

tianleiwu commented 2 years ago

It seems that GPT-Neo uses local causal self attention: a sliding-window scheme in which each token can only attend to the previous window_size tokens. Our Attention CUDA kernel accepts a 4D mask (Bx1xMxM), as in Megatron, where M is the max sequence length.
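
To make the mask concrete, here is a small sketch (my own illustration, not kernel code) of how a Bx1xMxM mask for local causal attention could be built: position i may attend to position j only if j <= i and i - j < window_size.

```python
import torch

def local_causal_mask(batch_size: int, max_seq_len: int, window_size: int) -> torch.Tensor:
    i = torch.arange(max_seq_len).unsqueeze(1)  # query positions, shape M x 1
    j = torch.arange(max_seq_len).unsqueeze(0)  # key positions,   shape 1 x M
    allowed = (j <= i) & (i - j < window_size)  # causal AND within the window
    # Broadcast to B x 1 x M x M to match the Megatron-style mask input.
    return allowed.unsqueeze(0).unsqueeze(0).expand(batch_size, 1, -1, -1)

mask = local_causal_mask(batch_size=2, max_seq_len=8, window_size=4)
print(mask[0, 0].int())  # lower-triangular band of width 4
```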

I think the optimization for GPT-Neo could adopt that approach: the optimized graph would be similar to the GPT-2 model, but with a 4D mask input. To implement it, the graph fusion tool needs to be modified; see fusion_gpt_attention.py for an example. For more information, see the dev guide.
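
For reference, this is roughly how the existing fusion tooling is driven for GPT-2 today; a GPT-Neo path would presumably extend fusion_gpt_attention.py before the same entry point applies. The file names and the head/hidden sizes (which match gpt-neo-125M) are assumptions on my part.

```python
from onnxruntime.transformers import optimizer

# "gpt2" is the closest existing fusion recipe; without the fusion changes
# described above, it will not match GPT-Neo's local-attention subgraph.
opt_model = optimizer.optimize_model(
    "gpt_neo.onnx",
    model_type="gpt2",
    num_heads=12,     # gpt-neo-125M
    hidden_size=768,  # gpt-neo-125M
)
opt_model.save_model_to_file("gpt_neo_opt.onnx")
```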

Contributions are welcome.

isgursoy commented 2 years ago

following

peterwilli commented 2 years ago

I currently run GPT-Neo in ONNX. It took me a lot of time to get it running; the source files are in this repo: https://github.com/peterwilli/Endless-AWSW/blob/main/EndlessServer/src/onnx_model_manager.py#L52
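
For readers who just want the general shape without digging through that file, here is a minimal greedy-decoding sketch (my own illustration, not the code from that repo) against a graph exported like the one above:

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")
session = ort.InferenceSession("gpt_neo.onnx", providers=["CPUExecutionProvider"])

input_ids = tokenizer("The knight drew his sword", return_tensors="np")["input_ids"]
for _ in range(20):
    # No KV cache in this export, so the whole context is recomputed each step.
    logits = session.run(["logits"], {"input_ids": input_ids.astype(np.int64)})[0]
    next_id = logits[:, -1, :].argmax(axis=-1).reshape(-1, 1)  # greedy pick
    input_ids = np.concatenate([input_ids, next_id], axis=-1)

print(tokenizer.decode(input_ids[0]))
```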

If anyone's interested, I could try to make it more generic. I'm already planning to make a video about all the changes I made as soon as my project reaches a stable status (and is out of the research phase).

CrazyPython commented 2 years ago

@peterwilli I'd be interested to see this made more generic, ideally adaptable to GPT-J.