Open: mrseeker opened this issue 2 years ago
It seems that GPT-Neo uses local causal self-attention, a sliding-window variant in which each token can only attend to the previous window_size tokens. Our attention CUDA kernel accepts a 4D mask (Bx1xMxM), as in Megatron, where M is the maximum sequence length.
I think the GPT-Neo optimization could adopt such an approach. The optimized graph might be similar to the GPT-2 model, but with a 4D mask input (a rough sketch of building such a mask is shown below). For the optimization, the graph fusion tool needs to be modified; see fusion_gpt_attention.py for an example. For more information, see the dev guide.
Contributions are welcome.
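For illustration, here is a minimal sketch of how such a Bx1xMxM additive mask could be built by combining the causal constraint with the sliding window. The function name and the example window_size are my own placeholders, not values taken from the fusion tool or a specific GPT-Neo config:

```python
import numpy as np

def build_local_causal_mask(batch_size, max_seq_len, window_size):
    """Build a (B, 1, M, M) additive attention mask for local causal attention.

    Query position i may attend to key position j only if j <= i (causal)
    and i - j < window_size (sliding window). Allowed positions get 0.0;
    disallowed positions get a large negative value so softmax drives their
    attention weights to ~0.
    """
    i = np.arange(max_seq_len)[:, None]  # query positions
    j = np.arange(max_seq_len)[None, :]  # key positions
    allowed = (j <= i) & (i - j < window_size)
    mask_2d = np.where(allowed, 0.0, -10000.0).astype(np.float32)
    # Broadcast to the Bx1xMxM layout expected by the fused attention kernel.
    return np.broadcast_to(mask_2d, (batch_size, 1, max_seq_len, max_seq_len)).copy()

mask = build_local_causal_mask(batch_size=2, max_seq_len=1024, window_size=256)
print(mask.shape)  # (2, 1, 1024, 1024)
```

Note that GPT-Neo alternates global and local attention layers, so only the local layers would use the windowed mask; the global layers use the plain causal mask.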
following
I currently run GPT-Neo in ONNX. It took me a lot of time to get it running; the source files are in this repo: https://github.com/peterwilli/Endless-AWSW/blob/main/EndlessServer/src/onnx_model_manager.py#L52
If anyone's interested, I could try to make it more generic. I'm already planning to make a video about all the changes I made as soon as my project has reached a stable state (and is out of the research phase).
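For anyone who wants to experiment before a generic version exists, the overall flow looks roughly like the sketch below. This is not the code from the linked repo, just an illustration; the model size, opset version, and the choice to export without past key/values are simplifying assumptions on my part:

```python
import numpy as np
import onnxruntime as ort
import torch
from transformers import AutoTokenizer, GPTNeoForCausalLM

model_id = "EleutherAI/gpt-neo-125M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# use_cache=False / return_dict=False keep the exported graph simple:
# a single "logits" output and no past key/value states.
model = GPTNeoForCausalLM.from_pretrained(
    model_id, use_cache=False, return_dict=False
).eval()

dummy = tokenizer("hello world", return_tensors="pt")
torch.onnx.export(
    model,
    (dummy["input_ids"], {"attention_mask": dummy["attention_mask"]}),
    "gpt_neo.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "logits": {0: "batch", 1: "sequence"},
    },
    opset_version=13,
)

# Run the exported graph with ONNX Runtime and greedily pick the next token.
session = ort.InferenceSession("gpt_neo.onnx")
inputs = tokenizer("The quick brown fox", return_tensors="np")
logits = session.run(
    ["logits"],
    {"input_ids": inputs["input_ids"], "attention_mask": inputs["attention_mask"]},
)[0]
next_token = int(np.argmax(logits[0, -1]))
print(tokenizer.decode([next_token]))
```

A production-quality export would also wire up past key/values (as the GPT-2 demo does) so generation does not re-run the full prefix on every step; that caching path, together with the attention fusion, is the part that needs GPT-Neo-specific handling.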
@peterwilli I'd be interested to see this made more generic, ideally adaptable to GPT-J.
Is your feature request related to a problem? Please describe.
I see quite a few GPT-2 implementations, but I am missing a GPT-Neo/GPT-J implementation. GPT-Neo is a variant of GPT-2 that is similar in concept but gives much better results. Documentation about GPT-Neo can be found here: https://huggingface.co/docs/transformers/model_doc/gpt_neo
System information
Describe the solution you'd like
A demo showing that GPT-Neo and GPT-J models can be used with ONNX to speed up Hugging Face models.
Describe alternatives you've considered
Since GPT-2 with an LM head produces the same kind of output as GPTNeoForCausalLM, it should be possible to clone the GPT-2 demo (a quick shape check is sketched below). However, the optimization will need to be done differently.
Additional context
I am one of the contributors behind KoboldAI, an OSS application that uses GPT-Neo and GPT-J models for creating short novels and text. I am looking for ways to optimize and speed up the models we distribute.
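As a quick sanity check of the "clone the GPT-2 demo" idea above, the snippet below (my own illustrative check, not part of any existing demo) confirms that GPT2LMHeadModel and GPTNeoForCausalLM expose the same forward interface and logits shape, which is what makes reusing the GPT-2 demo's input/output handling plausible; the attention optimization is where they differ:

```python
import torch
from transformers import AutoTokenizer, GPT2LMHeadModel, GPTNeoForCausalLM

# GPT-Neo reuses the GPT-2 BPE tokenizer (50257-token vocabulary),
# so the same inputs can be fed to both models.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
inputs = tokenizer("ONNX Runtime makes models faster", return_tensors="pt")

gpt2 = GPT2LMHeadModel.from_pretrained("gpt2").eval()
neo = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M").eval()

with torch.no_grad():
    gpt2_logits = gpt2(**inputs).logits
    neo_logits = neo(**inputs).logits

# Both return logits of shape (batch, sequence, vocab_size=50257).
print(gpt2_logits.shape, neo_logits.shape)
assert gpt2_logits.shape == neo_logits.shape
```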