microsoft / onnxruntime-genai

Generative AI extensions for onnxruntime
MIT License

[feature request] builder to expose {GQA, MHA} selection as argument #880

Open BowenBao opened 2 months ago

BowenBao commented 2 months ago

Currently the choice between GQA and MHA is inferred from a combination of other configuration settings, such as device and dtype. It would be more flexible for downstream users if the attention op could be selected explicitly.

baijumeswani commented 2 months ago

What is the advantage of doing it this way? The current approach relies on the model builder knowing which attention operator to use for a specific device and dtype.

Is this for experimentation purposes? If so, maybe we can expose an extra_options flag to override the default attention operator.

BowenBao commented 1 month ago

Hi @baijumeswani, the idea is to decouple the built attention op from device/dtype. Consider custom EPs (execution providers) that implement the attention op with a dtype not supported by the ORT CPU/CUDA EPs.
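
For illustration only, here is a minimal sketch of how an override along the lines discussed in this thread could work. It is not the actual onnxruntime-genai builder code. The option key `attention_op`, the helper names, and the default table are all hypothetical, and the real builder's device/dtype rules may differ.

```python
# Hypothetical sketch: NOT the actual onnxruntime-genai builder logic.
VALID_ATTENTION_OPS = {"GroupQueryAttention", "MultiHeadAttention"}

# Illustrative defaults only; the real device/dtype inference may differ.
DEFAULT_ATTENTION_OP = {
    ("cuda", "fp16"): "GroupQueryAttention",
    ("cpu", "fp32"): "MultiHeadAttention",
}


def parse_extra_options(pairs):
    """Turn a list of "key=value" strings into a dict."""
    options = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {pair!r}")
        options[key.strip()] = value.strip()
    return options


def select_attention_op(device, dtype, extra_options):
    """Return the attention op, letting an explicit override win."""
    override = extra_options.get("attention_op")  # hypothetical key
    if override is not None:
        if override not in VALID_ATTENTION_OPS:
            raise ValueError(f"unknown attention op: {override!r}")
        return override
    # No override: fall back to the device/dtype-based default.
    return DEFAULT_ATTENTION_OP.get((device, dtype), "MultiHeadAttention")


if __name__ == "__main__":
    # A custom EP case: a dtype outside the default table, op chosen explicitly.
    opts = parse_extra_options(["attention_op=GroupQueryAttention"])
    print(select_attention_op("cpu", "bf16", opts))
```

With this shape, existing users keep the inferred default, while a custom EP that supports an op for an otherwise unsupported dtype can request it explicitly.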