Open jenniew opened 1 week ago
My personal take on how torchchat (tc) might support a broader set of models:
Because the model description is part of the torchchat tree, the models that can be supported are naturally limited to those that fit the general infra torchchat provides.
Of course, model.py could be made arbitrarily complex, but that doesn't seem desirable. I can see three possible directions:
1. Add additional model-variant.py files for other model types. This ultimately hits the same limitation, because the set of supported models is limited to those distributed with torchchat. It may also raise rights issues, because some of these models may contain copyrighted or patented portions.
2. Build models from GGUF, following the --gguf-path approach described in docs/GGUF.md.
3. Allow users to bring their own model descriptions.
(2) requires GGUF import to track new features, and limits models to those supported by GGUF. (3) allows users to build new models, but requires integration for tokenization and for export (e.g., the HF cache is at present not exportable via AOTI and/or ET, afaik).
Here's an attempt at implementing a solution that allows users to bring their own models (does not support export, and sidesteps the query formatting by adding support for and using pre-tokenized text inputs) for phi-3-mini: https://github.com/mikekg/torchchat/tree/phichat
This introduces an option, --custom-builder, which can be used with the following invocation:
python torchchat.py generate --custom-builder torchchat/model_python/phi-3-mini.py:model_builder --tokenizer-path /content/torchchat/tokenizer.model --prompt "[32010, 739, 471, 263, 6501, 322, 14280, 29891, 4646, 29892, 322, 32007, 2]"
Example run: https://colab.research.google.com/drive/1HHONUbKqqXU9yU3BIrjH0dRWKdwgY34H?usp=sharing
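For readers who want a feel for what such an option involves, here is a minimal sketch of the two pieces the invocation implies: resolving a `file.py:function` spec to a callable, and turning a pre-tokenized prompt string into token ids. This is not the code from the phichat branch; `load_builder` and `parse_pretokenized` are hypothetical names, and the behavior shown (including error handling) is an assumption based only on the command line above.

```python
import ast
import importlib.util
from pathlib import Path
from typing import Callable, List


def load_builder(spec: str) -> Callable:
    """Resolve 'path/to/file.py:function_name' to the named callable.

    Hypothetical helper: the real --custom-builder option may work differently.
    """
    # Split on the last colon so paths that contain colons still work.
    path_str, sep, func_name = spec.rpartition(":")
    if not sep or not path_str or not func_name:
        raise ValueError(f"expected 'file.py:function', got {spec!r}")

    path = Path(path_str)
    # File names such as 'phi-3-mini.py' are not valid module names, so the
    # module is loaded from its file location rather than with a plain import.
    module_name = path.stem.replace("-", "_")
    module_spec = importlib.util.spec_from_file_location(module_name, path)
    if module_spec is None or module_spec.loader is None:
        raise ImportError(f"cannot load a module from {path}")

    module = importlib.util.module_from_spec(module_spec)
    module_spec.loader.exec_module(module)

    builder = getattr(module, func_name)
    if not callable(builder):
        raise TypeError(f"{func_name!r} in {path} is not callable")
    return builder


def parse_pretokenized(prompt: str) -> List[int]:
    """Parse a prompt such as '[32010, 739, 2]' into a list of token ids."""
    try:
        ids = ast.literal_eval(prompt)
    except (ValueError, SyntaxError) as exc:
        raise ValueError(f"prompt is not a list literal: {prompt!r}") from exc
    if not isinstance(ids, list) or not all(type(i) is int for i in ids):
        raise ValueError(f"prompt must be a list of integers: {prompt!r}")
    return ids
```

Loading by file location keeps the user's model description outside the torchchat tree, which is the point of direction (3); the trade-off is that the file runs as arbitrary code, so it should only be used with trusted sources.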
To make it exportable, we'd want to avoid using components that can't be exported (likely the HF cache, possibly others), either by changing the source code directly, by using a model rewrite for those components similar to what we use today for quantization in torchchat for AOTI & ET, or by introducing the ET optimization sdpa_with_kv_cache for mobile backends.
Great question, @jenniew.
As you mentioned, model support is currently biased towards Llama/Transformer architectures, but we intend for the inference pipeline to be model-agnostic. The upcoming models are Llava and the Granite Code models (though both are Transformer-based), with Mamba (SSM) also on my radar.
The ultimate plan is to create a simple interface between Model Definitions (architecture, compile, export) and Inference Pipeline (generate, chat, browser, openai api) such that onboarding becomes easier (e.g. leaning on torchtune for models instead of hosting it ourselves).
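To make the idea of a narrow boundary concrete, here is a toy sketch of how a model definition and an inference loop could be decoupled. Nothing here reflects torchchat's actual or planned API: `ModelDefinition`, `next_token`, `CountingModel`, and `generate` are hypothetical names chosen for illustration, and a real interface would also have to cover compile and export.

```python
from typing import List, Protocol


class ModelDefinition(Protocol):
    """Hypothetical contract: the pipeline only needs a next-token function."""

    def next_token(self, tokens: List[int]) -> int:
        ...


class CountingModel:
    """Toy stand-in for a real model: predicts last token + 1, capped at EOS."""

    def __init__(self, eos_id: int) -> None:
        self.eos_id = eos_id

    def next_token(self, tokens: List[int]) -> int:
        return min(tokens[-1] + 1, self.eos_id)


def generate(
    model: ModelDefinition,
    prompt: List[int],
    max_new_tokens: int,
    eos_id: int,
) -> List[int]:
    """Architecture-agnostic decode loop: stops at EOS or the token budget."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        token = model.next_token(tokens)
        tokens.append(token)
        if token == eos_id:
            break
    return tokens


if __name__ == "__main__":
    print(generate(CountingModel(eos_id=5), [1, 2], max_new_tokens=10, eos_id=5))
```

Because `generate` depends only on the protocol, any object with a matching `next_token` method can be plugged in, whether it wraps a Transformer, an SSM, or a model defined in another library.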
@mikekgfb shows a promising approach above and also mentions GGUF as an option.
@jenniew If you have a particular model/architecture/artifact in mind, you can share it here or send me a message, and we can give more detailed suggestions.
As @Jack-Khuu mentioned, we need to make some architecture changes and create a model-adding flow so that it's easy for anyone to add models.
In the meantime, feel free to ask for a specific model.
🚀 The feature, motivation and pitch
I see that torchchat currently supports only a few kinds of models, such as Llama-based (or Llama-like) architectures and pre-defined Transformer architectures. Is there any plan to support other kinds of model architectures in the future? Which kinds of models are you considering adding? If a new model's architecture is not on the supported list, is there a way to run it?
Alternatives
No response
Additional context
No response
RFC (Optional)
No response