Open larryliu0820 opened 4 weeks ago
I like this direction of modularizing the components and sharing them for different use-cases beyond those backed by ET.
It seems like what we could think about further is how to modularize it even more, i.e. not just reuse the Tokenizer and Sampler, but also most of the runner logic, if possible.
The Runner currently does some things that are common to both ET and AOTI while preparing the inputs and processing the outputs, e.g. the run loop, applying the chat pattern with special tokens, etc. So in theory we could have a common Runner pipeline and inject specific minimal components that do the actual forward() call, implemented differently for ET or AOTI.
So overall we could have a collection of abstract interfaces for Tokenizer, Sampler, Processor (i.e. image processor), Runner (to generate a single next token), Pipeline (to utilize the Runner for completion or multi-turn generation), etc., and also provide some concrete implementations for them, e.g. TikTokenTokenizer, ArgMaxSampler, ImageProcessor, LLaMA3Runner, LLaMAChatPipeline, etc. That way we provide a working solution out of the box, and also allow clients to implement their own components by inheriting from the interfaces and injecting them instead, e.g. a custom SentencePieceTokenizer injected into a custom LLaMA2Runner injected into LLaMAChatPipeline instead of the default LLaMA3Runner.
Such interfaces and default implementations can live in pytorch-labs, if that's a good place for everyone to depend on.
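To make this concrete, here is a rough C++ sketch of what such interfaces and injection points could look like. All names below (`ITokenizer`, `ISampler`, `ForwardFn`, `Runner`) are hypothetical and not taken from any existing repo; this is a sketch of the shape, not a proposed implementation.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical abstract interfaces. The idea is that they would live in a
// shared place (e.g. a pytorch-labs repo) so that ET- and AOTI-backed
// runners can both depend on them.
class ITokenizer {
 public:
  virtual ~ITokenizer() = default;
  virtual std::vector<uint64_t> encode(const std::string& text) const = 0;
  virtual std::string decode(uint64_t token) const = 0;
};

class ISampler {
 public:
  virtual ~ISampler() = default;
  // Pick the next token id from the model's output logits.
  virtual uint64_t sample(const std::vector<float>& logits) const = 0;
};

// One possible out-of-the-box implementation (greedy decoding).
class ArgMaxSampler : public ISampler {
 public:
  uint64_t sample(const std::vector<float>& logits) const override {
    size_t best = 0;
    for (size_t i = 1; i < logits.size(); ++i) {
      if (logits[i] > logits[best]) best = i;
    }
    return best;
  }
};

// The backend-specific piece: a single forward() call mapping the current
// token sequence to logits. An ET runner would execute a .pte program here,
// while an AOTI runner would call into an AOTI-compiled model instead.
using ForwardFn =
    std::function<std::vector<float>(const std::vector<uint64_t>& tokens)>;

// A common run loop that only depends on the interfaces above.
class Runner {
 public:
  Runner(const ITokenizer& tok, const ISampler& sampler, ForwardFn forward)
      : tok_(tok), sampler_(sampler), forward_(std::move(forward)) {}

  std::string generate(const std::string& prompt, size_t max_new_tokens) const {
    std::vector<uint64_t> tokens = tok_.encode(prompt);
    std::string out;
    for (size_t i = 0; i < max_new_tokens; ++i) {
      uint64_t next = sampler_.sample(forward_(tokens));
      tokens.push_back(next);
      out += tok_.decode(next);
    }
    return out;
  }

 private:
  const ITokenizer& tok_;
  const ISampler& sampler_;
  ForwardFn forward_;
};
```

A Pipeline for completion or chat would then sit on top of Runner, owning the prompt formatting and the multi-turn loop, and a custom tokenizer or sampler could be injected without touching the loop itself.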
I also really like the idea of unifying these C++ layers! I've been working on extending the existing C++ tokenizer support to handle parts of the tokenizers library from huggingface. I have a draft PR up for discussion, but the code should be very self-contained and portable if the decision is to move this level of support elsewhere.
A couple of key elements of the PR to point out that may be seeds for this discussion:

- `json` parsing

I'm not at all tied to the specific shapes of these implementations, but wanted to point them out as the discussion touches on how to wrangle these abstractions.
One note I did want to make as I was prototyping (I'll comment on the PR as well): I opted not to use enclosing namespaces since they were not used in the existing torchchat C++ layers. I think for a more reusable abstract layer, we would certainly want to use namespace encapsulation.
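For illustration, here is a rough sketch of how namespace encapsulation and JSON-config loading could fit together. The names are hypothetical, and the use of nlohmann/json is an assumption for the example, not necessarily what the draft PR does:

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <utility>
#include <vector>

#include <nlohmann/json.hpp>

// Hypothetical enclosing namespace: everything reusable lives under it
// instead of the global namespace.
namespace tokenizers {

class Tokenizer {
 public:
  virtual ~Tokenizer() = default;
  virtual std::vector<uint64_t> encode(const std::string& text) const = 0;
  virtual std::string decode(const std::vector<uint64_t>& ids) const = 0;
};

// Skeleton of a Hugging Face tokenizers-style implementation that loads its
// configuration from a tokenizer.json file (encode/decode definitions elided).
class HFTokenizer : public Tokenizer {
 public:
  explicit HFTokenizer(const std::string& json_path) {
    std::ifstream f(json_path);
    nlohmann::json config = nlohmann::json::parse(f);
    // A BPE-style tokenizer.json keeps its token -> id table under
    // "model"/"vocab"; merges, added tokens, etc. would be read similarly.
    for (const auto& [token, id] : config["model"]["vocab"].items()) {
      vocab_.emplace_back(token, id.get<uint64_t>());
    }
  }

  std::vector<uint64_t> encode(const std::string& text) const override;
  std::string decode(const std::vector<uint64_t>& ids) const override;

 private:
  std::vector<std::pair<std::string, uint64_t>> vocab_;
};

}  // namespace tokenizers
```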
🚀 The feature, motivation and pitch
Currently torchchat has its own implementations of these features (the export flow, the C++ tokenizers and runner, and the demo apps):
The problem with this design is that new features checked into the export flow in ExecuTorch are not picked up. What's worse, the demo apps hosted in torchchat expect a `.pte` file from the export flow in ExecuTorch rather than the one from torchchat, and that will easily break if changes happen to one or the other. A similar story applies to the C++ implementations of the tokenizers: the tokenizers in ExecuTorch look a lot like the tokenizers in torchchat, and the code should be unified to avoid duplication.
Alternatives
An alternative is to do nothing. If we keep the status quo, DevX will deteriorate due to the constant changes from ExecuTorch that we need to incorporate into torchchat.
RFC (Optional)
Proposal
On a high level we want to:

- Migrate torchchat's export flow to ET's `extension/llm` directory.
- Move the shared C++ tokenizer and runner code to `pytorch-labs`.
Export flow:

Currently torchchat uses `export.py` to export a model to ET's `.pte` file. Proposal: fully migrate to ET's `extension/llm`. New dependency: ET nightly build in pip.

Runner:
Torchchat's C++ runner needs to work for both AOTI and ET, so it's quite complicated.

Proposal 1 (preferred): move the shared tokenizer code into a new repo under the `pytorch-labs` organization, say `pytorch-labs/tokenizers`, and split the runner into `et_run.cpp` and `aoti_run.cpp`, both depending on `pytorch-labs/tokenizers`, so that ET and torchchat share the same `pytorch-labs/tokenizers` code.

Proposal 2 (short term?):
Model definition:

Torchchat depends on torchtune for model definition. All the source transformations will come from the ET `extension/llm` library. Modules that are modified to be `torch.export`-able will be hosted in ET's `extension/llm`, and torchchat should use those as well.

Example: torchtune's `MultiHeadAttention` has an input-dependent condition that needs to be rewritten with `torch.cond` so that it's exportable. This lives in `extension/llm/modules` and should be used by torchchat. [Pending discussion] If torchtune is open to hosting these exportable modules, torchchat should depend on torchtune to get them.

Demo app:
For both Android and iOS, we want to build the runner and tokenizer as libraries, package them into artifacts, and distribute them to torchchat. We are already doing this for the Android demo app. The iOS demo app code should live in torchchat as well, and both demo apps should be removed from ET in the future.