**gabe-l-hart** opened 4 weeks ago
Also, I used the following script to perform conversion of a pre-existing HF snapshot. It's similar to the `if __name__ == "__main__"` block in `convert_hf_checkpoint.py`:
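Roughly along these lines (the import path and keyword arguments below are illustrative assumptions rather than a verified copy):

```python
# Rough sketch only: mirrors the `if __name__ == "__main__"` block in
# convert_hf_checkpoint.py. The import path and keyword arguments are assumptions
# and may not match the current torchchat source exactly.
import argparse
from pathlib import Path

from torchchat.cli.convert_hf_checkpoint import convert_hf_checkpoint

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Convert a pre-existing HF snapshot into torchchat's checkpoint format"
    )
    parser.add_argument("--model-dir", type=Path, required=True,
                        help="Directory containing the downloaded HF snapshot")
    parser.add_argument("--model-name", type=str, default=None,
                        help="Name used to look up model params")
    args = parser.parse_args()

    convert_hf_checkpoint(model_dir=args.model_dir, model_name=args.model_name)
```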
I confirmed that it was falling back to the `llama2` chat formatter because it wasn't using `tiktoken`. I've added basic `jinja2` chat template support when using the HF tokenizer.
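The gist of the `jinja2` templating (the template string here is a made-up stand-in; in practice it comes from the model's `tokenizer_config.json` `chat_template` field and is applied through the HF tokenizer wrapper):

```python
# Minimal sketch of jinja2-based chat templating. The template below is a
# stand-in; the real one comes from the model's tokenizer_config.json.
import jinja2

chat_template = (
    "{% for message in messages %}"
    "<|{{ message.role }}|>\n{{ message.content }}\n"
    "{% endfor %}"
    "{% if add_generation_prompt %}<|assistant|>\n{% endif %}"
)

messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a python function to sort numbers."},
]

env = jinja2.Environment(undefined=jinja2.StrictUndefined)
prompt = env.from_string(chat_template).render(
    messages=messages, add_generation_prompt=True
)
print(prompt)
```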
A pointer to this PR and the example commands from the PR description would make a good starting point for `docs/new_model.md` to (at least partially?) address #1038 / #1041, in conjunction with some explanatory text.
```bash
# wget artifacts here

# Run with direct reference to artifacts
python torchchat.py generate \
  --prompt "Write a python function to sort numbers and strings with numeric prefixes" \
  --checkpoint-path $HOME/models/ibm-granite/granite-3b-code-instruct-128k/model.pth \
  --tokenizer-path $HOME/models/ibm-granite/granite-3b-code-instruct-128k/tokenizer.json \
  --params-path torchchat/model_params/Granite-3B-Code.json
```
Explain how to add the model to the model list....
```bash
# Run with alias
python torchchat.py generate granite-code \
  --prompt "Write a python function to sort numbers and strings with numeric prefixes"
```
If added to `.ci/scripts/run-docs new_model`, it might also make a test case for the features used in Granite.
@Jack-Khuu I'm a bit stumped trying to get the 8B model working. I'm trying to mentally diff the Attention implementation in torchchat vs transformers to see if I can find anything that would indicate something behaving differently with Grouped Query Attention. I'm not really following the way the `torchchat` version manipulates the tensors for tensor parallel inference (I need to do some background reading there), but this feels like it's got to be close to the root of the issue. The only other place I could imagine things going wrong is in the unpacking of the unified `wqkv` here. Any insight you can offer would be much appreciated!
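To make the `wqkv` concern concrete, here's a toy version of the kind of split I mean (shapes and names are illustrative, not torchchat's actual code):

```python
# Toy illustration of unpacking a fused wqkv projection under grouped query
# attention. If the split sizes or per-head layout assumed here don't match how
# the checkpoint was packed, a GQA model (n_kv_heads < n_heads) breaks while an
# MHA model can still look fine.
import torch

dim, n_heads, n_kv_heads = 4096, 32, 8
head_dim = dim // n_heads

q_size = n_heads * head_dim       # 4096
kv_size = n_kv_heads * head_dim   # 1024

# Fused projection producing [q | k | v] stacked along the output dimension.
wqkv = torch.randn(q_size + 2 * kv_size, dim)

wq, wk, wv = wqkv.split([q_size, kv_size, kv_size], dim=0)
assert wq.shape == (q_size, dim)
assert wk.shape == (kv_size, dim) and wv.shape == (kv_size, dim)
```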
Thanks for the details @gabe-l-hart. I'll try to give it a gander this weekend. It's weird that the 3B works but the 8B doesn't. I assume they use the same template, so that at least clears that part.
Looks like this has been open for several weeks now. Yeah, the template handling is super hacky right now, and I knew it was going to hang up our ability to add new models. In general, we need to make a smoother path for adding new models with different architectures, templates, and storage locations.
It's been on @varunfb and @Jack-Khuu's plate for a while, but they've been swamped with other work. Fortunately it's planning season and the design for this is on the list. @gabe-l-hart, would love to get your feedback on how best to support folks like yourself.
Thanks @byjlw! I definitely understand juggling priorities. The path to adding new models via `model_params` and `model_config` is relatively straightforward (it could use a doc, but TBH I never read docs anyway, so easy-to-read code is always best). The real challenge has come up in the places where the models differ from the `llama` series models. In particular, Granite Code uses the `llama` architecture, but relies on several optional bits that the Meta Llama models don't (e.g. HF tokenizers, tied embeddings). Getting these pieces to work has been a decently steep learning curve (fun though!). I think the thing that would be most helpful would be some kind of compatibility matrix doc that shows which architectures have support, the sub-features within those architectures, and which "layers" they're supported in (e.g. python, c++, executorch). This would help a lot in figuring out where to dig in to add new model support.
For the specific issues with Granite Code, the place I'm a bit stuck is trying to figure out why the 8B model is flopping while the 3B model works just fine. My gut is that it has something to do with the alternate attention mechanism in TC, but I'm not versed deeply enough in attention to spot it quickly. The only architectural difference between the 3B and the 8B is the use of grouped query attention, so either it's something there, or there's some incompatibility between the attention implementations in `transformers` and TC that's only being exercised by the specific weights of the 8B. Any help and/or expert bug spotting would be much appreciated!
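For reference, the usual GQA expansion step looks something like the following; this is a generic sketch, not a claim about either implementation's exact code:

```python
# Generic sketch of expanding grouped k/v heads to match the number of query
# heads. If two implementations disagree on this step (or on head ordering),
# only models with n_kv_heads < n_heads will expose the difference.
import torch

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    """(batch, n_kv_heads, seq, head_dim) -> (batch, n_kv_heads * n_rep, seq, head_dim)"""
    if n_rep == 1:
        return x
    bsz, n_kv_heads, seq_len, head_dim = x.shape
    x = x[:, :, None, :, :].expand(bsz, n_kv_heads, n_rep, seq_len, head_dim)
    return x.reshape(bsz, n_kv_heads * n_rep, seq_len, head_dim)

k = torch.randn(1, 8, 16, 128)            # 8 kv heads
k_expanded = repeat_kv(k, n_rep=32 // 8)  # match 32 query heads
assert k_expanded.shape == (1, 32, 16, 128)
```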
I just rebased on `main` and it now looks like even the 3b model is producing only a single token as output in `chat` mode. Will try to get to the bottom of it.
> I just rebased on `main` and it now looks like even the 3b model is producing only a single token as output in `chat` mode. Will try to get to the bottom of it.
Have you tried bisecting the 3B fail? Even if the change was legit and necessary, the type of change that would break the 3B model might give insight into how to "fix" both the 3B and 8B models?
> The real challenge has come up in the places where the models differ from the `llama` series models. In particular, Granite Code uses the `llama` architecture, but relies on several optional bits that the Meta Llama models don't (e.g. HF tokenizers, tied embeddings). Getting these pieces to work has been a decently steep learning curve (fun though!).
I'm a bit surprised by this because ChatGPT had this to say (understanding that I'm quoting ChatGPT about an IBM model to an IBMer, so skating on seriously thin ice!!!):
> what tokenization scheme does the ibm granite model use
>
> Searched 4 sites. The IBM Granite models, including its base and instruction-tuned variants, utilize the Llama2 tokenizer for tokenization. This choice aligns with the models' architectural similarity to Meta's Llama2 series, such as the Granite-7b model, which follows the Llama2-7B architecture and employs similar tokenization strategies. These tokenizers are designed to handle diverse data sources, including programming languages and natural language, ensuring compatibility and efficiency in tasks like code synthesis and language understanding.
So in theory, SentencePiece should do the trick? Is it the pre- and post-processing with regexps? (I think I saw some discussion about regexps in one of your PRs or issues?)
In any event, it's cool that we have HF tokenizers because they are a proper superset of SentencePiece+TikToken. (I think @lessw2020 and @kwen2501 had also added some HF tokenizer support for distributed if I remember correctly?)
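For anyone following along, loading a `tokenizer.json` directly with the `tokenizers` library is about this simple (the path is a placeholder):

```python
# Basic round-trip with an HF `tokenizers` tokenizer.json; the path is a placeholder.
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")
encoding = tok.encode("def quicksort(arr):")
print(encoding.ids)
print(tok.decode(encoding.ids))
```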
> Have you tried bisecting the 3B fail? Even if the change was legit and necessary, the type of change that would break the 3B model might give insight into how to "fix" both the 3B and 8B models?
That's on my todo list for my next chunk of uninterrupted dev time! I'm hoping that will be today.
> I'm a bit surprised by this because ChatGPT had this to say (understanding that I'm quoting ChatGPT about an IBM model to an IBMer, so skating on seriously thin ice!!!):
Heh, as you know I'm sure, IBM is a big place, so I'm definitely doing a lot of learning myself in this space. My info from the models team is that we've been using the `starcoder` tokenizer up until now (including Granite Code and the Granite 3.0 series). When first trying to understand how best to support that in `torchchat`, I was missing a lot of knowledge about `sentencepiece`, so was working off of the `tokenizer_config.json` in HF. I suspect it would be possible to reverse-convert from `tokenizers` back to `sentencepiece` for this config, but I haven't done that work yet since I was already halfway down the rabbit hole of `tokenizers` support. We can certainly look into that as an alternative approach if the preference is to avoid the complexity of the c++ tokenizer buildout.
@Jack-Khuu @mikekgfb @byjlw I figured out where the issues were coming from. It was two things:

1. A `bos` token at the beginning of the sequence, which the 3b model was sometimes ok with, but the 8b never was. To solve this, I added `tokenizer_prepend_bos` as a parameter in `TransformerArgs` and `ModelArgs`. It seemed a little klunky to plumb it through multiple blobs, but this got things working for both models with raw generation (conceptually it boils down to the sketch just after this list).
2. `llama2` and `llama3` templating being applied to models that expect their own chat template.
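Conceptually, the `tokenizer_prepend_bos` flag boils down to something like this (the function and tokenizer interface are illustrative stand-ins, not the actual torchchat plumbing):

```python
# Illustrative only: bos-prepending made configurable at prompt-encoding time.
# The tokenizer interface here is a stand-in, not torchchat's actual API.
from typing import List

def encode_prompt(tokenizer, text: str, tokenizer_prepend_bos: bool = True) -> List[int]:
    tokens = tokenizer.encode(text)
    if tokenizer_prepend_bos:
        tokens = [tokenizer.bos_id()] + tokens
    return tokens
```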
Solving this resulted in a fair bit of refactoring:

* Refactored `Llama2ChatFormatter` and `Llama3ChatFormatter` to encapsulate all logic in a single abstract method `encode_dialog_prompt`, which is what `def chat` now uses
* Added an `HFTokenizerChatFormatter` that does the templating with `jinja2` through `HFTokenizer` (`jinja2` was already a transitive dependency, so I just formalized it)

To get to the bottom of all of this, I also tweaked the logging a bit. There was already a hard-coded logging config call in `cli`, so I just added the ability to parse the `LOG_LEVEL` env var to set it. I also added a fair number of additional log lines and uncommented some that were there but commented out.
NOTE: Many of the existing log lines were using f-strings, which cause the string to be interpolated regardless of whether the logger/level is enabled. I switched all of these to use lazy %-style interpolation so that it's safe to have them uncommented without a performance hit.
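For illustration, the difference between the two styles:

```python
import logging

logger = logging.getLogger(__name__)
tokens = [1, 2, 3]  # stand-in value

# Eager: the f-string is built even when DEBUG is disabled.
logger.debug(f"decoded tokens: {tokens}")

# Lazy %-style: the message is only interpolated if the record is actually emitted.
logger.debug("decoded tokens: %s", tokens)
```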
Finally, I was getting lost trying to make sure I didn't break anything in the chat templating, so I bit the bullet and added some basic unit tests. They only cover the chat formatting, but they're a place to start. I did not go any further with unit testing, including not adding `pytest` as a dependency or adding any CI steps to invoke the tests. If you're interested, I'd be happy to push on unit testing, but I didn't want to lump that conversation into this PR.
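To give a flavor of the kind of test I mean (the template and expected strings here are stand-ins, not the project's real templates):

```python
# Sketch of a focused chat-formatting unit test; the template and expected output
# are illustrative stand-ins rather than the real Granite/llama templates.
import jinja2

CHAT_TEMPLATE = (
    "{% for message in messages %}<|{{ message.role }}|>{{ message.content }}{% endfor %}"
)

def render(messages):
    env = jinja2.Environment(undefined=jinja2.StrictUndefined)
    return env.from_string(CHAT_TEMPLATE).render(messages=messages)

def test_single_user_turn():
    assert render([{"role": "user", "content": "hi"}]) == "<|user|>hi"

def test_multi_turn_preserves_order():
    out = render([
        {"role": "user", "content": "a"},
        {"role": "assistant", "content": "b"},
    ])
    assert out == "<|user|>a<|assistant|>b"
```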
Thanks @gabe-l-hart. Yeah, a lot of this feedback resonates really well with me, and resolving it has already made it onto our H1 roadmap: making it easy to have model-specific templates, adding test infra and guidelines around tests, and abstracting the code into a more modular shape so that there is a specific core module with well-defined APIs that the CLI and API can use. We will also figure out the release strategy and publish two or three specific pip packages.
Will be able to share the details soon and will have them as RFCs on GH so everyone can comment and contribute.
Dependencies
This PR is part of a sequence in support of adding Granite Code. It depends on merging the following PRs:
Issues
Closes #1262
Description
This PR adds support for Granite Code in the 3B and 8B sizes. Given current limitations with the export of tokenizers, the models will only work in the python environment with this PR.
Discussion
Usage
To test these models, I ran them both with the aliases and by pointing directly at the checkpoint/tokenizer:
Open Questions
There are several outstanding issues, beyond the upstream tokenizers PR, that need to be solved before this PR is ready for full review:
* In `chat` mode, the models produce very unreliable results, sometimes generating a single token while other times generating a reasonable result but stopping mid-sentence before reaching the max token limit. My current hypothesis is that the chat template is not being used anywhere and we're therefore using the `llama` chat template automatically.