huggingface / optimum-tpu

Google TPU optimizations for transformers models
Apache License 2.0

Add Llama 3.1 support #87

Closed Bihan closed 2 months ago

Bihan commented 3 months ago

What does this PR do?

This PR adds Llama 3.1 8B support. Please refer to the PR for discussion history.


Bihan commented 3 months ago

@tengomucho

  1. Regarding our discussion REF

You are right, the correct parameter to use is "MAX_INPUT_TOKENS"; I mixed things up. My point is: what do you want to achieve by adding this to the CLI in the text generation server? For now, IIRC, this value is used by the launcher to define the maximum number of input tokens that can be passed from the router to the server; the server itself does not use it yet. It is OK to add it to the CLI, but to be effective you will also need to add it to the serve function and do something with it, otherwise it will have no effect.

We only require MAX_TOTAL_TOKENS and MAX_BATCH_PREFILL_TOKENS, but looking deeper into the TGI v2.2.0 launcher's main.rs, we found that the launcher always passes the --max-input-tokens argument to the serve function. This is why our cli.py serve function fails with: Error: No such option: --max-input-tokens rank=0

So how can we handle this --max-input-tokens parameter inside cli.py?
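One possible answer (a minimal sketch only, not the actual optimum-tpu cli.py and not what this PR implements) is to simply accept the option in the typer-based serve command so the launcher's invocation stops failing, even while the server ignores the value. The parameter names other than max_input_tokens and the overall serve signature below are illustrative assumptions.

# Hedged sketch: a typer-based `serve` command, loosely modeled on TGI's
# text_generation_server/cli.py, that accepts --max-input-tokens so the
# launcher's call no longer raises "No such option".
from typing import Optional

import typer

app = typer.Typer()


@app.command()
def serve(
    model_id: str,
    revision: Optional[str] = None,
    # typer exposes this parameter as a --max-input-tokens CLI option.
    max_input_tokens: Optional[int] = None,
    uds_path: str = "/tmp/text-generation-server",
):
    if max_input_tokens is not None:
        # Accepted but currently unused; to be effective, the value would have
        # to be forwarded to the server/generator, e.g. as a cap on prefill length.
        typer.echo(f"max_input_tokens={max_input_tokens} (accepted, not yet used)")
    # ... start the gRPC server here, as the real serve implementation does ...


if __name__ == "__main__":
    app()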

  2. Regarding our discussion REF

We could do that @Bihan, but that would mean we end up with code that diverges further from the original transformers code. I was suggesting staying as close as possible to their implementation to simplify maintenance: if transformers is updated to support a new feature or fix a bug, it will be easier to pick up the change if the optimum-tpu code stays similar.

I understand your point about staying aligned with the original transformers code to simplify updates and maintenance. After reviewing the changes in modeling_llama.py from Transformers v4.43.4, I believe the best approach might be to create a new file, optimum/tpu/modeling_llama.py, specifically tailored for TPU.

In the meantime, as a workaround to address the rope_scaling issue, is there a way to load a custom rope_scaling configuration instead of relying on the default rope_scaling in LLaMA 3.1’s config.json?
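One possible workaround (a hedged sketch only, not something this PR or optimum-tpu provides) is to patch rope_scaling in the raw config dict before building the config object, so that an older LlamaConfig validation sees a plain {"type", "factor"} entry. The repo id and the replacement scaling values below are illustrative, and this only sidesteps the validation error: it does not reproduce Llama 3.1's actual "llama3" rope scaling, so long-context behavior may differ.

# Hedged workaround sketch: override Llama 3.1's rope_scaling with a format
# that pre-4.43 LlamaConfig validation accepts, then load the model with the
# patched config.
import json

from huggingface_hub import hf_hub_download
from transformers import AutoModelForCausalLM, LlamaConfig

model_id = "meta-llama/Meta-Llama-3.1-8B"  # illustrative repo id

# Fetch only config.json and load it as a plain dict.
config_path = hf_hub_download(repo_id=model_id, filename="config.json")
with open(config_path) as f:
    config_dict = json.load(f)

# Replace the "llama3" rope_scaling block with a {"type", "factor"} dict;
# note this changes the scaling behavior compared to the original config.
config_dict["rope_scaling"] = {"type": "dynamic", "factor": 8.0}

config = LlamaConfig(**config_dict)
model = AutoModelForCausalLM.from_pretrained(model_id, config=config)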

HuggingFaceDocBuilderDev commented 3 months ago

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Bihan commented 2 months ago

This PR is being closed because it does not contain the recent changes from the main branch. A new PR has been created as a replacement.

tengomucho commented 2 months ago

FYI @Bihan, next time you can just rebase onto the main branch and force-push:

git checkout main       # switch to the local main branch
git pull                # update it with the latest upstream changes
git checkout mybranch   # go back to the feature branch
git rebase main         # replay its commits on top of main
# resolve conflicts, if any
git push --force        # update the existing PR branch in place

This way you do not need to open a new PR 🤗