@tengomucho
You are right, the right parameter to use is `MAX_INPUT_TOKENS`; I mixed things up. My point is: what do you want to achieve by adding this to the cli in the text generation server? For now, IIRC, this value is used in the launcher to define the maximum number of input tokens that can be passed from the router to the server; the server itself does not use it. It is fine to add it to the cli, but to be effective you will also need to add it to the `serve` function and do something with it, otherwise it will have no effect.
We only require `MAX_TOTAL_TOKENS` and `MAX_BATCH_PREFILL_TOKENS`, but looking deeper into the TGI v2.2.0 launcher's `main.rs`, we found that the launcher always passes the `--max-input-tokens` argument to the `serve` function. This is why the `serve` function in `cli.py` fails with `Error: No such option: --max-input-tokens rank=0`. So how can we handle this `--max-input-tokens` parameter inside `cli.py`?
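For illustration, here is a minimal sketch of what accepting the option in `cli.py` could look like, assuming `serve` is defined as a typer command the way TGI's reference python server does it. `run_server` and the help text are placeholders rather than the actual optimum-tpu API; declaring the option is what lets the launcher's invocation parse, and forwarding the value is what would eventually make it effective.

```python
from typing import Optional

import typer

app = typer.Typer()


def run_server(model_id: str, max_input_tokens: Optional[int] = None) -> None:
    # Placeholder for the real serve entry point; the actual server would use
    # max_input_tokens, e.g. to bound the prefill length per request.
    print(f"serving {model_id} with max_input_tokens={max_input_tokens}")


@app.command()
def serve(
    model_id: str,
    # Declaring the option lets `text-generation-server serve ... --max-input-tokens N`
    # parse cleanly instead of failing with "No such option".
    max_input_tokens: Optional[int] = typer.Option(None, help="Maximum number of input tokens per request."),
):
    # Forwarding (and actually using) the value is what makes the flag effective.
    run_server(model_id, max_input_tokens=max_input_tokens)


if __name__ == "__main__":
    app()
```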
We could do that @Bihan, but that would mean we would end up with code that diverges more from the original transformers code. I was suggesting staying as close as possible to their implementation to simplify maintenance: if there is a new update to transformers to support a new feature or fix a bug, it will be easier to pick it up if the optimum-tpu code stays similar.
I understand your point about maintaining alignment with the original transformers code to simplify updates and maintenance. After reviewing the changes in `modeling_llama.py` from Transformers v4.43.4, I believe the best approach might be to create a new file, `optimum/tpu/modeling_llama.py`, specifically tailored for TPU.

In the meantime, as a workaround for the `rope_scaling` issue, is there a way to load a custom `rope_scaling` configuration instead of relying on the default `rope_scaling` in Llama 3.1's `config.json`?
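One possible workaround sketch (not something verified in this PR): download the checkpoint locally, rewrite the `rope_scaling` field in `config.json`, and point the loader at the local copy instead of the hub id, so the modeling code never sees the llama3-style entry it cannot validate. The model id, target directory, and replacement values below are illustrative.

```python
import json
from pathlib import Path

from huggingface_hub import snapshot_download

# Download the checkpoint into a local directory so config.json can be edited
# before anything tries to load it (model id and directory are illustrative).
local_dir = Path(snapshot_download("meta-llama/Meta-Llama-3.1-8B", local_dir="llama-3.1-8b-local"))

config_path = local_dir / "config.json"
config = json.loads(config_path.read_text())

# Swap the llama3-style rope_scaling entry for an older-style one that the
# current modeling code accepts (values are illustrative, not tuned).
config["rope_scaling"] = {"type": "linear", "factor": 8.0}
config_path.write_text(json.dumps(config, indent=2))

# from_pretrained / the TGI server can then be pointed at local_dir
# instead of the hub model id.
```

Whether the replacement scaling values are appropriate for Llama 3.1's extended context is a separate question; this only sidesteps the config validation error.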
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
This PR is being closed as it does not contain the recent changes from the main branch. A new PR has been created as a replacement.
FYI @Bihan, next time you can just rebase onto the main branch and force-push:
```bash
git checkout main
git pull                # update local main with the latest upstream changes
git checkout mybranch
git rebase main         # replay your branch's commits on top of main
# resolve conflicts, if any, then continue the rebase
git push --force
```
This way you do not need to open a new PR 🤗
What does this PR do?
This PR adds Llama 3.1 8B support. Please refer to the PR for discussion history.
Fixes