triton-inference-server / client

Triton Python, C++ and Java client libraries, and GRPC-generated client examples for go, java and scala.

Set add special tokens to false #672

Closed IzzyPutterman closed 1 month ago

dyastremsky commented 1 month ago

Thanks for this pull request! Why is this change needed?

CC: @debermudez Should this be configurable or always false?

IzzyPutterman commented 1 month ago

> Thanks for this pull request! Why is this change needed?
>
> CC: @debermudez Should this be configurable or always false?

For the Llama 3 tokenizer, it will always add an extra token to the start (that is not the BOS token). This makes the output token count 2x larger and the ITL 2x smaller.
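
For context, a minimal sketch of the behavior this is working around, assuming a Hugging Face `AutoTokenizer` (the checkpoint name is only illustrative and requires access to the tokenizer files):

```python
from transformers import AutoTokenizer

# Illustrative checkpoint; any tokenizer that prepends special tokens shows the same effect.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

chunk = "Hello"  # stands in for a single streamed response chunk

# Default behavior: the tokenizer prepends its special start token on every encode call.
default_ids = tokenizer(chunk)["input_ids"]

# With add_special_tokens=False, only the tokens of the text itself are returned,
# which is what we want when counting tokens per streamed chunk.
plain_ids = tokenizer(chunk, add_special_tokens=False)["input_ids"]

print(len(default_ids), len(plain_ids))  # e.g. 2 vs 1 for a one-token chunk
```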

dyastremsky commented 1 month ago

> Thanks for this pull request! Why is this change needed?
>
> CC: @debermudez Should this be configurable or always false?
>
> For the Llama 3 tokenizer, it will always add an extra token to the start (that is not the BOS token). This makes the output token count 2x larger and the ITL 2x smaller.

Thanks for finding this and explaining!

The testing is failing after these changes. We can't merge unless testing passes. I just rebased the branch -- we'll see if that resolves the issue.

IzzyPutterman commented 1 month ago

I am probably missing something, but the failing test is in llm-inputs, which doesn't make any calls to llm-metrics. I think the CI has a separate validation issue in downloading a dataset (I see a 500 error).

dyastremsky commented 1 month ago

> I am probably missing something, but the failing test is in llm-inputs, which doesn't make any calls to llm-metrics. I think the CI has a separate validation issue in downloading a dataset (I see a 500 error).

CC: @debermudez for reference. This was passing for the latest PR. I think he ran into a similar issue and is working on something that could be helpful here.

nv-hwoo commented 1 month ago

@IzzyPutterman Sorry for checking this late. Should this be taken into account when generating prompts with SyntheticPromptGenerator?

IzzyPutterman commented 1 month ago

It might be nice to add it there, but I wouldn't expect any big changes: there we tokenize the full prompt at once, so we only add 1 token to the full count, while in the ITL calculation we are adding 1 token to every chunk (instead of to the full thing).
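
A rough sketch of that arithmetic, using a stub tokenizer (the chunk list and the one-token-per-word tokenizer are purely illustrative):

```python
def count_tokens(text, add_special_tokens=True):
    """Stand-in for a real tokenizer: one token per word, plus one
    prepended special token when add_special_tokens is True."""
    n = len(text.split())
    return n + 1 if add_special_tokens else n

# Hypothetical streamed response made of single-token chunks.
chunks = ["The", "cat", "sat", "on", "the", "mat"]

# Tokenizing the full response once: only one special token is added in total.
full_count = count_tokens(" ".join(chunks))                      # 6 + 1 = 7

# Tokenizing each streamed chunk separately with special tokens enabled adds
# one extra token per chunk, roughly doubling the count for single-token
# chunks and therefore roughly halving the computed ITL.
per_chunk_default = sum(count_tokens(c) for c in chunks)         # 6 + 6 = 12
per_chunk_fixed = sum(count_tokens(c, add_special_tokens=False)
                      for c in chunks)                           # 6
```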