openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.

new method to truncate after N tokens #236

Closed aleks-sch closed 6 months ago

aleks-sch commented 6 months ago

hi there!

issue description/motivation

We have a use case where we would like to quickly truncate an input string to at most N tokens, keeping as much of the string as possible.

Repeatedly counting tokens for growing substrings doesn't work well, because including the next few characters can actually bring the token count down.


proposed approach

I am keen to put together a PR to do this - I am interested in writing more Rust and it feels doable. Making the change in the Rust code would save us from performing multiple calls to Encoding.encode to count tokens for growing substrings.

The suggested approach would follow the signature of Encoding.decode_with_offsets: https://github.com/openai/tiktoken/blob/9e79899bc248d5313c7dd73562b5e211d728723d/tiktoken/core.py#L279-L302

e.g.

```python
really_long_str = "really long input string..."
encoding.truncate_after_n_tokens(inp=really_long_str, n_tokens=2)
# -> ("really long", [54760, 1317])
```

specific questions

thanks!

apouliotfigma commented 6 months ago

Why not tokenize the string, then truncate the list of tokens to [0,N) instead?
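A minimal sketch of that workaround, assuming the cl100k_base encoding:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode("really long input string...")
truncated = enc.decode(tokens[:2])  # keep only the first N = 2 tokens
```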

aleks-sch commented 6 months ago

thanks - that is super simple and solves my problem!

closing the issue