openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.

new method to truncate after N tokens #236

Closed · aleks-sch closed this 10 months ago

aleks-sch commented 10 months ago

hi there!

issue description/motivation

We have a use case where we would like to quickly truncate an input string to the longest prefix that fits within N tokens.

The naive alternative, repeatedly counting tokens for a growing substring, is unreliable: including the next few characters can actually bring the token count down, because BPE may merge them into fewer tokens.

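To make that concrete, here is a minimal sketch (the encoding name and sample text are illustrative; exact counts depend on the encoding) that prints the token count for each prefix of a string. The counts need not grow one-for-one with the characters, which is why prefix-by-prefix counting is an unreliable way to truncate:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Token counts for growing prefixes are not guaranteed to be
# monotonic: a longer prefix can merge into fewer tokens under BPE.
text = "really long input string..."
for i in range(1, len(text) + 1):
    prefix = text[:i]
    print(f"{prefix!r}: {len(enc.encode(prefix))} tokens")
```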

proposed approach

I am keen to put together a PR for this: I am interested in writing more Rust and it feels doable. Making the change in the Rust code would save us from performing multiple calls to Encoding.encode to count tokens for growing substrings.

The suggested approach is to follow the signature of Encoding.decode_with_offsets: https://github.com/openai/tiktoken/blob/9e79899bc248d5313c7dd73562b5e211d728723d/tiktoken/core.py#L279-L302

e.g.

really_long_str = "really long input string..."
encoding.truncate_after_n_tokens(inp=really_long_str, n_tokens=2)
"really long", [54760, 1317]

specific questions

thanks!

apouliotfigma commented 10 months ago

Why not tokenize the string, then truncate the list of tokens to [0,N) instead?
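That suggestion in code, as a minimal sketch (the helper name and sample inputs are illustrative, not part of tiktoken's API): encode once, slice the token list to [0, N), and decode the kept tokens back to a string. One caveat: a token boundary can fall inside a multi-byte character, and Encoding.decode uses errors="replace" by default, so the truncated string may end in U+FFFD rather than raising.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def truncate_to_n_tokens(text: str, n: int) -> tuple[str, list[int]]:
    """Illustrative helper: keep at most n tokens of text.

    Encodes once, slices the token list, and decodes the slice.
    """
    tokens = enc.encode(text)[:n]
    return enc.decode(tokens), tokens

print(truncate_to_n_tokens("really long input string...", 2))
```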

aleks-sch commented 10 months ago

thanks - that is super simple and solves my problem!

closing the issue