openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.

new method to truncate after N tokens #236

Closed · aleks-sch closed this 10 months ago

aleks-sch commented 10 months ago

hi there!

issue description/motivation

We have a use case where we would like to quickly truncate an input string to the longest prefix that fits within N tokens.

The naive alternative, repeatedly counting tokens for a growing substring, is unreliable: including the next few characters can actually bring the token count down, because BPE may merge them into fewer tokens.

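To make that concrete, here is a minimal sketch (the encoding name and sample text are illustrative; exact counts depend on the encoding) that prints the token count for each prefix of a string. The counts need not grow one-for-one with the characters, which is why prefix-by-prefix counting is an unreliable way to truncate:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Token counts for growing prefixes are not guaranteed to be
# monotonic: a longer prefix can merge into fewer tokens under BPE.
text = "really long input string..."
for i in range(1, len(text) + 1):
    prefix = text[:i]
    print(f"{prefix!r}: {len(enc.encode(prefix))} tokens")
```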

proposed approach

I am keen to put together a PR for this: I am interested in writing more Rust and it feels doable. Making the change in the Rust code would save us from performing multiple calls to Encoding.encode to count tokens for growing substrings.

The suggested approach is to follow the signature of Encoding.decode_with_offsets: https://github.com/openai/tiktoken/blob/9e79899bc248d5313c7dd73562b5e211d728723d/tiktoken/core.py#L279-L302

e.g.

really_long_str = "really long input string..."
encoding.truncate_after_n_tokens(inp=really_long_str, n_tokens=2)
"really long", [54760, 1317]

specific questions

thanks!

apouliotfigma commented 10 months ago

Why not tokenize the string, then truncate the list of tokens to [0,N) instead?
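That suggestion in code, as a minimal sketch (the helper name and sample inputs are illustrative, not part of tiktoken's API): encode once, slice the token list to [0, N), and decode the kept tokens back to a string. One caveat: a token boundary can fall inside a multi-byte character, and Encoding.decode uses errors="replace" by default, so the truncated string may end in U+FFFD rather than raising.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def truncate_to_n_tokens(text: str, n: int) -> tuple[str, list[int]]:
    """Illustrative helper: keep at most n tokens of text.

    Encodes once, slices the token list, and decodes the slice.
    """
    tokens = enc.encode(text)[:n]
    return enc.decode(tokens), tokens

print(truncate_to_n_tokens("really long input string...", 2))
```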

aleks-sch commented 10 months ago

thanks - that is super simple and solves my problem!

closing the issue