simonw / ttok

Count and truncate text based on tokens
Apache License 2.0
248 stars 7 forks

Options for special token handling #13

Closed grantjenks closed 3 months ago

grantjenks commented 5 months ago

Would be nice to have options for special token handling. I got this error today:

ValueError: Encountered text corresponding to disallowed special token '<|endoftext|>'.
If you want this text to be encoded as a special token, pass it to `allowed_special`, e.g. `allowed_special={'<|endoftext|>', ...}`.
If you want this text to be encoded as normal text, disable the check for this token by passing `disallowed_special=(enc.special_tokens_set - {'<|endoftext|>'})`.
To disable this check for all special tokens, pass `disallowed_special=()`.

simonw commented 3 months ago

I just ran across this myself: https://twitter.com/simonw/status/1786172033245597997

ttok '<|endoftext|>' --encode

I'm going to add a --allow-special option.

simonw commented 3 months ago

Filed this related issue:

simonw commented 3 months ago

ttok '<|endoftext|>' --encode --allow-special
100257
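
For anyone calling tiktoken directly, here is a minimal sketch of how a flag like --allow-special could map onto the `allowed_special` argument named in the error message above. The helper names `special_token_kwargs` and `encode` are hypothetical, and whether ttok wires the option this way internally is an assumption; the `cl100k_base` default is also just an illustrative choice.

```python
from typing import Any, Dict, List


def special_token_kwargs(allow_special: bool) -> Dict[str, Any]:
    """Build keyword arguments for tiktoken's Encoding.encode().

    Hypothetical helper: how ttok handles its flag internally is an assumption.
    """
    if allow_special:
        # "all" lets every special token found in the text encode to its reserved ID.
        return {"allowed_special": "all"}
    # No overrides: tiktoken's default raises ValueError on special-token text.
    return {}


def encode(text: str, allow_special: bool = False,
           encoding_name: str = "cl100k_base") -> List[int]:
    """Encode text, optionally permitting special tokens."""
    import tiktoken  # imported lazily so the helper above has no dependencies

    encoding = tiktoken.get_encoding(encoding_name)
    return encoding.encode(text, **special_token_kwargs(allow_special))
```

With tiktoken installed, `encode("<|endoftext|>", allow_special=True)` should give `[100257]` under `cl100k_base`, matching the output above, while the default call raises the ValueError from the original report. Passing `disallowed_special=()` instead, as the error message suggests, would encode the string as ordinary text.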