simonw / ttok

Count and truncate text based on tokens
Apache License 2.0

Option to turn token integers back into text #7

Closed simonw closed 1 year ago

simonw commented 1 year ago

The opposite of this:

echo "Show these tokens" | ttok --tokens
# Outputs: 7968 1521 11460 198

simonw commented 1 year ago

Potential names for this:

I think I like --decode the most. It maps to the underlying .decode() method.

I could add --encode as an alias for --tokens for added consistency.

simonw commented 1 year ago

This should support space-separated and comma-separated input, as well as JSON arrays of integers.

I think I'll just use a \d+ regular expression to parse integers out of the input.
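
A minimal sketch of why a single \d+ pattern is enough to cover all three input formats: it collects each run of digits and ignores whatever separates them. The helper name parse_token_ids is hypothetical and only used for illustration; it is not part of ttok.

```python
import re


def parse_token_ids(text: str) -> list[int]:
    """Return every run of digits in text as an int, ignoring separators.

    Hypothetical helper for illustration; not part of ttok itself.
    """
    return [int(match) for match in re.findall(r"\d+", text)]


# The same token IDs written three different ways
space_separated = "7968 1521 11460 198"
comma_separated = "7968,1521,11460,198"
json_array = "[7968, 1521, 11460, 198]"

for raw in (space_separated, comma_separated, json_array):
    print(parse_token_ids(raw))
```

One trade-off of this approach is that it accepts anything containing digits: a minus sign is dropped and "1.5" is read as two integers, 1 and 5. That seems acceptable here, since token IDs are non-negative integers.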

simonw commented 1 year ago

Got GPT-4 to write this: https://chat.openai.com/share/a3c5da38-bfd0-423d-af7e-dbed7bfe5278

@click.option("--decode", "decode", is_flag=True, help="Decode token integers to text")

# ...

    if decode:
        # Use regex to find all integers in the input text
        tokens = [int(t) for t in re.findall(r'\d+', text)]
        decoded_text = encoding.decode(tokens)
        click.echo(decoded_text)

simonw commented 1 year ago

I needed this to help test:

simonw commented 1 year ago

Oops, did that work on the wrong branch.

simonw commented 1 year ago

$ ttok --tokens show me the tokens
3528 757 279 11460
$ ttok --encode show me the tokens
3528 757 279 11460
$ ttok --decode 3528 757 279 11460
show me the tokens