simonw / ttok

Count and truncate text based on tokens
Apache License 2.0

Mechanism for splitting rather than truncating #3

Open simonw opened 1 year ago

simonw commented 1 year ago

Suggestion from:

simonw commented 1 year ago

Proposed design from #2 is this:

cat big.txt | ttok --truncate 100 --split splitted.txt

This would produce files called:

splitted.txt.part0
splitted.txt.part1
splitted.txt.part2

Etc, up to the total number of parts.

I like the idea, but I'm not sold on that particular design (cc @c4pt0r)

simonw commented 1 year ago

A related challenge to this is recursive summarization - where you want to summarize a large document, so you split it into smaller sections, summarize each of those and then perform a summary against those summaries.

Doing that well is difficult - I've not yet figured out my own preferred pattern for it. I believe it ends up requiring a bit of token overlap between the sections, to avoid problems like a sentence that was cut off halfway through being summarized incorrectly.

More about that pattern here: https://twitter.com/hwchase17/status/1587458160297533440

I'm not sure if implementing helpers for that would fit in the ttok tool or not - but worth mentioning here.
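The overlap idea could look something like this - just a sketch over raw token IDs, with a made-up function name, not anything that exists in ttok today:

```python
def chunks_with_overlap(tokens, size, overlap):
    # Yield chunks of up to `size` token IDs, each sharing `overlap`
    # tokens with the previous chunk, so a sentence cut off at a
    # chunk boundary still appears whole in the next chunk.
    step = size - overlap
    for i in range(0, len(tokens), step):
        yield tokens[i : i + size]
        if i + size >= len(tokens):
            break
```

The chunks would then be decoded back to text with the model's encoding before being passed to the summarization prompt.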

simonw commented 1 year ago

There are two alternative output schemes that I think are worth considering here:

  1. JSON. An option to output a JSON array of chunks, each fitting the desired size, would be neat.
  2. Maybe some kind of mechanism that can spawn multiple follow-on llm tasks? Kind of like how xargs works - or maybe a pattern that can be fed to xargs in a neat way. Not sure what that would look like though.

I'd like to avoid creating temporary files on disk if possible - happy to support that as an option, but I don't like it as a default.
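Option 1 could be a small function along these lines - a sketch only, with `decode` passed in (e.g. tiktoken's `encoding.decode`) and `split_to_json` a made-up name:

```python
import json

def split_to_json(tokens, size, decode):
    # Chunk token IDs into groups of at most `size`, decode each
    # group back to text, then emit a JSON array of the chunks.
    # `decode` is a callable such as tiktoken's encoding.decode.
    chunks = [tokens[i : i + size] for i in range(0, len(tokens), size)]
    return json.dumps([decode(chunk) for chunk in chunks], indent=2)
```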

simonw commented 1 year ago

Here's an option: xargs can work against \0 null-delimited data. So maybe a -0/--null option could cause the split tokenized output to be separated by that, at which point it could be piped into xargs.

Could look something like this:

cat big.txt | ttok --truncate 100 --split --null | xargs -0 llm --system "summarize as bullet points"
simonw commented 1 year ago

Given that design, --split is a flag. It could default to outputting ["...", "..."] JSON, could output \0 delimited text with --null, and could output to files if another option such as --split-file was passed.

simonw commented 1 year ago

I'm not sure xargs on macOS can do what I want it to do here. I got as far as trying this:

curl 'https://simonwillison.net/' | strip-tags > big.txt
ttok < big.txt --split -t 2000 --null | xargs -0 -I{} zsh -c 'echo {} | llm --system "summarize as bullets" -s'

But it clearly wasn't passing each chunk of the data piped to xargs through to the standard input of the llm command.

Here's the diff of my prototype:

diff --git a/ttok/cli.py b/ttok/cli.py
index 73e1652..e7f0641 100644
--- a/ttok/cli.py
+++ b/ttok/cli.py
@@ -1,4 +1,5 @@
 import click
+import json
 import sys
 import tiktoken

@@ -10,9 +11,13 @@ import tiktoken
 @click.option(
     "-t", "--truncate", "truncate", type=int, help="Truncate to this many tokens"
 )
+@click.option("--split", is_flag=True, help="Split text based on truncate argument")
+@click.option(
+    "-0", "--null", is_flag=True, help="Output split text with null byte delimiters"
+)
 @click.option("-m", "--model", default="gpt-3.5-turbo", help="Which model to use")
 @click.option("output_tokens", "--tokens", is_flag=True, help="Output token integers")
-def cli(prompt, input, truncate, model, output_tokens):
+def cli(prompt, input, truncate, split, null, model, output_tokens):
     """
     Count and truncate text based on tokens

@@ -36,6 +41,8 @@ def cli(prompt, input, truncate, model, output_tokens):

         cat input.txt | ttok --tokens
     """
+    if split and not truncate:
+        raise click.ClickException("Cannot use --split without --truncate")
     try:
         encoding = tiktoken.encoding_for_model(model)
     except KeyError as e:
@@ -51,6 +58,17 @@ def cli(prompt, input, truncate, model, output_tokens):
             text = input_text
     # Tokenize it
     tokens = encoding.encode(text)
+
+    if split:
+        output = []
+        for chunk in chunks(tokens, truncate):
+            output.append(encoding.decode(chunk))
+        if null:
+            click.echo("\0".join(output))
+        else:
+            click.echo(json.dumps(output, indent=2))
+        return
+
     if truncate:
         tokens = tokens[:truncate]

@@ -60,3 +78,8 @@ def cli(prompt, input, truncate, model, output_tokens):
         click.echo(encoding.decode(tokens), nl=False)
     else:
         click.echo(len(tokens))
+
+
+def chunks(sequence, n):
+    for i in range(0, len(sequence), n):
+        yield sequence[i : i + n]
simonw commented 1 year ago

This did work though:

ttok < big.txt --split -t 2000 --null | while IFS= read -r -d $'\0' chunk; do
    echo "$chunk" | llm --system 'summarize as bullets' -s
done
simonw commented 1 year ago

I'm trying to figure out how safe null bytes are - can they occur in tokenized text?

I think maybe they can:

echo "\0" | ttok --tokens
188 198

198 is newline - so is 188 the null byte?

If so, then the --null option should strip those out of the token stream.

simonw commented 1 year ago

The token ID for the null byte likely differs for different models.

Trying to figure out where the models come from by reading this code: https://github.com/openai/tiktoken/blob/main/tiktoken_ext/openai_public.py

simonw commented 1 year ago
>>> import tiktoken
>>> encoding = tiktoken.encoding_for_model("gpt2")
>>> encoding.encode("\0")
[188]
>>> encoding.encode("hello world")
[31373, 995]
>>> encoding2 = tiktoken.encoding_for_model("gpt-3.5-turbo")
>>> encoding2.encode("\0")
[188]
>>> encoding2.encode("hello world")
[15339, 1917]

It looks like both those models use 188 for the null byte. I should still look it up in the model before filtering it though.

simonw commented 1 year ago

This is pretty good!

% ttok < big.txt --split -t 2000 --null | while IFS= read -r -d $'\0' chunk; do
    echo "$chunk" | llm --system 'summarize as bullets' -s
done
- Simon Willison has built three command-line tools named llm, ttok, and strip-tags for working with ChatGPT and other LLMs.
- Tokens are how pricing works to work with LLMs and being able to count tokens is essential.
- strip-tags and ttok helps in better ways of working with tokens.
- strip-tags command strips out HTML tags which usually aren't relevant to the prompt you are sending to the model.
- llm command helps in piping content to a model opens up all kinds of fun opportunities.
- Prompt injection remains an unsolved problem and delimiters are easily defeated.
- ChatGPT Prompt Engineering for Developers is an interactive video course presented in partnership with OpenAI which covers prompt engineering.
- OpenAI's attempt to use delimiters to avoid prompt injections is ineffective
- Attackers have many options to confound language models with a sequence of tokens
simonw commented 1 year ago

Bit of a side exploration of hexdump - I figured out how to prove to myself that I was producing a null byte from Python:

% python -c 'import sys; sys.stdout.buffer.write(b"a")' | hexdump -C
00000000  61                                                |a|
00000001
% python -c 'import sys; sys.stdout.buffer.write(b"a\0")' | hexdump -C
00000000  61 00                                             |a.|
00000002

Without the -C option hexdump was showing confusing output because of little-endian vs. big-endian issues:

% python -c 'import sys; sys.stdout.buffer.write(b"a")' | hexdump     
0000000 0061                                   
0000001
% python -c 'import sys; sys.stdout.buffer.write(b"a\0")' | hexdump   
0000000 0061                                   
0000002
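The confusion is that hexdump without -C groups bytes into 16-bit words in host byte order. A quick Python check of the little-endian reading (assuming a little-endian machine, which is what produced the output above):

```python
import struct

# hexdump without -C displays 16-bit words in host byte order.
# On a little-endian host the bytes 61 00 read as the word 0x0061,
# which is why a single "a" shows up as "0061" above.
word = struct.unpack("<H", b"a\x00")[0]
print(f"{word:04x}")  # prints 0061
```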
simonw commented 1 year ago

Now I can test the null byte handling:

% python -c 'import sys; sys.stdout.buffer.write(b"one\0two\0three\0")' | ttok --split -t 2                    
[
  "one\u0000",
  "two\u0000",
  "three\u0000"
]

Without the code I wrote to filter out null bytes when using --null:

% python -c 'import sys; sys.stdout.buffer.write(b"one\0two\0three\0")' | ttok --split -t 2 --null | hexdump -C
00000000  6f 6e 65 00 00 74 77 6f  00 00 74 68 72 65 65 00  |one..two..three.|
00000010  0a                                                |.|
00000011

Note how those null bytes have been doubled up 00 00 in a couple of places, which would then confuse the shell loop:

python -c 'import sys; sys.stdout.buffer.write(b"one\0two\0three\0")' | ttok --split -t 1 --null | \
while IFS= read -r -d $'\0' chunk; do
    echo "Chunk:"
    echo "$chunk" | hexdump -C
done
Chunk:
00000000  6f 6e 65 0a                                       |one.|
00000004
Chunk:
00000000  0a                                                |.|
00000001
Chunk:
00000000  74 77 6f 0a                                       |two.|
00000004
Chunk:
00000000  0a                                                |.|
00000001
Chunk:
00000000  74 68 72 65 65 0a                                 |three.|
00000006

But if I then add this Python code:

    if split:
        output = []
        if null:
            # Filter out null byte tokens
            null_token = encoding.encode("\0")[0]
            tokens = [t for t in tokens if t != null_token]

I get this result instead (I had to drop -t to 1 because the null bytes had been filtered out):

python -c 'import sys; sys.stdout.buffer.write(b"one\0two\0three\0")' | ttok --split -t 1 --null | while IFS= read -r -d $'\0' chunk; do
    echo "Chunk:"
    echo "$chunk" | hexdump -C
done
Chunk:
00000000  6f 6e 65 0a                                       |one.|
00000004
Chunk:
00000000  74 77 6f 0a                                       |two.|
00000004

Which is wrong - three is missing and I don't understand why.

Side note: ttok --split -t 2 --tokens should output a JSON list of lists of integers.
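That side note could be as simple as this (sketch; `split_tokens_to_json` is a made-up name):

```python
import json

def split_tokens_to_json(tokens, size):
    # What `ttok --split --tokens` might emit: a JSON list of
    # lists of token integers, one inner list per chunk.
    return json.dumps(
        [tokens[i : i + size] for i in range(0, len(tokens), size)], indent=2
    )
```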

simonw commented 1 year ago

This seems to be doing the right thing:

python -c 'import sys; sys.stdout.buffer.write(b"one\0two\0three\0")'  | ttok --split -t 1 --null | hexdump -C
00000000  6f 6e 65 00 74 77 6f 00  74 68 72 65 65 0a        |one.two.three.|
0000000e

So it must be a problem with that | while pattern I'm using.

simonw commented 1 year ago

Figured out how to reverse hexdump -C with xxd -r -p (-r means reverse, -p means plain hexdump format):

python -c 'import sys; sys.stdout.buffer.write(b"one\0two\0three\0")' | ttok --split -t 1 --null | hexdump -C
00000000  6f 6e 65 00 74 77 6f 00  74 68 72 65 65 0a        |one.two.three.|
0000000e

echo "6f 6e 65 00 74 77 6f 00  74 68 72 65 65 0a" | xxd -r -p | hexdump -C
00000000  6f 6e 65 00 74 77 6f 00  74 68 72 65 65 0a        |one.two.three.|
0000000e
simonw commented 1 year ago
echo "6f 6e 65 00 74 77 6f 00 74 68 72 65 65 0a" | xxd -r -p | while IFS= read -r -d $'\0'  chunk; do
    echo "Chunk: $chunk"
done
Chunk: one
Chunk: two

Why does that not output three?

I posed that question to GPT-4 and it said:

The command does not output "three" because of the trailing newline character (represented by 0a in hex).

When the read command in the loop encounters a null character (represented by 00 in hex), it correctly interprets it as the delimiter and reads "one" and "two" as separate chunks.

However, after "two", the next delimiter it encounters is the newline character, not a null character. Because read is set to use null characters as delimiters, it treats the remainder of the input ("three\n") as a single chunk.

However, this final chunk is not printed because the read command returns a non-zero exit status when it reaches the end of the input. In a while loop, a non-zero exit status breaks the loop, so the final echo command is not executed.

One way to solve this problem and print all chunks, including the final one, is to add an additional null character at the end of the input. Here's how you can do it:

echo -e "6f 6e 65 00 74 77 6f 00 74 68 72 65 65 0a 00" | xxd -r -p | while IFS= read -r -d $'\0' chunk; do
    echo "Chunk: $chunk"
done
simonw commented 1 year ago

So in --null mode I should add a \0 at the end as well.

I tried that and it did indeed work:

python -c 'import sys; sys.stdout.buffer.write(b"one\0two\0three\0")' | ttok --split -t 1 --null | \
while IFS= read -r -d $'\0' chunk; do
    echo "Chunk:"
    echo "$chunk" | hexdump -C
done
Chunk:
00000000  6f 6e 65 0a                                       |one.|
00000004
Chunk:
00000000  74 77 6f 0a                                       |two.|
00000004
Chunk:
00000000  74 68 72 65 65 0a                                 |three.|
00000006
simonw commented 1 year ago

Current prototype:

diff --git a/ttok/cli.py b/ttok/cli.py
index 73e1652..ddce8a5 100644
--- a/ttok/cli.py
+++ b/ttok/cli.py
@@ -1,4 +1,5 @@
 import click
+import json
 import sys
 import tiktoken

@@ -10,9 +11,13 @@ import tiktoken
 @click.option(
     "-t", "--truncate", "truncate", type=int, help="Truncate to this many tokens"
 )
+@click.option("--split", is_flag=True, help="Split text based on truncate argument")
+@click.option(
+    "-0", "--null", is_flag=True, help="Output split text with null byte delimiters"
+)
 @click.option("-m", "--model", default="gpt-3.5-turbo", help="Which model to use")
 @click.option("output_tokens", "--tokens", is_flag=True, help="Output token integers")
-def cli(prompt, input, truncate, model, output_tokens):
+def cli(prompt, input, truncate, split, null, model, output_tokens):
     """
     Count and truncate text based on tokens

@@ -36,6 +41,8 @@ def cli(prompt, input, truncate, model, output_tokens):

         cat input.txt | ttok --tokens
     """
+    if split and not truncate:
+        raise click.ClickException("Cannot use --split without --truncate")
     try:
         encoding = tiktoken.encoding_for_model(model)
     except KeyError as e:
@@ -51,6 +58,28 @@ def cli(prompt, input, truncate, model, output_tokens):
             text = input_text
     # Tokenize it
     tokens = encoding.encode(text)
+
+    if split:
+        if null:
+            # Filter out null byte tokens
+            null_token = encoding.encode("\0")[0]
+            tokens = [t for t in tokens if t != null_token]
+        token_chunks = list(chunks(tokens, truncate))
+        if null:
+            click.echo(
+                "\0".join(encoding.decode(chunk) for chunk in token_chunks) + "\0"
+            )
+        else:
+            if output_tokens:
+                click.echo(json.dumps(token_chunks, indent=2))
+            else:
+                click.echo(
+                    json.dumps(
+                        [encoding.decode(chunk) for chunk in token_chunks], indent=2
+                    )
+                )
+        return
+
     if truncate:
         tokens = tokens[:truncate]

@@ -60,3 +89,8 @@ def cli(prompt, input, truncate, model, output_tokens):
         click.echo(encoding.decode(tokens), nl=False)
     else:
         click.echo(len(tokens))
+
+
+def chunks(sequence, n):
+    for i in range(0, len(sequence), n):
+        yield sequence[i : i + n]
simonw commented 1 year ago

I'm going to provide three options for splitting the incoming text:

simonw commented 1 year ago

And the option originally suggested in #2 that outputs the splits to a set of files:

FergusFettes commented 9 months ago

Looks like great progress on this, looking forward to seeing the update!

In the meantime, here is a script that does chunking with overlap into a folder, in case anyone needs it - you can probably modify it for your purposes:

#!/bin/bash

# Set the maximum number of tokens per chunk
MAX_TOKENS=1000
# Set the default overlap of tokens if not specified
OVERLAP=400

# Get the number of tokens for the overlap from the command line if provided
if [ ! -z "$2" ]; then
  OVERLAP=$2
fi

# File containing the input text
INPUT_FILE=$1

# Prefix for output files
OUTPUT_PREFIX="chunk"

# Output folder based on the input file name without extension
OUTPUT_FOLDER="$(dirname "$INPUT_FILE")/$(basename "$INPUT_FILE" .txt)/"
mkdir -p "$OUTPUT_FOLDER"

# Initialize chunk counter and start position
FILE_COUNTER=1
START_POS=0

# Read the entire input text into a variable
FULL_TEXT=$(<"$INPUT_FILE")

# Convert the full text into a stream of tokens and count the total
TOTAL_TOKENS=$(echo "$FULL_TEXT" | ttok --encode | wc -w)

echo "Total tokens: $TOTAL_TOKENS, max tokens per chunk: $MAX_TOKENS, overlap: $OVERLAP, output folder: $OUTPUT_FOLDER. Starting chunking, starting position: $START_POS"

# Process the input text into chunks with the specified overlap
while [ $START_POS -lt $TOTAL_TOKENS ]; do
  echo "Processing chunk $FILE_COUNTER. Start position: $START_POS of $TOTAL_TOKENS"

  # Calculate end position of the chunk allowing for overlap on subsequent chunks
  END_POS=$(($START_POS + $MAX_TOKENS > $TOTAL_TOKENS ? $TOTAL_TOKENS : $START_POS + $MAX_TOKENS))

  # Extract chunk of tokens from the start position to the end position
  CHUNK_TEXT=$(echo "$FULL_TEXT" | ttok --encode | cut -d' ' -f$(($START_POS + 1))-$(($END_POS)) | ttok --decode)

  # Write the chunk to a file
  echo "$CHUNK_TEXT" > "${OUTPUT_FOLDER}${OUTPUT_PREFIX}${FILE_COUNTER}.txt"

  # Increment the chunk counter
  ((FILE_COUNTER++))

  # After calculating END_POS
  if [ $END_POS -eq $TOTAL_TOKENS ]; then
    # If we've reached the end of the tokens, we can exit the loop after this chunk
    START_POS=$TOTAL_TOKENS
  else
    # Allow overlap on subsequent chunks unless it's the last chunk
    START_POS=$(($END_POS - $OVERLAP))
    # Ensuring START_POS does not become less than 0
    if [ $START_POS -lt 0 ]; then
      START_POS=0
    fi
  fi

done

echo "Chunks created in $OUTPUT_FOLDER"