better visualization for the merge process in the educational implementation

youkaichao commented 1 year ago

Previous visualization:

Now:

Better to see which two tokens are merged!

youkaichao commented 1 year ago

Well, I find that the colors are clear enough to distinguish single tokens, but I have a hard time finding the two tokens that get merged (my eyes need a linear scan of adjacent lines to identify them).

Regarding non-ASCII support: since merging can produce tokens that cover only part of a Unicode character (a token may be just some of the bytes of a character), visualise_tokens actually has only limited support for non-ASCII text.

But anyway, this is only for educational purposes, right? I think we can focus on tokenizing ASCII strings.

The image below shows the case of encoding the Chinese version of "hello, world". An exception occurs, as expected.
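The failure mode can be reproduced without tiktoken. Below is a minimal sketch, assuming the tokens start out as the individual UTF-8 bytes of the input (the sample string and variable names are only illustrative):

```python
text = "你好"
data = text.encode("utf-8")          # each of these characters takes 3 bytes
tokens = [bytes([b]) for b in data]  # one token per byte

# Strictly decoding a token that holds only part of a character raises.
try:
    tokens[0].decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# errors="replace" substitutes U+FFFD (the replacement character) instead.
replaced = [t.decode("utf-8", errors="replace") for t in tokens]
```

With strict decoding, any token that is not a whole character fails, which is why ASCII-only input is the safe case here.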

youkaichao commented 1 year ago

Oh, I know: the difficulty in telling which tokens get merged is that the color mapping for tokens changes across lines.

Take the following image as an example:

h e l l o
h el l o

Naturally, when e and l get merged, I would expect the second l and the trailing o to keep their colors. However, there is actually a color shift for these tokens. Therefore, my eyes start to complain.

Adding ^ helps my eyes to avoid matching colors across lines.
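To make the shift concrete, here is a small sketch that pairs each token with a color chosen purely by its position in the list, which is roughly what cycling through a fixed palette does. The helper name colors_by_index is hypothetical; the palette codes are the ones used in visualise_tokens.

```python
import itertools

PALETTE = [167, 179, 185, 77, 80, 68, 134]

def colors_by_index(tokens):
    """Pair each token with the palette entry for its position in the list."""
    return list(zip(tokens, itertools.cycle(PALETTE)))

before = colors_by_index(["h", "e", "l", "l", "o"])
after = colors_by_index(["h", "el", "l", "o"])

# The trailing "o" is the fifth token before the merge and the fourth after,
# so its color changes even though the token itself did not.
print(before[-1], after[-1])
```

Every token to the right of a merge moves one position to the left, so all of them change color at once.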

hauntsaninja commented 1 year ago

Okay, I see. Maybe we do something like:

This will prevent things shifting over.

diff --git a/tiktoken/tiktoken/_educational.py b/tiktoken/tiktoken/_educational.py
index 692a8bb8b2e..c7d9f0194f1 100644
--- a/tiktoken/tiktoken/_educational.py
+++ b/tiktoken/tiktoken/_educational.py
@@ -185,11 +185,23 @@ def bpe_train(

 def visualise_tokens(token_values: list[bytes]) -> None:
-    backgrounds = itertools.cycle(
-        [f"\u001b[48;5;{i}m".encode() for i in [167, 179, 185, 77, 80, 68, 134]]
-    )
-    interleaved = itertools.chain.from_iterable(zip(backgrounds, token_values))
-    print((b"".join(interleaved) + "\u001b[0m".encode()).decode("utf-8"))
+    background = [f"\u001b[48;5;{i}m" for i in [167, 179, 185, 77, 80, 68, 134]]
+    # If token boundaries do not occur at unicode character boundaries, it's unclear how best to
+    # visualise the token. Here, we'll just use the unicode replacement character to represent some
+    # fraction of a character.
+    unicode_token_values = [x.decode("utf-8", errors="replace") for x in token_values]
+
+    running_length = 0
+    last_color = None
+    for token in unicode_token_values:
+        color = background[running_length % len(background)]
+        if color == last_color:
+            color = background[(running_length + 1) % len(background)]
+            assert color != last_color
+        last_color = color
+        running_length += len(token)
+        print(color + token, end="")
+    print("\u001b[0m")

Let me know what you think! :-)

youkaichao commented 1 year ago

This looks good visually, but it will run into problems when consecutive tokens are merged seven times, given that you only use 7 colors.

For example, a string with 8 tokens x:

x x x x x x x x
xx x x x x x x
xxx x x x x x
xxxx x x x x
xxxxx x x x
xxxxxx x x
xxxxxxx x (Oops, they are two tokens but with the same color)

The first token and the last token share the same color in the beginning. If the merges happen to proceed from left to right, then the color can no longer indicate the boundary.

I don't know if this is a real concern. The token x is abstract here; I don't literally mean the letter "x".

hauntsaninja commented 1 year ago

Hmm that's what the if color == last_color: case in my code above is meant to handle — but maybe I'm missing something!

youkaichao commented 1 year ago

Oh, I didn't notice your full code. GitHub is bad at this: it displays only part of the code without telling me that there is a slider to show more :-(

I think your new code is fine: adjacent tokens get different colors, and adjacent lines try to keep the same color for tokens starting at the same position.
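Both properties can be checked with a small sketch that mirrors the color choice in the diff above but returns the palette codes instead of printing. The function name assign_colors is made up for illustration.

```python
PALETTE = [167, 179, 185, 77, 80, 68, 134]

def assign_colors(tokens: list[str]) -> list[int]:
    """Pick a color per token from its starting character offset."""
    colors = []
    running_length = 0
    last_color = None
    for token in tokens:
        color = PALETTE[running_length % len(PALETTE)]
        if color == last_color:
            # Same color as the previous token: step to the next palette entry.
            color = PALETTE[(running_length + 1) % len(PALETTE)]
        last_color = color
        running_length += len(token)
        colors.append(color)
    return colors

# Eight x's merged down to "xxxxxxx" + "x": the second token starts at
# offset 7, which wraps to the first color, so the fallback kicks in.
print(assign_colors(["x" * 7, "x"]))

# Tokens to the right of a merge keep their colors across rows.
print(assign_colors(["h", "e", "l", "l", "o"]))
print(assign_colors(["h", "el", "l", "o"]))
```

Because the color depends on where a token starts rather than on its index, merging two tokens leaves everything to their right untouched.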

LGTM!

youkaichao commented 1 year ago

One additional comment: the color scheme can be improved. Colors 179 and 185 look somewhat similar. So do 77 and 80.

I tried to find appropriate colors from https://en.wikipedia.org/wiki/ANSI_escape_code , and landed on these colors:

[1, 2, 3, 5, 6, 7, 9, 10, 11, 13, 14]

Each of these eleven colors is visually distinct from its neighbors in the list.

I removed black/white/blue colors as they make it hard to see the foreground token text.
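A quick way to compare candidate palettes is to print a swatch per color code in the terminal. This is only a sketch: the swatch helper is hypothetical, and it uses the same 48;5;N background escape sequence as visualise_tokens.

```python
CANDIDATES = [1, 2, 3, 5, 6, 7, 9, 10, 11, 13, 14]

def swatch(code: int) -> str:
    """Render the color code as text on a background of that color."""
    return f"\u001b[48;5;{code}m {code:>3} \u001b[0m"

# One row of swatches; how the colors actually look depends on the terminal theme.
print("".join(swatch(c) for c in CANDIDATES))
```

Running this next to a row for [167, 179, 185, 77, 80, 68, 134] makes the two schemes easy to compare side by side.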

youkaichao commented 1 year ago

Hi @hauntsaninja, I integrated your code to avoid color shifting between consecutive rows, and used more distinguishable colors for adjacent tokens. The result now looks like the following, which I think is much more visually helpful. What do you think?

My test case is:

from tiktoken._educational import SimpleBytePairEncoding
enc = SimpleBytePairEncoding.from_tiktoken("cl100k_base")
enc.encode("hello world aaaaaaaaaaaa")

hauntsaninja commented 1 year ago

Thanks, I merged the code from https://github.com/openai/tiktoken/pull/144#issuecomment-1586866438 a while ago, and this is now released in 0.5.0 :-)

youkaichao commented 1 year ago

Looks good to me. Although I think the color scheme [1, 2, 3, 5, 6, 7, 9, 10, 11, 13, 14] would be visually more helpful, it is not an important concern.
