2023/06/02 byte pair encoding notes

imyutaro commented 1 year ago

There seem to be two kinds of byte pair encoding: character-level and byte-level.

The GPT-2 and RoBERTa tokenizers (which are pretty similar) have a clever way to deal with this: they don’t look at words as being written with Unicode characters, but with bytes. This way the base vocabulary has a small size (256), but every character you can think of will still be included and not end up being converted to the unknown token. This trick is called byte-level BPE.

https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt
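
To make the "base vocabulary of 256" point concrete, here is a minimal sketch of the byte-level view of a string. It only shows the starting point (raw UTF-8 bytes), not the merge steps of BPE, and it is not how GPT-2's tokenizer is actually implemented; the sample text and variable names are my own.

```python
# Byte-level view: any string is a sequence of UTF-8 bytes,
# so 256 base symbols are enough to represent every input.
text = "猫 cat"
byte_ids = list(text.encode("utf-8"))

# "猫" takes three bytes; each ASCII character takes one.
print(byte_ids)

# The base vocabulary is simply every possible byte value.
base_vocab = {bytes([i]) for i in range(256)}
covered = all(bytes([b]) in base_vocab for b in byte_ids)

# Going back from byte ids to text is lossless.
roundtrip = bytes(byte_ids).decode("utf-8")
print(covered, roundtrip)
```

Since every character decomposes into bytes that are already in the vocabulary, no unknown token is needed.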

Notably, we split all numbers into individual digits, and fallback to bytes to decompose unknown UTF-8 characters.

Judging from the quote above, the LLaMA tokenizer seems to be SentencePiece's character-level byte pair encoding, with unknown characters falling back to bytes. GPT-2 uses byte-level BPE.
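
A rough sketch of what "fall back to bytes" could look like, assuming a toy character vocabulary and no merges at all. The function names and the vocabulary are hypothetical, and the `<0xXX>` piece naming is my assumption about how byte pieces are usually written; this is not the SentencePiece implementation.

```python
def encode_with_byte_fallback(text, vocab):
    """Character-level lookup; characters missing from `vocab` become byte pieces."""
    pieces = []
    for ch in text:
        if ch in vocab:
            pieces.append(ch)
        else:
            # Unknown character: emit one piece per UTF-8 byte instead of <unk>.
            pieces.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return pieces


def decode_pieces(pieces):
    """Inverse of the toy encoder: byte pieces are turned back into raw bytes."""
    buf = bytearray()
    for p in pieces:
        if len(p) == 6 and p.startswith("<0x") and p.endswith(">"):
            buf.append(int(p[3:5], 16))
        else:
            buf.extend(p.encode("utf-8"))
    return buf.decode("utf-8")


toy_vocab = set("abcdefghijklmnopqrstuvwxyz ")  # hypothetical vocabulary
pieces = encode_with_byte_fallback("a 猫", toy_vocab)
print(pieces)
print(decode_pieces(pieces))
```

The difference from byte-level BPE is that known characters stay as single symbols here, and bytes appear only for characters outside the vocabulary.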

imyutaro commented 1 year ago

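[2305.07185] MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers. Instead of using a tokenizer, this apparently splits the input into fixed-length patches and trains on those as tokens, an approach similar to [2201.09792] Patches Are All You Need?. In the section on tokenization, they say the advantage is that it is tokenizer-free and can process images and audio in the same way.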
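
The patching step alone could be sketched as below. The patch size of 4 and the zero padding are arbitrary choices for illustration, not values taken from the paper, and the model that consumes the patches is not shown.

```python
def to_patches(data: bytes, patch_size: int, pad: int = 0):
    """Split a byte sequence into fixed-length patches, padding the last one."""
    # Round the length up to the next multiple of patch_size.
    padded_len = -(-len(data) // patch_size) * patch_size
    padded = data.ljust(padded_len, bytes([pad]))
    return [padded[i:i + patch_size] for i in range(0, padded_len, patch_size)]


raw = "吾輩は猫である".encode("utf-8")  # 7 characters, 3 bytes each
patches = to_patches(raw, patch_size=4)
for p in patches:
    print(p)
```

Note that patch boundaries ignore character boundaries, so a single Japanese character can straddle two patches. The same function works unchanged on image or audio bytes.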

imyutaro commented 1 year ago

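On this site, https://platform.openai.com/tokenizer, Japanese text is not decoded properly; it looks like the internal decode handling is wrong, presumably because each token is decoded on its own even though the UTF-8 bytes of one character can be split across tokens. With this library, https://github.com/openai/tiktoken/, decoding each token to a byte string, concatenating the byte strings, and only then calling bytes.decode gave correctly decoded Japanese.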

import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
assert enc.decode(enc.encode("hello world")) == "hello world"

# To get the tokeniser corresponding to a specific model in the OpenAI API:
enc = tiktoken.encoding_for_model("gpt-4")

inp = """
吾輩は猫である。名前はまだ無い。どこで生れたかとんと見当がつかぬ。何でも薄暗いじめじめした所でニャーニャー泣いていた事だけは記憶している。吾輩はここで始めて人間というものを見た。しかもあとで聞くとそれは書生という人間中で一番獰悪な種族であったそうだ。この書生というのは時々我々を捕えて煮て食うという話である。しかしその当時は何という考もなかったから別段恐しいとも思わなかった。ただ彼の掌に載せられてスーと持ち上げられた時何だかフワフワした感じがあったばかりである。掌の上で少し落ちついて書生の顔を見たのがいわゆる人間というものの見始であろう。この時妙なものだと思った感じが今でも残っている。第一毛をもって装飾されべきはずの顔がつるつるしてまるで薬缶だ。その後猫にもだいぶ逢ったこんな片輪には一度も出会わした事がない。のみならず顔の真中があまりに突起している。そうしてその穴の中から時々ぷうぷうと煙を
"""

# Works: decode the whole token sequence at once
print(enc.decode(enc.encode(inp)))

# Works: get the bytes of each token, concatenate them, then decode once
dec = b"".join([enc.decode_single_token_bytes(i) for i in enc.encode(inp)]).decode("utf-8", errors="replace")
print(dec)

# Does not work: decoding each token separately breaks characters whose bytes span several tokens
dec = " ".join([enc.decode_single_token_bytes(i).decode("utf-8", errors="replace") for i in enc.encode(inp)])
print(dec)
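
If tokens have to be decoded one at a time (for streaming output, say), one possible workaround is an incremental decoder from the standard library, which buffers incomplete byte sequences until the rest arrives. The sketch below uses hand-written byte chunks in place of real token bytes so that it runs without tiktoken; the chunk boundaries are made up.

```python
import codecs

# Hypothetical per-token bytes: the three bytes of "猫" are split across two chunks.
chunks = [b"\xe7", b"\x8c\xab", b" cat"]

# Decoding each chunk independently yields replacement characters (U+FFFD).
per_chunk = "".join(c.decode("utf-8", errors="replace") for c in chunks)

# An incremental decoder keeps the partial character until it is complete.
decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
streamed = "".join(decoder.decode(c) for c in chunks)
streamed += decoder.decode(b"", final=True)  # flush anything left in the buffer

print(repr(per_chunk))
print(streamed)
```

With tiktoken, the chunks would come from `enc.decode_single_token_bytes(i)` as in the code above.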

imyutaro commented 1 year ago

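The output generated by a model trained with byte-level BPE should consist of byte-level tokens, but can it end up as an invalid byte sequence? When decoding from bytes (\x08\x84…), it seems possible to get a garbled string (something like こん˜åち…). I wonder how often this actually happens.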
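
This does not answer how often it happens, but one way to measure it would be to check whether the generated bytes are valid UTF-8. A minimal sketch, using a truncated string as a stand-in for a model output that stops in the middle of a character (the helper name is my own):

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes as UTF-8 without errors."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


good = "こんにちは".encode("utf-8")
truncated = good[:-1]  # the last character is missing its final byte

print(is_valid_utf8(good), is_valid_utf8(truncated))

# With errors="replace", the broken tail becomes U+FFFD instead of raising.
repaired = truncated.decode("utf-8", errors="replace")
print(repaired)
```

Running such a check over a batch of generated outputs would give a rough rate for this kind of failure.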
