guidance-ai / guidance

A guidance language for controlling large language models.

[Bug] Phi-3-vision-128k-instruct Tokenizer failing constructor #902

Open · riedgar-ms opened this issue 5 months ago

riedgar-ms commented 5 months ago

The bug

The 'quick spot check to verify we can rebuild complex multi-token unicode symbols' check in the TransformersTokenizer constructor is failing for Phi-3-vision-128k-instruct.
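
For context, the check does roughly the following (an illustrative sketch of the round-trip described later in this thread, not the exact guidance code): tokenise a string of multi-byte symbols with the model's tokeniser, map each token string back to bytes through the byte decoder, and assert the original string is recovered.

```python
# Illustrative sketch of the round-trip spot check (not the exact guidance code).
def spot_check(hf_tokenizer, byte_decoder, s):
    reconstructed = b""
    for i in hf_tokenizer(s)["input_ids"]:
        piece = hf_tokenizer.convert_ids_to_tokens(i)
        # every character of the token string must have a byte_decoder entry
        reconstructed += bytes(byte_decoder[c] for c in piece)
    assert s in reconstructed.decode("utf-8")
```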

This issue is being split out of #880 , which may well encapsulate other bugs.

To Reproduce

See PR #899 which introduces a test_tokenizers.py file to help with these.
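
As a minimal reproduction sketch (my assumption of typical usage; the exact guidance entry point and kwargs may differ between versions), constructing a Transformers model for this checkpoint runs the TransformersTokenizer constructor and hits the failing spot check:

```python
# Hypothetical minimal reproduction; constructing the model triggers the
# TransformersTokenizer spot check, which fails for this checkpoint.
from guidance import models

lm = models.Transformers(
    "microsoft/Phi-3-vision-128k-instruct",
    trust_remote_code=True,  # Phi-3 vision ships custom code on the Hub
)
```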

System info (please complete the following information):

riedgar-ms commented 5 months ago

Tagging @slundberg for suggestions on how to address this. I've been looking at the history of that spot check, and I'm not yet clear on how best to fix the issues it raises.

riedgar-ms commented 4 months ago

Some thoughts after partial investigation.

The problem occurs in the 'fall back to GPT2 byte-decoder' portion of the TransformersTokenizer constructor. The Phi-3-vision-128k-instruct model uses a Llama tokeniser here, while Phi-2 uses a CodeGen tokeniser.

This portion of the code round-trips a string s = "’•¶∂ƒ˙∆£Ħ爨ൠᅘ∰፨" through the hybrid tokeniser being created: the string is tokenised using the model's tokeniser, but the byte_decoder from GPT2 is then used to turn those tokens back into bytes. The actual problem occurs in transformers_tokenizer.convert_ids_to_tokens(i). For the first character in this string, "’" (a 'curly' apostrophe), the Llama tokeniser used by Phi-3-vision-128k-instruct returns the character exactly (there is an extra complication around a <s> being prepended); the problem is that the curly apostrophe is not in the byte decoder (its `ord()` value is over 8000, although it turns out that's not the key issue). The byte decoding then throws an exception.
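
To make the failing path concrete, here is a rough sketch under a couple of assumptions: that the fallback byte decoder is the GPT2 char-to-byte map (built here from transformers' bytes_to_unicode helper) and that the model's tokeniser can be loaded directly with AutoTokenizer.

```python
from transformers import AutoTokenizer
from transformers.models.gpt2.tokenization_gpt2 import bytes_to_unicode

# GPT2-style byte decoder: maps the byte-level BPE characters back to byte values
byte_decoder = {v: k for k, v in bytes_to_unicode().items()}

tok = AutoTokenizer.from_pretrained(
    "microsoft/Phi-3-vision-128k-instruct", trust_remote_code=True
)
ids = tok("’")["input_ids"]  # note the prepended <s>
pieces = [tok.convert_ids_to_tokens(i) for i in ids]
print(pieces)  # the curly apostrophe comes back verbatim

# "’" (ord 8217) has no entry in the GPT2 byte decoder, so this lookup raises KeyError
for piece in pieces:
    bytes(byte_decoder[c] for c in piece)
```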

In contrast, the CodeGen tokeniser used by Phi-2 does something a little different. In this case convert_ids_to_tokens() returns multiple characters: the first two tokens give the three characters âĢĻ. Each of these is in the byte decoder, so things can then proceed. Now, "’".encode() is three bytes long, and the first byte is actually ord("â"), but the other two characters have ord() values beyond a single byte (although, as noted, they are in the GPT2 byte decoder).
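
The byte-level encoding explains where those three characters come from. A small sketch using transformers' bytes_to_unicode helper (my assumption being that this is the same char/byte mapping the GPT2 fallback relies on):

```python
from transformers.models.gpt2.tokenization_gpt2 import bytes_to_unicode

byte_encoder = bytes_to_unicode()                       # byte value -> printable char
byte_decoder = {v: k for k, v in byte_encoder.items()}  # printable char -> byte value

raw = "’".encode("utf-8")
print(list(raw))  # [226, 128, 153]; three bytes, the first equal to ord("â")
mapped = "".join(byte_encoder[b] for b in raw)
print(mapped)     # 'âĢĻ'; each character has a byte_decoder entry
assert bytes(byte_decoder[c] for c in mapped) == raw
```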

I'm not quite sure what to do with this information yet, but I'm working on it.

wllhf commented 3 days ago

Any news on this? Trying to get Phi3.5-Vision to run via Transformers.