Craigacp opened this issue 8 months ago
There is no CJK handling in the code base; we will take a look.
After installing ftfy I now get a different output from HF's tokenizer which is closer to the CLIPTokenizer output, but it still differs in the last token before the EOS token. This happens with both the slow and fast variants. The last token before EOS from HF is `²</w>`, and the last one from the ONNX op is `<|endoftext|>`, presumably because it hit an unknown token. According to HF, `²</w>` has id 366 in `openai/clip-vit-base-patch32`, but I'm not sure why that's not being picked up by the ONNX implementation.
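For reference, the `</w>` suffix comes from CLIP's byte-level pre-tokenization: every UTF-8 byte is mapped to a printable stand-in character, and the last symbol of each word gets an `</w>` suffix before BPE merges run. A self-contained sketch of that step (reimplemented here for illustration; not HF's actual code):

```python
# Illustration of why HF's CLIP tokenizer produces tokens ending in "</w>"
# for non-ASCII glyphs: byte-level BPE maps each UTF-8 byte to a printable
# stand-in character (the GPT-2 bytes_to_unicode table) and tags the end of
# every word with "</w>" before BPE merges are applied.

def bytes_to_unicode():
    # Same table as GPT-2/CLIP: printable bytes map to themselves, the rest
    # are shifted into the 256+ codepoint range so every byte is visible.
    bs = list(range(ord("!"), ord("~") + 1)) + \
         list(range(ord("¡"), ord("¬") + 1)) + \
         list(range(ord("®"), ord("ÿ") + 1))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

BYTE_MAP = bytes_to_unicode()

def clip_pre_tokenize(word):
    # Map each UTF-8 byte to its printable stand-in, then suffix the last
    # symbol with "</w>" as CLIP does before running BPE merges.
    symbols = [BYTE_MAP[b] for b in word.encode("utf-8")]
    symbols[-1] = symbols[-1] + "</w>"
    return symbols

print(clip_pre_tokenize("²"))  # ['Â', '²</w>']
print(clip_pre_tokenize("试"))  # three byte symbols, the last suffixed with </w>
```

So for `²` (UTF-8 `0xC2 0xB2`) the final pre-merge symbol is exactly `²</w>`, matching the token HF reports before EOS; a CJK glyph likewise splits into three byte symbols with `</w>` on the last. If the ONNX op skips this suffixing for the byte-fallback path, the resulting symbols would miss the vocabulary entries and fall through to `<|endoftext|>`/unk.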
I've exported `openai/clip-vit-base-patch32` from HuggingFace into a single-op ONNX model which uses `CLIPTokenizer`. When comparing the behaviour to the original HF tokenizer I'm seeing an issue with the tokenization of Chinese characters. HuggingFace appends `</w>` to the end of each glyph, and the ONNX op doesn't, so we get different tokenizations. I ran basically the same code to export `gpt2` from HuggingFace, and with the same string the tokenization matches, so I think this is something in how HuggingFace's CLIP implementation handles the byte-level fallback that isn't mirrored properly in the ONNX op. It's also doubling the `<|endoftext|>` token; I'm not sure what's causing that. I couldn't see any relevant attributes or other arguments to the op to change its behaviour. The op does correctly tokenize some examples of western European languages that I'd put in the test set.

The example is given below (it's "This is a test string which checks the tokenizer" fed through Google Translate, as I don't speak Chinese):
I'm constructing the `CLIPTokenizer` model using this code:
I checked it with onnxruntime-extensions 0.9.0 and with this commit (b072e94afd67cb96fc034432a4c3b57657e01465) from `main`.