Craigacp opened this issue 8 months ago
There is no CJK handling in the code base; we will take a look.
After installing ftfy I now get a different output from HF's tokenizer which is closer to the CLIPTokenizer output, but it still differs in the last token before the EOS token. This happens with both the slow and fast variants. The last token before EOS from HF is `²</w>`, and the last one from the ONNX op is `<|endoftext|>`, presumably because it hit an unknown token. According to HF, `²</w>` has id 366 in `openai/clip-vit-base-patch32`, but I'm not sure why that's not being picked up by the ONNX implementation.
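For reference, the `</w>` suffix comes from CLIP's byte-level pre-tokenization: every UTF-8 byte is mapped to a printable stand-in character, and the last symbol of each word gets an `</w>` suffix before BPE merges run. A self-contained sketch of that step (reimplemented here for illustration; not HF's actual code):

```python
# Illustration of why HF's CLIP tokenizer produces tokens ending in "</w>"
# for non-ASCII glyphs: byte-level BPE maps each UTF-8 byte to a printable
# stand-in character (the GPT-2 bytes_to_unicode table) and tags the end of
# every word with "</w>" before BPE merges are applied.

def bytes_to_unicode():
    # Same table as GPT-2/CLIP: printable bytes map to themselves, the rest
    # are shifted into the 256+ codepoint range so every byte is visible.
    bs = list(range(ord("!"), ord("~") + 1)) + \
         list(range(ord("¡"), ord("¬") + 1)) + \
         list(range(ord("®"), ord("ÿ") + 1))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

BYTE_MAP = bytes_to_unicode()

def clip_pre_tokenize(word):
    # Map each UTF-8 byte to its printable stand-in, then suffix the last
    # symbol with "</w>" as CLIP does before running BPE merges.
    symbols = [BYTE_MAP[b] for b in word.encode("utf-8")]
    symbols[-1] = symbols[-1] + "</w>"
    return symbols

print(clip_pre_tokenize("²"))  # ['Â', '²</w>']
print(clip_pre_tokenize("试"))  # three byte symbols, the last suffixed with </w>
```

So for `²` (UTF-8 `0xC2 0xB2`) the final pre-merge symbol is exactly `²</w>`, matching the token HF reports before EOS; a CJK glyph likewise splits into three byte symbols with `</w>` on the last. If the ONNX op skips this suffixing for the byte-fallback path, the resulting symbols would miss the vocabulary entries and fall through to `<|endoftext|>`/unk.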
I've exported `openai/clip-vit-base-patch32` from HuggingFace into a single-op ONNX model which uses `CLIPTokenizer`. When comparing the behaviour to the original HF tokenizer I'm seeing an issue with the tokenization of Chinese characters. HuggingFace appends `</w>` to the end of each glyph, and the ONNX op doesn't, so we get different tokenizations. I ran basically the same code to export `gpt2` from HuggingFace, and with the same string the tokenization matches, so I think this is something in how HuggingFace's CLIP implementation handles the byte-level fallback that isn't mirrored properly in the ONNX op. It's also doubling the `<|endoftext|>` token; I'm not sure what's causing that. I couldn't see any relevant attributes or other arguments to the op to change its behaviour. The op does correctly tokenize some examples of western European languages that I'd put in the test set.

The example is given below (it's "This is a test string which checks the tokenizer" fed through Google Translate, as I don't speak Chinese):
I'm constructing the `CLIPTokenizer` model using this code:
I checked it with onnxruntime-extensions 0.9.0 and with this commit (b072e94afd67cb96fc034432a4c3b57657e01465) from `main`.