mlfoundations / open_clip

An open source implementation of CLIP.

OpenCLIP's tokenizer is slightly different than OpenAI's CLIP tokenizer. #453

Closed. vedantroy closed this issue 1 week ago.

vedantroy commented 1 year ago

I noticed a small discrepancy between the two tokenizers: the special tokens are named differently.

Not a huge difference, but I figured I'd make sure there's no reason for it.
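
For reference, a minimal sketch of the naming difference (assuming a recent open_clip install; the token ids are shared with OpenAI's vocab, only the string forms differ):

    from open_clip.tokenizer import SimpleTokenizer

    tok = SimpleTokenizer()  # loads the bundled default BPE vocab
    # open_clip names its special tokens '<start_of_text>' / '<end_of_text>',
    # where OpenAI's CLIP uses '<|startoftext|>' / '<|endoftext|>'; both sit
    # at the same vocab positions (49406 and 49407).
    print(tok.encoder['<start_of_text>'])  # 49406
    print(tok.encoder['<end_of_text>'])    # 49407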

vedantroy commented 1 year ago

Following up on this: here's a slightly patched version of the tokenizer from this repo (I just made one of the module-level functions an instance method):

    # `clip_tokenizer` is my locally patched copy of open_clip's tokenizer;
    # `vocab_path` points at the BPE vocab file.
    tokenizer = clip_tokenizer.SimpleTokenizer(
        bpe_path=str(vocab_path)
    )
    output = tokenizer.tokenize("hello world", context_length=77)
    print(output.shape)  # torch.Size([1, 77])
    decoded = tokenizer.decode_tensor(output[0])
    print(decoded)

And the decoded output is:

<start_of_text>hello world <end_of_text>!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

I'm assuming the exclamation points at the end are fine?

gpucce commented 1 year ago

Hi @vedantroy. For the different special tokens, I don't know if there is a specific reason. As for the exclamation marks: they appear because the tokenizer uses 0 as the padding index, but 0 is also the token id assigned to the exclamation mark. Unfortunately, I think changing this would be a lot of work, since all existing models were trained with this behavior.
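
To make the padding point concrete, a small sketch (assuming open_clip is installed; SimpleTokenizer exposes decoder and decode in recent versions):

    import open_clip
    from open_clip.tokenizer import SimpleTokenizer

    tok = SimpleTokenizer()
    ids = open_clip.tokenize("hello world")  # zero-padded out to length 77
    print(tok.decoder[0])                    # '!'  (id 0 doubles as the padding value)
    print(tok.decode(ids[0].tolist()))       # '<start_of_text>hello world <end_of_text>!!!...'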

rwightman commented 1 week ago

It's fine as long as the right special token ids are used; the outputs are equivalent. Side note: I think the transformers CLIP tokenizer had (and possibly still has) an issue where the zero padding wasn't extended out to the end like in the original OpenAI tokenizer.
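
A quick way to see the difference, sketched under the assumption that transformers' CLIPTokenizer pads with its pad_token_id ('<|endoftext|>', id 49407) rather than with 0:

    import open_clip
    from transformers import CLIPTokenizer

    hf_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
    hf_ids = hf_tok("hello world", padding="max_length", max_length=77,
                    return_tensors="pt").input_ids
    oc_ids = open_clip.tokenize("hello world")

    # Tail of each padded sequence: open_clip (like the original OpenAI code)
    # pads with zeros; depending on the transformers version, CLIPTokenizer
    # may pad with its pad_token_id instead.
    print(oc_ids[0, -5:])  # tensor([0, 0, 0, 0, 0])
    print(hf_ids[0, -5:])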