openai / CLIP

CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image
MIT License
24.55k stars 3.2k forks

tokenizer does not produce a single integer for an exact match of a word for 1029 out of 34,483 vocab entries when it should #410

Open doctorpangloss opened 8 months ago

doctorpangloss commented 8 months ago
import clip
# tokenize returns a padded tensor of ids; [0, 1] is the first token after <start_of_text>
clip.tokenize(["a"])[0, 1]

from clip.simple_tokenizer import SimpleTokenizer
tokenizer = SimpleTokenizer()
# vocab entries that represent whole words end with the </w> marker
whole_words = {k: v for k, v in tokenizer.encoder.items() if k.endswith("</w>")}
to_trim = len("</w>")
missed = 0
for token_str, token_int in whole_words.items():
  tokenized = tokenizer.encode(token_str[:-to_trim])
  if len(tokenized) != 1:
    missed += 1
print(f"openai/clip {missed} words out of {len(whole_words)} incorrectly tokenized ({missed/len(whole_words)*100:.2f}%)")

This prints:

openai/clip 1029 words out of 34483 incorrectly tokenized (2.98%)
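One plausible mechanism (an assumption on my part, not verified entry-by-entry against CLIP's vocab file): `SimpleTokenizer.encode` pre-splits its input with a regex before BPE runs, so any vocab entry whose surface form straddles a pre-tokenization boundary (apostrophe contractions, digit runs, mixed punctuation) can never be reassembled into a single token. A sketch with a simplified, ASCII-only stand-in for that pattern (the real one uses the third-party `regex` module with `\p{L}`/`\p{N}` classes):

```python
import re

# Simplified stand-in for CLIP's pre-tokenization regex; same shape as the
# original: contractions first, then letter runs, single digits, and runs of
# other punctuation. Not CLIP's actual pattern, just an illustration.
pat = re.compile(r"'s|'t|'re|'ve|'m|'ll|'d|[a-z]+|[0-9]|[^\sa-z0-9]+", re.I)

for word in ["don't", "1929", "hello"]:
    print(word, "->", pat.findall(word))
# don't -> ['don', "'t"]      two pre-tokens; BPE merges never cross them
# 1929  -> ['1', '9', '2', '9']  digits are pre-split one at a time
# hello -> ['hello']          one pre-token; can collapse to one BPE token
```

If this is the cause, the mismatched entries are not a bug in the BPE merges themselves but a consequence of the vocab containing strings the encoder's pre-tokenizer will always split.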