huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

CLIPTokenizer (and others based on the same telephoned OpenAI code) incorrectly tokenizes 1138 out of 34483 words that have an exact match in vocab #27961

doctorpangloss opened this issue 7 months ago (status: Open)

doctorpangloss commented 7 months ago

System Info

Who can help?

@ArthurZucker and @younesbelkada

Information

Tasks

Reproduction

Visit https://colab.research.google.com/drive/18I0mYxTV-UCDjKWTxfuaR3P6o00w18Q9?usp=sharing for a reproduction.

from transformers import CLIPProcessor
tokenizer = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32").tokenizer

# match whole words
whole_words = {k: v for k, v in tokenizer.get_vocab().items() if k.endswith("</w>")}
to_trim = len("</w>")
missed = 0
for token_str, token_int in whole_words.items():
  tokenized = tokenizer.tokenize(token_str[:-to_trim])
  if len(tokenized) != 1:
    missed += 1
print(f"transformers {missed} words out of {len(whole_words)} incorrectly tokenized ({missed/len(whole_words)*100})%")

This prints: transformers 1138 words out of 34483 incorrectly tokenized (3.3001768987617086)%

I see that everyone copied OpenAI's buggy tokenization code. Besides this issue, there is also https://github.com/openai/CLIP/issues/343. The code in that repository was obviously not used for training, so this could explain a lot of misses / poor performance in CLIP-based models.

Expected behavior

Tokenization of a word that exactly matches an entry in the vocab file should return exactly 1 token.
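Expressed as a check, the expected behavior looks roughly like the sketch below, reusing the same openai/clip-vit-base-patch32 checkpoint as the reproduction above; per the numbers above, about 3% of the </w> entries currently fail it.

# A minimal sketch of the expected behavior as a check, reusing the
# checkpoint from the reproduction above. Per the numbers above, roughly
# 3% of these entries currently fail it.
from transformers import CLIPProcessor

tokenizer = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32").tokenizer

for entry in tokenizer.get_vocab():
    if not entry.endswith("</w>"):
        continue  # same filter as in the reproduction above
    word = entry[: -len("</w>")]
    pieces = tokenizer.tokenize(word)
    # expected: a word that is itself a vocab entry comes back as exactly one token
    assert len(pieces) == 1, (word, pieces)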

ArthurZucker commented 7 months ago

I'll have a look

ArthurZucker commented 7 months ago

A few things seem wrong to me:

doctorpangloss commented 7 months ago

why should each whole word be tokenized as a single word? This is not how BPE works.

I don't know. Both Hugging Face's and OpenAI's implementations in the respective repos are buggy.

The vocab file itself was trained, and it was created using the BPE implementation (or, more broadly, the tokenization implementation) that was actually correct and is not in the OpenAI repo. It would only have emitted those vocab entries if they were actually used.
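For reference, a simplified sketch of the greedy BPE merge loop is below. It is not the exact code in either repo, just the textbook procedure: merges are applied in the order (rank) in which they were learned during training.

# A simplified sketch of greedy BPE merging, not the exact code in either
# repo. bpe_ranks maps a pair of adjacent pieces to its merge priority
# (lower rank = merge earlier), as learned during training.
def bpe(word: str, bpe_ranks: dict) -> list:
    # start from single characters, with the end-of-word marker on the last one
    pieces = list(word[:-1]) + [word[-1] + "</w>"]
    while len(pieces) > 1:
        pairs = [(pieces[i], pieces[i + 1]) for i in range(len(pieces) - 1)]
        best = min(pairs, key=lambda p: bpe_ranks.get(p, float("inf")))
        if best not in bpe_ranks:
            break  # no learned merge applies; the word stays split
        merged, i = [], 0
        while i < len(pieces):
            if i < len(pieces) - 1 and (pieces[i], pieces[i + 1]) == best:
                merged.append(pieces[i] + pieces[i + 1])
                i += 2
            else:
                merged.append(pieces[i])
                i += 1
        pieces = merged
    return pieces

At least for plain alphabetic words, if "hello</w>" is in the vocab, the merges that built it were learned in that order during training, so applying the same ranks should reassemble it into one piece; that is the premise behind the numbers above.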

if the original tokenizer works this way, there is no way for us to change this as we try to reproduce the results.

The original tokenizer doesn't work this way. The OpenAI CLIP repo's tokenizer isn't the one they used to tokenize the training set their CLIP was trained on. We don't know how the original tokenizer works, but it is probably not this one, which has bugs and no tests.

What issue did this actually have on your training? CLIP powers all the stable diffusion models which are very very strong

LoRAs on the text encoder for Stability's models will perform worse if the tokenization of the captions used for an enthusiast's training doesn't match the tokenization used by Stability.

ArthurZucker commented 7 months ago

Hey, I hope the following threads help you. I was not around when the tokenizer was published, but we have had this kind of feedback over the past years, so I highly doubt there's a real issue!

doctorpangloss commented 7 months ago

Thanks for the links.

Do you think this Python code in OpenAI/clip is the one that was used to tokenize the dataset to train CLIP?

Here are the reasons it could be the code that was used to train CLIP:

Here are some reasons why, I speculate, OpenAI's code was not used to train CLIP:

You could go and implement "BPE" from the paper. Maybe you'd publish tests for it and maintain it. Maybe there would be users of it at OpenAI who would file bugs when they used the tokenization on real data.

However, the opposite is happening: it's mostly amateurs who use this code because it's free. It's people who don't deal with large datasets, or whose "KPIs" don't depend on it, so there are no outcomes that matter; they run the code, get an output, and none of it is materially measured. That puts you in the uncomfortable position, assuming your default stance is that the issue is not a real issue, of asking, "How do I delicately tell someone that they don't know what they are talking about?" And you yourself might be unsure: you're a reasonable person, you didn't invent CLIP, and you don't talk to the person who did, nor to the PMs at OpenAI, nor to anyone else there.

So maybe find the real implementation. Whether this one correctly implements BPE on paper isn't really what matters; matching the implementation used for training is.

ArthurZucker commented 7 months ago

Regarding performance, there's https://github.com/instant-labs/instant-clip-tokenizer which seems to be very efficient. I'll try to see what can be taken from it. Also, OpenAI now uses the tiktoken library, but it was not around at the time.

Unfortunately I am not quite sure what steps you are suggesting, but the slow version of the BPE (which is in transformers) is known to be slow, which is why we have the Fast version (automatically used with AutoTokenizer).

doctorpangloss commented 7 months ago

Unfortunately I am not quite sure what steps you are suggesting,

Some options:

  1. Contact someone at OpenAI and ask them to share the actual implementation they used to train CLIP, which will never happen. Not a super satisfactory answer, no. I agree that writing tokenization libraries is a grind.

  2. Compare to their likely authoritative implementation. I don't know what the authoritative implementation of BPE is. Probably sentencepiece. Maybe OpenAI forked it to make the UTF-8 / Unicode adjustment they do in their CLIP repo. Based on my reading of https://github.com/openai/CLIP/blob/a1d071733d7111c9c014f024669f959182114e33/clip/simple_tokenizer.py#L76, the BPE merge priorities are implied by the order of lines in the merges file (see the sketch after this list). It's not clear how this is preserved in the vocab.json dicts that Hugging Face ships. It is possible to convert the merges file into the .model proto that sentencepiece uses, assigning scores based on the order of the text pieces in the merges file. That is still iffy, but maybe it would get closer to the implementation used for training.

  3. Add a fuzzer and tests to this library to discover all the bugs.
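The sketch referenced in option 2: a rough rendering of how, on my reading, openai/CLIP's simple_tokenizer.py derives merge priorities purely from line order in the merges file. It is not the authoritative code, and the gzip path is just whatever local copy of the merges file you have.

# Rough sketch of how openai/CLIP's simple_tokenizer.py turns the order of
# lines in the merges file into BPE merge priorities. Not authoritative;
# the path below is an assumed local copy of the merges file.
import gzip

with gzip.open("bpe_simple_vocab_16e6.txt.gz", "rt", encoding="utf-8") as f:
    lines = f.read().split("\n")

# simple_tokenizer.py skips the header line and the tail of the file
merges = [tuple(line.split()) for line in lines[1 : 49152 - 256 - 2 + 1]]

# earlier line == higher priority: the pair on line N merges before line N+1
bpe_ranks = dict(zip(merges, range(len(merges))))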

Either way, the results of Hugging Face's implementation versus OpenAI's are different.
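As one rough way to see that on the whole-word entries, the sketch below compares transformers' slow CLIPTokenizer against the SimpleTokenizer shipped in openai/CLIP. It assumes that package is installed and that the two id spaces line up, and it is only an illustration, not a rigorous comparison.

# Rough sketch comparing transformers' slow CLIPTokenizer against the
# SimpleTokenizer shipped in openai/CLIP, on whole-word vocab entries.
# Assumes: pip install git+https://github.com/openai/CLIP.git
from clip.simple_tokenizer import SimpleTokenizer
from transformers import CLIPTokenizer

hf = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
oa = SimpleTokenizer()

differ = 0
whole_words = [t for t in hf.get_vocab() if t.endswith("</w>")]
for entry in whole_words:
    word = entry[: -len("</w>")]
    hf_ids = hf(word, add_special_tokens=False)["input_ids"]
    oa_ids = oa.encode(word)
    # assumes both vocabularies use the same ids; if not, compare the
    # token strings instead of the ids
    if hf_ids != oa_ids:
        differ += 1
print(f"{differ} of {len(whole_words)} whole-word entries encode differently")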

why this matters

If the tokenizations differ slightly, all sorts of things go wrong:

I think you already know all of this. I can't tell you directly if the issue I reported is real. I don't know, I am not an expert. But I am observing buggy behavior nonetheless. We can ticket it as something else.

github-actions[bot] commented 6 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

doctorpangloss commented 6 months ago

Unfortunately I am not quite sure what steps you are suggesting, but the slow version of the BPE (which is in transformers) is known to be slow, which is why we have the Fast version (automatically used with AutoTokenizer).

I'm bumping the issue because I'm just some guy. If someone like you, with a @huggingface email account, messages the Stability team for the actual, bona fide tokenizer implementation they used to create the tokenized captions in their proprietary dataset, you will have solved this issue. You are more experienced in this than I am, so you can imagine the impacts and the flaws that arise when the tokenizer used for training differs from the one used for fine-tuning.

I am essentially saying that it doesn't matter what the bugs are, so long as the behavior is exactly consistent with the training dataset. And I have found evidence that this BPE vocab file was adapted from the actual implementation; it doesn't even carry the merge priorities, so this code is almost certainly too buggy to match the behavior of training.

doctorpangloss commented 2 months ago

pinging this issue again

if the original tokenizer works this way, there is no way for us to change this as we try to reproduce the results.

The original tokenizer did not work this way. I guess that's all you need to know.