mlfoundations / open_clip

An open source implementation of CLIP.

Add `get_logits` method and NLLB tokenizer #756

Closed: visheratin closed this PR 10 months ago

visheratin commented 10 months ago

Hi!

I want to make OpenCLIP more usable for downstream applications, such as zero-shot classification. Right now, to get the logits, the user has to call encode_image and encode_text, matmul the resulting features, multiply by logit_scale, and optionally add logit_bias. I think it makes sense to have a single method that returns the logits, as the OpenAI and HuggingFace implementations do. So I added a get_logits method to both the CLIP and CustomTextCLIP classes.
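For reference, here is a minimal sketch of those manual steps rolled into one helper, written against the public encode_image/encode_text API. It illustrates the idea only and is not necessarily the exact code in this PR:

def get_logits(model, image, text):
    # encode and L2-normalize both modalities
    image_features = model.encode_image(image, normalize=True)
    text_features = model.encode_text(text, normalize=True)
    # scaled cosine similarities between every image and every text
    logits_per_image = model.logit_scale.exp() * image_features @ text_features.T
    # some models (e.g. SigLIP-style) also learn an additive bias
    logit_bias = getattr(model, "logit_bias", None)
    if logit_bias is not None:
        logits_per_image = logits_per_image + logit_bias
    logits_per_text = logits_per_image.T
    return logits_per_image, logits_per_text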

I also added an NLLBTokenizer class that takes an additional langs parameter in its __call__ method. This is needed because the tokenizer for NLLB models adds a language token to the beginning of each sequence. Which token gets added is controlled by the set_src_lang_special_tokens method; if the language is never set this way, the tokenizer adds the English token to every sequence.
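For illustration, this is roughly the per-text language handling the langs parameter is meant to wrap, shown here against the underlying Hugging Face tokenizer directly (the checkpoint name and example texts are assumptions for this sketch):

from transformers import AutoTokenizer

# an illustrative NLLB checkpoint; its tokenizer exposes set_src_lang_special_tokens
hf_tok = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")

texts = ["a photo of a cat", "'n foto van 'n kat"]  # English, Afrikaans
langs = ["eng_Latn", "afr_Latn"]

token_ids = []
for text, lang in zip(texts, langs):
    # switch the prefix language token before encoding each text
    hf_tok.set_src_lang_special_tokens(lang)
    token_ids.append(hf_tok(text).input_ids)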

The PR also contains some formatting changes performed by Ruff. Personally, I think the formatted code looks nicer, but if you don't like it, I can roll it back.

@gabrielilharco @rwightman

rwightman commented 10 months ago

@visheratin can we do this without running Black over it and changing more lines than are being added?

rwightman commented 10 months ago

also, should probably be rebased against latest

visheratin commented 10 months ago

@rwightman Left only my changes and rebased.

rwightman commented 10 months ago

So, for the tokenizer, I'm not convinced it warrants a new tokenizer class and the associated maintenance. Isn't it pretty standard for multilingual use to manually insert the language token per text? Are there any popular implementations that do it this way?

rwightman commented 10 months ago

On get_logits: it's useful to have, but I'll point out that this won't work with torch.compile or FSDP, which only wrap the forward() method.
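For example (a minimal sketch; the model name, pretrained tag, and image path are assumptions), only calls that go through forward() benefit from the wrapper, so get_logits runs eagerly on the original module:

import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

image = preprocess(Image.open("cat.jpg")).unsqueeze(0)  # hypothetical local image
text = tokenizer(["a photo of a cat"])

compiled = torch.compile(model)       # only forward() is captured by the compiler
_ = compiled(image, text)             # runs through the compiled graph
_ = compiled.get_logits(image, text)  # delegates to the original module, executes eagerly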

visheratin commented 10 months ago

Regarding the tokenizer: with add_special_tokens set to True by default, M2M100Tokenizer adds the language token automatically. If the user also specifies the token in the text without calling set_src_lang_special_tokens, there will be two language tokens in the output, e.g., tensor([[256147, 256006, 718, 1159, ...]]) (256147 for English and 256006 for Afrikaans).
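A sketch of that pitfall (the checkpoint and text are illustrative assumptions): the language codes are registered as special tokens, so writing one into the text while the tokenizer's own prefix is still the English default yields two language tokens.

from transformers import AutoTokenizer

hf_tok = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")

# the source language was never changed, so the tokenizer still prepends eng_Latn;
# the afr_Latn code written into the text encodes to its own id as well
ids = hf_tok("afr_Latn 'n foto van 'n kat").input_ids
# -> [<eng_Latn id>, <afr_Latn id>, ...]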

When making an additional tokenizer, I tried to look at the problem from the end-user perspective. With the current HFTokenizer, if the user wants to use an NLLB-based model, they'd have to figure out a proper way to encode the inputs on their own. Having an alternative implementation with the langs parameter and a warning helps to indicate what needs to be done to get proper tokens for the model.

visheratin commented 10 months ago

Regarding get_logits and torch.compile or FSDP, I understand the limitation. The main usage of the method would be to do something like:

from open_clip import create_model_and_transforms, get_tokenizer
from PIL import Image

model, _, transform = create_model_and_transforms(...)
tokenizer = get_tokenizer(...)

text_inputs = tokenizer(...)
image = Image.open(...)
image_inputs = transform(image).unsqueeze(0)  # add a batch dimension
logits_per_image, logits_per_text = model.get_logits(image_inputs, text_inputs)

An example of such usage can be found in the roboflow/supervision library.

rwightman commented 10 months ago

@visheratin rather than making a whole new tokenizer for this with a different tokenize interface, couldn't we pass through the set_src_lang methods and/or a src lang init kwarg, and assert/report an error if it's called on an underlying HF tokenizer that doesn't have it?

visheratin commented 10 months ago

Passing through the set_src_lang method is an option. But it will still be non-obvious to the user that they may need to call this method if they want to use languages other than English. Also, if we go this way, users with multilingual input have to implement the logic that is now in the NLLBTokenizer themselves.

Maybe multilingual input text is too much of an edge case. If you think so, I can remove the NLLBTokenizer from the PR. If the user wants to change the source language, they can always call tokenizer.tokenizer.set_src_lang, as is done in the CLIP benchmark.

rwightman commented 10 months ago

> Passing through the set_src_lang method is an option. But it will still be non-obvious to the user that they may need to call this method if they want to use languages other than English. Also, if we go this way, users with multilingual input have to implement the logic that is now in the NLLBTokenizer themselves.
>
> Maybe multilingual input text is too much of an edge case. If you think so, I can remove the NLLBTokenizer from the PR. If the user wants to change the source language, they can always call tokenizer.tokenizer.set_src_lang, as is done in the CLIP benchmark.

Isn't this how the HF tokenizers work, though? I think they have some sort of specific tokenization method with src/target lang as args, but usually it's either set when constructing the tokenizer or via the set method, no?

visheratin commented 10 months ago

In the case of M2M100Tokenizer, the source language token is set as a prefix token, which doesn't require a specific tokenization method.

I removed the NLLBTokenizer and added logic to HFTokenizer for setting the language on init.
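A rough sketch of what that shape of change could look like, trimmed to the relevant parts (the language kwarg and set_language helper are names assumed here for illustration, not necessarily the merged API):

import warnings
from transformers import AutoTokenizer

class HFTokenizer:
    def __init__(self, tokenizer_name, context_length=77, language=None):
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
        self.context_length = context_length
        if language is not None:
            self.set_language(language)

    def set_language(self, src_lang):
        # only multilingual tokenizers (e.g. NLLB/M2M100) expose this hook
        if hasattr(self.tokenizer, "set_src_lang_special_tokens"):
            self.tokenizer.set_src_lang_special_tokens(src_lang)
        else:
            warnings.warn("Cannot set the source language for this tokenizer.")

With something along these lines, per-text language switching would still be possible by calling the setter before each tokenizer call.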