Closed visheratin closed 10 months ago
@visheratin can we do this without running Black on it and changing more lines than are being added?
also, should probably be rebased against latest
@rwightman Left only my changes and rebased.
So, for the tokenizer, I'm not convinced it warrants a new tokenizer and associated maintenance. Isn't it pretty standard for multi-lingual to manually insert the language token per text? Are there any popular impl which do it this way?
On `get_logits`: it's useful to have, but I'll point out this won't work with torch.compile or FSDP, which only wrap `forward()` methods.
Regarding the tokenizer: with `add_special_tokens` set to `True` by default, `M2M100Tokenizer` adds the language token automatically. If the user specifies the token in the text without calling `set_src_lang_special_tokens`, there will be two language tokens in the output, e.g., `tensor([[256147, 256006, 718, 1159, ...]])` (256147 for English and 256006 for Afrikaans).
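The double-token failure mode can be illustrated with a toy prefix-token scheme. This is a minimal mock, not the real `M2M100Tokenizer` (the word ids below are taken from the example tensor above purely for illustration):

```python
# Toy illustration of how a duplicate language token arises when the tokenizer
# prepends a language prefix AND the user also embeds a language token in the
# text. Ids are illustrative; a real tokenizer resolves them from its vocabulary.

LANG_IDS = {"__en__": 256147, "__af__": 256006}
WORD_IDS = {"hello": 718, "world": 1159}

def tokenize(text, src_lang="__en__"):
    # The prefix token is what set_src_lang_special_tokens controls in HF;
    # here it is just a keyword argument defaulting to English.
    ids = [LANG_IDS[src_lang]]
    for tok in text.split():
        ids.append(LANG_IDS.get(tok, WORD_IDS.get(tok, 0)))
    return ids

# User embeds the Afrikaans token in the text without changing src_lang:
print(tokenize("__af__ hello world"))  # -> [256147, 256006, 718, 1159]
```

The output starts with two language tokens (English prefix plus the user's embedded Afrikaans token), mirroring the tensor shown above.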
When making an additional tokenizer, I tried to look at the problem from the end-user perspective. With the current `HFTokenizer`, if the user wants to use an NLLB-based model, they'd have to figure out a proper way to encode the inputs on their own. Having an alternative implementation with the `langs` parameter and a warning helps indicate what needs to be done to get proper tokens for the model.
Regarding `get_logits` and torch.compile or FSDP, I understand the limitation. The main usage of the method would be something like:
```python
from open_clip import create_model_and_transforms, get_tokenizer
from PIL import Image

model, _, transform = create_model_and_transforms(...)
tokenizer = get_tokenizer(...)

text_inputs = tokenizer(...)
image = Image.open(...)
image_inputs = transform(image)

logits_per_image, logits_per_text = model.get_logits(image_inputs, text_inputs)
```
An example of such usage can be found in the roboflow/supervision library.
@visheratin rather than make a whole new tokenizer for this with a different tokenize interface, couldn't we pass through the `set_src_lang` methods and/or src-lang init kwargs, and assert/report an error if it's called on an underlying HF tokenizer that doesn't have it?
Passing through the `set_src_lang` method is an option. But it will still be non-obvious to the user that they may need to use this method if they want languages other than English. Also, if we go this way, users with multilingual input have to implement themselves the logic that is now in the `NLLBTokenizer`.

Maybe multilingual input texts are too much of an edge case. If you think so, I can remove `NLLBTokenizer` from the PR. If the user wants to change the source language, they can always call `tokenizer.tokenizer.set_src_lang`, as is done in the CLIP benchmark.
Isn't this how the HF tokenizers work, though? I think they have some sort of specific tokenization method with src/target lang as args, but usually it's either set on construction of the tokenizer or via the set method, no?
In the case of `M2M100Tokenizer`, the src token is set as a prefix token, which doesn't require a specific tokenization method.

I removed the `NLLBTokenizer` and added logic for setting the language on init to the `HFTokenizer`.
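Setting the language at construction, with a loud failure for tokenizers that don't support it, might look roughly like this. This is an illustrative sketch, not the actual open_clip implementation; the class name and `language` parameter are placeholders:

```python
class HFTokenizerSketch:
    """Illustrative wrapper: set the source language once at construction
    and fail loudly if the wrapped HF tokenizer has no such method."""

    def __init__(self, hf_tokenizer, language=None):
        self.tokenizer = hf_tokenizer
        if language is not None:
            # Not all HF tokenizers are multilingual; report an error rather
            # than silently tokenizing with the wrong (default) language token.
            set_lang = getattr(self.tokenizer, "set_src_lang_special_tokens", None)
            if set_lang is None:
                raise ValueError(
                    "underlying tokenizer does not support setting a source language"
                )
            set_lang(language)

    def __call__(self, texts):
        return self.tokenizer(texts)
```

This keeps the `__call__` interface identical to the existing `HFTokenizer`, which is the point of rwightman's suggestion: no new tokenize signature, just an init-time kwarg.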
Hi!
I want to make OpenCLIP more usable for downstream applications, like zero-shot classification. Right now, to get the logits, the user has to call `encode_image` and `encode_text`, then matmul them, multiply the result by `logit_scale`, and optionally add `logit_bias`. I think it makes sense to have one method to get logits, as in OpenAI and HuggingFace. So I added the `get_logits` method to both the `CLIP` and `CustomTextCLIP` classes.

I also added the `NLLBTokenizer` class that has an additional `langs` parameter in the `__call__` method. This is needed because the tokenizer for NLLB models adds a language token to the beginning of the sequence. The token that is added is controlled via the `set_src_lang_special_tokens` method. If the language is not set via this method, the tokenizer will add the English token to all sequences.

The PR also contains some formatting changes performed by Ruff. For me personally, the formatted code looks nicer, but if you don't like it, I can roll it back.
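The manual steps described above (encode, matmul, scale, optional bias) can be sketched as a standalone helper. This is a rough equivalent for illustration, not the actual method added in the PR; note that open_clip stores `logit_scale` as a log value, hence the `.exp()`:

```python
import torch

def get_logits_sketch(model, image_inputs, text_inputs):
    # Rough equivalent of a get_logits method: encode both modalities,
    # matmul, scale, and optionally add a bias (as in SigLIP-style models).
    image_features = model.encode_image(image_inputs)
    text_features = model.encode_text(text_inputs)
    # open_clip keeps logit_scale in log space, so exponentiate before use.
    logits_per_image = model.logit_scale.exp() * image_features @ text_features.T
    if getattr(model, "logit_bias", None) is not None:
        logits_per_image = logits_per_image + model.logit_bias
    return logits_per_image, logits_per_image.T
```

Having this as a single model method spares every downstream caller from re-deriving the scale/bias handling.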
@gabrielilharco @rwightman