zucchini-nlp opened this issue 4 months ago
cc @molbap @NielsRogge , I added only models I see commonly to the list and all VLMs to unblock the pipeline
I can take CLIP and LLaVa
@davidgxue okay, feel free to open a PR when it's ready.
I would like to work on BLIP-2. Just to clarify we only need to change BLIP-2 right and not BLIP? Because there is a comment which mentions
# Copied from transformers.models.blip.processing_blip.BlipProcessor.__call__
@OmarManzoor my bad, forgot to add Blip to the list. You can work on Blip and all changes from BLIP will be ported to BLIP2 automatically :)
I'll add Blip to the list and assign to you then
@OmarManzoor @zucchini-nlp missed it, I already started work on a few models here. Please check the original PRs here https://github.com/huggingface/transformers/pull/31198 and here https://github.com/huggingface/transformers/pull/31368 , BLIP, BLIP-2, Donut and a couple more are already handled
Thank you for clarifying.
@leloykun this is one of the trackers we have for a start. There's another PR for standardizing VLMs from the generation perspective. And unfortunately other tasks will be blocked by these.
If you want to work on this task, or maybe on making a wrapper for VLMTokenizer, let me know!
thanks @zucchini-nlp!
I can take LayoutLM (1, 2, 3) & Chameleon
I can take owlv2 and vit. But for owlv2 there are multiple helper functions and classes that are copied from owlvit, so does that mean I need to work on owlvit?
# Copied from transformers.models.owlvit.processing_owlvit.OwlViTProcessor.__call__ with OWLViT->OWLv2
Can I take dino and Paligemma if no one's working on them?
@bhuvanmdev yes, if the owlv2 processing code is identical to owlvit's, it will simply be copied from it
@MnCSSJ4x sure
@zucchini-nlp I started working on Paligemma and tried to follow the PRs mentioned here. In paligemma there is no test file for the processor. Do I need to add those tests (if so, please point me to how I can do that), and can they be used directly to check that the changes are non-breaking? I can raise a temp PR so that we can discuss it there.
@MnCSSJ4x yes, feel free to open a PR so that we can discuss it there. And yes, in that case we need to add the test file and ProcessorTesterMixin to it, so that the new changes are all tested
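For reference, new processor test files typically follow the ProcessorTesterMixin pattern already used for other models. A minimal sketch (the checkpoint name and fixture details here are placeholders, not the actual PR code; the file would live under tests/models/paligemma/):

```python
# Minimal sketch of a processor test file using ProcessorTesterMixin.
# The checkpoint and setup details are assumptions and must match the real PR.
import tempfile
import unittest

from transformers import PaliGemmaProcessor
from transformers.testing_utils import require_vision

from ...test_processing_common import ProcessorTesterMixin


@require_vision
class PaliGemmaProcessorTest(ProcessorTesterMixin, unittest.TestCase):
    processor_class = PaliGemmaProcessor

    def setUp(self):
        # the mixin loads the processor it exercises from this directory
        self.tmpdirname = tempfile.mkdtemp()
        processor = PaliGemmaProcessor.from_pretrained("google/paligemma-3b-pt-224")
        processor.save_pretrained(self.tmpdirname)
```

The mixin then runs the shared kwargs tests (padding, image size overrides, structured kwargs, etc.) against the saved processor.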
Thanks, I have created PR #32377 and tagged you there. Please let me know how I can get started on testing it.
I can work on the remaining image-text-to-text models (Fuyu, Idefics/2, InstructBlip, Kosmos-2, LLaVa-NeXT) as I have already been working on their processors for https://github.com/huggingface/transformers/pull/32471 .
@yonigozlan Thanks, that would be great!
Also added Udop to the list and to #32544 .
Hey, I can take ClipSeg.
Hi @davidgxue and @MnCSSJ4x, I was wondering if you've made any recent progress on uniformizing LLaVa and Paligemma. They're some of the last image-text-to-text models left to uniformize, and it would be great to get them done to unblock the image-text-to-text pipeline. If not, I can help get them finished, as they should be similar to other image-text-to-text models that are being uniformized. Thanks!
Hey. I had written a draft PR for Paligemma. I was blocked by some curriculum work, so I wasn't able to resolve the comments or write the test suite. You can start LLaVa if no one else has taken it. Also, feel free to ping me in that PR, as I also need some help with writing tests and understanding the reference codebase.
@yonigozlan I am so sorry about the delay. I made some progress locally a while ago, but then completely forgot about this github issue. Will try to get it out today or tomorrow. Will keep you posted if anything changes.
Hey @yonigozlan actually, feel free to work on LLAVA. I realized I previously made progress on CLIP not LLAVA. And I am also getting a lot of work lately, so probably won't be able to help out on that one. Sorry about that! CLIP will come around tho!
Btw, DepthAnything, Dino, Maskformer, & VIT don't have processors
@zucchini-nlp here's for the rest of the processors: https://github.com/huggingface/transformers/pull/32845
@leloykun WOW, thanks a lot! I can review those on Monday, today will be a bit busy
Thanks too!
For now, the PR for Chameleon, https://github.com/huggingface/transformers/pull/32181, is the safest to merge as (1) the processor doesn't expect special args (e.g. text_pair and such) and (2) the PR already has tests.
The PRs for Nougat https://github.com/huggingface/transformers/pull/32841 and the LayoutLM models https://github.com/huggingface/transformers/pull/32180 need more thought as their processors expect special args (cc @yonigozlan, I think @molbap is right that our current implementation is kinda wonky), while the other PRs don't have tests yet.
No problem @davidgxue! I will get started on LLaVa then
Summary of my progress:
https://github.com/huggingface/transformers/pull/32181 & #32845 are now ready for review. I could also decouple InstructBlipVideo, LLaVa-NeXT-Video, Siglip, and VideoLLaVa to a separate PR just to get them out--just lemme know if there's a need to rush.
The rest have special args that we still have to figure out how to handle. cc @yonigozlan
Update: all of the PRs are now ready for review
Processors with weird output keys:
| Model | Weird key to expected key mapping |
|---|---|
| LayoutLMv2 | image -> pixel_values |
| LayoutXLM | image -> pixel_values |
| MGP | labels -> input_ids |
| Nougat | labels -> input_ids |
| TVP | pixel_values -> pixel_values_videos |
| VideoLlava | pixel_values_images -> pixel_values |
| X-Clip | pixel_values -> pixel_values_videos |
@molbap @zucchini-nlp what do you think of standardizing/uniformizing these too?
Models missing from this list:
- one whose processor takes task_inputs & segmentation_maps (has some weird logic with task_inputs)
- one whose processor takes segmentation_maps, input_points, input_labels, & input_boxes
- one with _in_target_context_manager logic (is this deprecated?)

The rest either don't have an image processor or only have a feature_extractor_class.
@molbap @yonigozlan I wanna raise this here so our implementations would be more consistent:
I've implemented a saner way to handle special processor call args for backwards compatibility that doesn't involve re-using unrelated args (e.g. audio & video) and doesn't need extra arguments like backwards_compatibility_placeholder_arg.
Tl;dr: I followed @molbap's advice to capture them using *args instead and added prepare_and_validate_optional_call_args to auto-convert them to kwargs, which we can then pass to _merge_kwargs along with the other arguments.
See my implementation here https://github.com/huggingface/transformers/pull/32841 and here https://github.com/huggingface/transformers/pull/32180
For everyone else: the "special arguments" here are arguments that carry data from user input but aren't named text, images, audio, or videos, nor are they configs for the tokenizer, image processor, etc. For example, the bounding boxes for LayoutLM* models, the visual prompt for ClipSeg, etc.
The problem with these args is that some users pass them as positional arguments to the processors. So, if we wanna restrict the processor call arguments to only those four and kwargs, then we're gonna have problems handling these special arguments.
Now, we only need to:
1. Define a ModelProcessorKwargs class (same as before)
2. Add optional_call_args = [...] as an attribute to the processor class
3. Add *args, to where the special arguments were in the call signature before (e.g. for LayoutLM* models, it's right after text and images)
4. Pass **self.prepare_and_validate_optional_call_args(*args), as an argument to self._merge_kwargs. E.g.:

```python
output_kwargs = self._merge_kwargs(
    CLIPSegProcessorKwargs,
    tokenizer_init_kwargs=self.tokenizer.init_kwargs,
    **kwargs,
    **self.prepare_and_validate_optional_call_args(*args),
)
```
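To make the mapping concrete, here is a standalone sketch (not the library's actual implementation; the boxes/word_labels names are just an example for a LayoutLM-style processor) of what prepare_and_validate_optional_call_args boils down to:

```python
import warnings

# hypothetical declaration on a LayoutLM-style processor class
optional_call_args = ["boxes", "word_labels"]


def prepare_and_validate_optional_call_args(*args):
    """Map leftover positional args onto their declared names so they can be merged as kwargs."""
    if len(args) > len(optional_call_args):
        raise ValueError(
            f"Expected at most {len(optional_call_args)} optional positional arguments, got {len(args)}."
        )
    if args:
        warnings.warn(
            "Passing these arguments positionally is deprecated; pass them as keyword arguments instead."
        )
    return dict(zip(optional_call_args, args))


# processor(images, text, boxes, word_labels) would then resolve the trailing positionals to:
print(prepare_and_validate_optional_call_args([[0, 0, 10, 10]], [1]))
# {'boxes': [[0, 0, 10, 10]], 'word_labels': [1]}
```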
Alternatively, I could move prepare_and_validate_optional_call_args into _merge_kwargs and perhaps rename it to _merge_args_and_kwargs. This way, the interface would be something like this instead:
```python
output_kwargs = self._merge_args_and_kwargs(
    CLIPSegProcessorKwargs,
    tokenizer_init_kwargs=self.tokenizer.init_kwargs,
    *args,
    **kwargs,
)
```
Lemme know what you think
That looks great to me @leloykun, thanks a lot for working on that! I personally like it as you've done it in https://github.com/huggingface/transformers/pull/32841 and https://github.com/huggingface/transformers/pull/32180 with self.prepare_and_validate_optional_call_args(*args), as it makes it easy to remove once this behavior is fully deprecated. For that reason I don't think we should replace _merge_kwargs with a _merge_args_and_kwargs.
If others agree, I will add this to Udop and other models I have been working on if they need them.
For me it looks like a good workaround, and it should handle BC correctly. I left comments in the corresponding PRs, and I agree about renaming _merge_kwargs.
I'll leave some general questions/comments here, as there are so many PRs currently and I might miss some of them. So, as part of standardization we have to:

1. Swap args to the image, text order in processors. I see confusion around why we're swapping these. Since I wasn't working very closely on standardization, my guess is that it's a common pattern in most processors and might help with pipelines. @yonigozlan, let us know if you have any ideas, as you've been working closely with @molbap.
2. Return BatchFeature and not BatchEncoding in the processor's call, as was done in some PRs already. I don't see any reason to keep BatchEncoding, so this shouldn't cause issues.
3. v4.47
@yonigozlan @molbap I don't see any good reason why we should swap the args. It just adds complications with backwards compatibility. And besides, we can just pass them as keyword arguments in the pipelines.
To me, swapping the args is part of the effort to standardize the processors. Right now, we have some VLMs where the processor takes images first and others where it takes text first. Even if we discourage using positional arguments for processor calls, I imagine most users will still use them for images and text. Having to guess which comes first depending on the model doesn't seem ideal.
That said, since the argument swapping isn't a blocker for the image-text-to-text pipeline, I suggest we remove it for now to get this merged as soon as possible. The backward compatibility is indeed tricky to manage, and the current checks feel a bit too hacky for a merge. We can open a separate PR later to focus on argument swapping if we decide it's necessary.
How does that sound, @zucchini-nlp, @leloykun?
@yonigozlan yup, that sounds good.
btw, @yonigozlan @zucchini-nlp @molbap what do you think of forcing the use of keyword arguments in future versions? I.e. having this signature?
```python
def __call__(
    self,
    *,
    text: ...,
    images: ...,
    audio: ...,
    videos: ...,
    **kwargs: Unpack[...],
) -> BatchFeature:
```
@leloykun This is something we could consider for e.g. v5 of transformers but I wouldn't enforce it at the moment as we're going through minor versions: this would break a lot of code for a lot of people.
Afaik the order is swapped only in some VLMs, and we want to follow the image, text order in the end. Indeed there are many BC handlings happening, but swapping positions doesn't seem very hard to handle. Also, it won't flood the console with warnings, because processors with swapped order usually have no other cases to handle.
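For illustration, here is a rough sketch (not code from any of the linked PRs; the helper name is hypothetical) of the kind of check such BC handling can use — swap and warn only when the positional args clearly look like the legacy (text, images) order:

```python
import warnings


def _maybe_fix_legacy_order(images, text):
    """Hypothetical helper: if the caller used the old (text, images) positional order, swap and warn."""

    def looks_like_text(x):
        return isinstance(x, str) or (
            isinstance(x, (list, tuple)) and len(x) > 0 and isinstance(x[0], str)
        )

    if looks_like_text(images) and not looks_like_text(text):
        warnings.warn(
            "You may have passed `text` and `images` in the legacy order; swapping them. "
            "Please pass `images` first or use keyword arguments."
        )
        images, text = text, images
    return images, text


# e.g. processor("a prompt", image) would be corrected to (image, "a prompt")
```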
So, I am for changing order of args as part of standardization, and address comments if there are any to make the checks more reliable. If you are more comfortable with having a separate PR, I'm okay with separating out VLMs from current PR and opening a new one.
What I suggest for faster iteration is to first review and merge one model, for ex, LLaVa has a PR for itself now. After everyone is happy with it, we can merge and copy changes in other models. Same for handling "special positional args", I guess @leloykun had a PR with one model only.
@zucchini-nlp Sounds good! I will probably do a separate PR for each model that needs the argument switch, just because it adds a lot of noisy changes in tests, docstrings etc. I will also work on finding a better way to support BC and implement it first in the LLaVa PR. Will ping you when that's done :)
Also started uniformization of kwargs of processors of audio-text models here: https://github.com/huggingface/transformers/pull/32906
It's still a draft on top of https://github.com/huggingface/transformers/pull/32845 but SpeechT5 & Wav2Vec2 Bert should be done now
hmmm... now that I think about it... why is it audio instead of audios?
@leloykun I chose audio instead of audios because the second one, despite being a valid English word, is barely used. videos is valid because it is a shortcut for videotapes. So videos is the countable form, but there is not much sense in having a countable form for audio, so the uncountable plural form audio is kept.
Feature request
We want to standardize the logic flow through Processor classes. Since processors can have different kwargs depending on the model and modality, we are adding a TypedDict for each modality to keep track of which kwargs are accepted. The initial design is merged, and an example model is modified to follow the new uniform processor kwargs in https://github.com/huggingface/transformers/pull/31198. Also, https://github.com/huggingface/transformers/pull/31197 has two more examples with the standardized API.
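As a rough illustration of the idea (a simplified, standalone sketch, not the exact classes; the real definitions live in the linked PRs and processing_utils.py, and the kwargs shown here are just examples):

```python
from typing import Optional, TypedDict, Union

from typing_extensions import Unpack  # typing.Unpack on Python 3.12+


class TextKwargs(TypedDict, total=False):
    padding: Union[bool, str]
    max_length: Optional[int]


class ImagesKwargs(TypedDict, total=False):
    do_resize: Optional[bool]
    size: Optional[dict]


class ProcessingKwargs(TypedDict, total=False):
    text_kwargs: TextKwargs
    images_kwargs: ImagesKwargs


class MyModelProcessorKwargs(ProcessingKwargs, total=False):
    # each model declares its own extra kwargs / defaults on top of the shared ones
    pass


class MyModelProcessor:
    def __call__(self, images=None, text=None, audio=None, videos=None, **kwargs: Unpack[MyModelProcessorKwargs]):
        # kwargs are validated and merged per modality, then dispatched to the
        # image processor and tokenizer
        ...
```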
This design has to be shipped to all the processors in Transformers, and we appreciate contributions. Below is an incomplete list of models that need standardization; feel free to add a model if it's missing:
Note: For now we'll start with image or image+text; https://github.com/huggingface/transformers/pull/31368 is an ongoing PR that also includes audio processor standardization
Motivation
.
Your contribution
.