zucchini-nlp opened this issue 4 months ago
cc @molbap @NielsRogge , I added only models I see commonly to the list and all VLMs to unblock the pipeline
I can take CLIP and LLaVa
@davidgxue okay, feel free to open a PR when it's ready.
I would like to work on BLIP-2. Just to clarify we only need to change BLIP-2 right and not BLIP? Because there is a comment which mentions
# Copied from transformers.models.blip.processing_blip.BlipProcessor.__call__
@OmarManzoor my bad, forgot to add Blip to the list. You can work on Blip and all changes from BLIP will be ported to BLIP2 automatically :)
I'll add Blip to the list and assign to you then
@OmarManzoor @zucchini-nlp missed it, I already started work on a few models here. Please check the original PRs here https://github.com/huggingface/transformers/pull/31198 and here https://github.com/huggingface/transformers/pull/31368 , BLIP, BLIP-2, Donut and a couple more are already handled
Thank you for clarifying.
@leloykun this is one of the trackers we have for a start. There's another PR for standardizing VLMs from the generation perspective. And unfortunately other tasks will be blocked by these.
If you want to work on this task, or maybe on making a wrapper for VLMTokenizer, let me know!
thanks @zucchini-nlp!
I can take LayoutLM (1, 2, 3) & Chameleon
I can take owlv2 and vit. But for owlv2 there are multiple helper functions and classes that are copied from owlvit, so does that mean I need to work on owlvit?
# Copied from transformers.models.owlvit.processing_owlvit.OwlViTProcessor.__call__ with OWLViT->OWLv2
Can I take dino and Paligemma if no one's working on them?
@bhuvanmdev yes, if the owlv2 processing code is identical to owlvit's, it will simply be copied from it
@MnCSSJ4x sure
@zucchini-nlp I started working on Paligemma and tried to follow the PRs mentioned here. In paligemma there is no test file for the processor. Do I need to add those tests (if so, please point me to how I can do that), and can they be used directly to check that the changes are non-breaking? I can raise a temp PR so that we can discuss it there.
@MnCSSJ4x yes, feel free to open a PR so that we can discuss it there. And yes, in that case we need to add the test file and ProcessorTesterMixin to it, so that the new changes are all tested
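For reference, new processor test files typically follow the ProcessorTesterMixin pattern already used for other models. A minimal sketch (the checkpoint name and fixture details here are placeholders, not the actual PR code; the file would live under tests/models/paligemma/):

```python
# Minimal sketch of a processor test file using ProcessorTesterMixin.
# The checkpoint and setup details are assumptions and must match the real PR.
import tempfile
import unittest

from transformers import PaliGemmaProcessor
from transformers.testing_utils import require_vision

from ...test_processing_common import ProcessorTesterMixin


@require_vision
class PaliGemmaProcessorTest(ProcessorTesterMixin, unittest.TestCase):
    processor_class = PaliGemmaProcessor

    def setUp(self):
        # the mixin loads the processor it exercises from this directory
        self.tmpdirname = tempfile.mkdtemp()
        processor = PaliGemmaProcessor.from_pretrained("google/paligemma-3b-pt-224")
        processor.save_pretrained(self.tmpdirname)
```

The mixin then runs the shared kwargs tests (padding, image size overrides, structured kwargs, etc.) against the saved processor.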
Thanks, I have created PR #32377 and tagged you there. Please let me know how I can get started on testing it.
I can work on the remaining image-text-to-text models (Fuyu, Idefics/2, InstructBlip, Kosmos-2, LLaVa-NeXT) as I have already been working on their processors for https://github.com/huggingface/transformers/pull/32471 .
@yonigozlan Thanks, that would be great!
Also added Udop to the list and to #32544 .
Hey, I can take ClipSeg.
Hi @davidgxue and @MnCSSJ4x, I was wondering if you've made any recent progress on uniformizing LLaVa and Paligemma. They're some of the last image-text-to-text models left to uniformize, and it would be great to get them done to unblock the image-text-to-text pipeline. If not, I can help get them finished, as they should be similar to other image-text-to-text models that are being uniformized. Thanks!
Hey. I had written a draft PR for Paligemma. I was blocked by some curriculum work, so I wasn't able to resolve the comments or write the test suite. You can start LLaVa if no one else has taken it. Also, feel free to ping me in that PR, as I also need some help with writing tests and understanding the reference codebase.
@yonigozlan I am so sorry about the delay. I made some progress locally a while ago, but then completely forgot about this github issue. Will try to get it out today or tomorrow. Will keep you posted if anything changes.
Hey @yonigozlan actually, feel free to work on LLAVA. I realized I previously made progress on CLIP not LLAVA. And I am also getting a lot of work lately, so probably won't be able to help out on that one. Sorry about that! CLIP will come around tho!
Btw, DepthAnything, Dino, Maskformer, & VIT don't have processors
@zucchini-nlp here's for the rest of the processors: https://github.com/huggingface/transformers/pull/32845
@leloykun WOW, thanks a lot! I can review those on Monday, today will be a bit busy
Thanks too!
For now, the PR for Chameleon, https://github.com/huggingface/transformers/pull/32181, is the safest to merge as (1) the processor doesn't expect special args (e.g. text_pair and such) and (2) the PR already has tests.
The PRs for Nougat https://github.com/huggingface/transformers/pull/32841 and the LayoutLM models https://github.com/huggingface/transformers/pull/32180 need more thought as their processors expect special args (cc @yonigozlan, I think @molbap is right that our current implementation is kinda wonky), while the other PRs don't have tests yet.
No problem @davidgxue! I will get started on LLaVa then
Summary of my progress:
https://github.com/huggingface/transformers/pull/32181 & #32845 are now ready for review. I could also decouple InstructBlipVideo, LLaVa-NeXT-Video, Siglip, and VideoLLaVa to a separate PR just to get them out--just lemme know if there's a need to rush.
The rest have special args that we still have to figure out how to handle. cc @yonigozlan
Update: all of the PRs are now ready for review
Processors with weird output keys:
| Model | Weird key to expected key mapping |
|---|---|
| LayoutLMv2 | image -> pixel_values |
| LayoutXLM | image -> pixel_values |
| MGP | labels -> input_ids |
| Nougat | labels -> input_ids |
| TVP | pixel_values -> pixel_values_videos |
| VideoLlava | pixel_values_images -> pixel_values |
| X-Clip | pixel_values -> pixel_values_videos |
@molbap @zucchini-nlp what do you think of standardizing/uniformizing these too?
Models missing from this list:
- one whose processor takes task_inputs & segmentation_maps (has some weird logic with task_inputs)
- one whose processor takes segmentation_maps, input_points, input_labels, & input_boxes
- one with _in_target_context_manager logic (is this deprecated?)

The rest either don't have an image processor or only have a feature_extractor_class.
@molbap @yonigozlan I wanna raise this here so our implementations would be more consistent:
I've implemented a saner way to handle special processor call args for backwards compatibility that doesn't involve re-using unrelated args (e.g. audio & video) and doesn't need extra arguments like backwards_compatibility_placeholder_arg.
Tl;dr: I followed @molbap's advice to capture them using *args instead and added prepare_and_validate_optional_call_args to auto-convert them to kwargs, which we can then pass to _merge_kwargs along with the other arguments.
See my implementation here https://github.com/huggingface/transformers/pull/32841 and here https://github.com/huggingface/transformers/pull/32180
For everyone else: the "special arguments" here are arguments that carry data from user input but aren't named text, images, audio, or videos, nor are they configs for the tokenizer, image processor, etc. For example, the bounding boxes for LayoutLM* models, the visual prompt for ClipSeg, etc.
The problem with these args is that some users pass them as positional arguments to the processors. So, if we wanna restrict the processor call arguments to only those four and kwargs, then we're gonna have problems handling these special arguments.
Now, we only need to:
1. Define a ModelProcessorKwargs class (same as before)
2. Add optional_call_args = [...] as an attribute to the processor class
3. Add *args, to where the special arguments were in the call signature before (e.g. for LayoutLM* models, it's right after text and images)
4. Pass **self.prepare_and_validate_optional_call_args(*args), as an argument to self._merge_kwargs. E.g.:

```python
output_kwargs = self._merge_kwargs(
    CLIPSegProcessorKwargs,
    tokenizer_init_kwargs=self.tokenizer.init_kwargs,
    **kwargs,
    **self.prepare_and_validate_optional_call_args(*args),
)
```
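To make the mapping concrete, here is a standalone sketch (not the library's actual implementation; the boxes/word_labels names are just an example for a LayoutLM-style processor) of what prepare_and_validate_optional_call_args boils down to:

```python
import warnings

# hypothetical declaration on a LayoutLM-style processor class
optional_call_args = ["boxes", "word_labels"]


def prepare_and_validate_optional_call_args(*args):
    """Map leftover positional args onto their declared names so they can be merged as kwargs."""
    if len(args) > len(optional_call_args):
        raise ValueError(
            f"Expected at most {len(optional_call_args)} optional positional arguments, got {len(args)}."
        )
    if args:
        warnings.warn(
            "Passing these arguments positionally is deprecated; pass them as keyword arguments instead."
        )
    return dict(zip(optional_call_args, args))


# processor(images, text, boxes, word_labels) would then resolve the trailing positionals to:
print(prepare_and_validate_optional_call_args([[0, 0, 10, 10]], [1]))
# {'boxes': [[0, 0, 10, 10]], 'word_labels': [1]}
```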
Alternatively, I could move prepare_and_validate_optional_call_args into _merge_kwargs and perhaps rename it to _merge_args_and_kwargs. This way, the interface would be something like this instead:
```python
output_kwargs = self._merge_args_and_kwargs(
    CLIPSegProcessorKwargs,
    tokenizer_init_kwargs=self.tokenizer.init_kwargs,
    *args,
    **kwargs,
)
```
Lemme know what you think
That looks great to me @leloykun, thanks a lot for working on that! I personally like it as you've done it in https://github.com/huggingface/transformers/pull/32841 and https://github.com/huggingface/transformers/pull/32180 with self.prepare_and_validate_optional_call_args(*args), as it makes it easy to remove once this behavior is fully deprecated. For that reason I don't think we should replace _merge_kwargs with a _merge_args_and_kwargs.
If others agree, I will add this to Udop and other models I have been working on if they need them.
For me it looks like a good workaround, and it should handle BC correctly. I left comments in the corresponding PRs, and I agree about renaming _merge_kwargs.
I'll leave some general questions/comments here, as there are so many PRs currently and I might miss some of them. So, as part of standardization we have to:

1. Swap args to the image, text order in processors. I see confusion around why we're swapping these. Since I wasn't working very closely on standardization, my guess is that it's a common pattern in most processors and might help with pipelines. @yonigozlan, let us know if you have any ideas, as you've been working closely with @molbap.
2. Return BatchFeature and not BatchEncoding in the processor's call, as was done in some PRs already. I don't see any reason to keep BatchEncoding, so this shouldn't cause issues.
3. v4.47
@yonigozlan @molbap I don't see any good reason why we should swap the args. It just adds complications with backwards compatibility. And besides, we can just pass them as keyword arguments in the pipelines.
To me, swapping the args is part of the effort to standardize the processors. Right now, we have some VLMs where the processor takes images first and others where it takes text first. Even if we discourage using positional arguments for processor calls, I imagine most users will still use them for images and text. Having to guess which comes first depending on the model doesn't seem ideal.
That said, since the argument swapping isn't a blocker for the image-text-to-text pipeline, I suggest we remove it for now to get this merged as soon as possible. The backward compatibility is indeed tricky to manage, and the current checks feel a bit too hacky for a merge. We can open a separate PR later to focus on argument swapping if we decide it's necessary.
How does that sound, @zucchini-nlp, @leloykun?
@yonigozlan yup, that sounds good.
btw, @yonigozlan @zucchini-nlp @molbap what do you think of forcing the use of keyword arguments in future versions? I.e. having this signature?
```python
def __call__(
    self,
    *,
    text: ...,
    images: ...,
    audio: ...,
    videos: ...,
    **kwargs: Unpack[...],
) -> BatchFeature:
```
@leloykun This is something we could consider for e.g. v5 of transformers but I wouldn't enforce it at the moment as we're going through minor versions: this would break a lot of code for a lot of people.
Afaik the order is swapped only in some VLMs, and we want to follow the image, text order in the end. Indeed there are many BC handlings happening, but swapping positions doesn't seem very hard to handle. Also, it won't flood the console with warnings, because processors with swapped order usually have no other cases to handle.
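For illustration, here is a rough sketch (not code from any of the linked PRs; the helper name is hypothetical) of the kind of check such BC handling can use — swap and warn only when the positional args clearly look like the legacy (text, images) order:

```python
import warnings


def _maybe_fix_legacy_order(images, text):
    """Hypothetical helper: if the caller used the old (text, images) positional order, swap and warn."""

    def looks_like_text(x):
        return isinstance(x, str) or (
            isinstance(x, (list, tuple)) and len(x) > 0 and isinstance(x[0], str)
        )

    if looks_like_text(images) and not looks_like_text(text):
        warnings.warn(
            "You may have passed `text` and `images` in the legacy order; swapping them. "
            "Please pass `images` first or use keyword arguments."
        )
        images, text = text, images
    return images, text


# e.g. processor("a prompt", image) would be corrected to (image, "a prompt")
```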
So, I am for changing order of args as part of standardization, and address comments if there are any to make the checks more reliable. If you are more comfortable with having a separate PR, I'm okay with separating out VLMs from current PR and opening a new one.
What I suggest for faster iteration is to first review and merge one model, for ex, LLaVa has a PR for itself now. After everyone is happy with it, we can merge and copy changes in other models. Same for handling "special positional args", I guess @leloykun had a PR with one model only.
@zucchini-nlp Sounds good! I will probably do a separate PR for each model that needs the argument switch, just because it adds a lot of noisy changes in tests, docstrings etc. I will also work on finding a better way to support BC and implement it first in the LLaVa PR. Will ping you when that's done :)
Also started uniformization of kwargs of processors of audio-text models here: https://github.com/huggingface/transformers/pull/32906
It's still a draft on top of https://github.com/huggingface/transformers/pull/32845 but SpeechT5 & Wav2Vec2 Bert should be done now
hmmm... now that I think about it... why is it audio instead of audios?
@leloykun I chose audio instead of audios because the second one, despite being a valid English word, is barely used. videos is valid because it is a shortcut for videotapes. So videos is the countable form, but there is not much sense in having a countable form for audio, so the uncountable plural form audio is kept.
Feature request
We want to standardize the logic flow through Processor classes. Since processors can have different kwargs depending on the model and modality, we are adding a TypedDict for each modality to keep track of which kwargs are accepted. The initial design is merged, and an example model is modified to follow the new uniform processor kwargs in https://github.com/huggingface/transformers/pull/31198. Also, https://github.com/huggingface/transformers/pull/31197 has two more examples with the standardized API.
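As a rough illustration of the idea (a simplified, standalone sketch, not the exact classes; the real definitions live in the linked PRs and processing_utils.py, and the kwargs shown here are just examples):

```python
from typing import Optional, TypedDict, Union

from typing_extensions import Unpack  # typing.Unpack on Python 3.12+


class TextKwargs(TypedDict, total=False):
    padding: Union[bool, str]
    max_length: Optional[int]


class ImagesKwargs(TypedDict, total=False):
    do_resize: Optional[bool]
    size: Optional[dict]


class ProcessingKwargs(TypedDict, total=False):
    text_kwargs: TextKwargs
    images_kwargs: ImagesKwargs


class MyModelProcessorKwargs(ProcessingKwargs, total=False):
    # each model declares its own extra kwargs / defaults on top of the shared ones
    pass


class MyModelProcessor:
    def __call__(self, images=None, text=None, audio=None, videos=None, **kwargs: Unpack[MyModelProcessorKwargs]):
        # kwargs are validated and merged per modality, then dispatched to the
        # image processor and tokenizer
        ...
```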
This design has to be shipped to all the processors in Transformers, and we appreciate contributions. Below is an incomplete list of models that need standardization; feel free to add a model if it's missing:
Note: For now we'll start with image or image+text; https://github.com/huggingface/transformers/pull/31368 is an ongoing PR that also includes audio processor standardization
Motivation
.
Your contribution
.