huggingface / transformers

đŸ¤— Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0
132.23k stars 26.34k forks

Saving and loading `LlavaOnevisionProcessor` results in unexpected behavior #33484

Closed yonigozlan closed 8 hours ago

yonigozlan commented 5 days ago

System Info

Who can help?

@zucchini-nlp @amyeroberts

Reproduction

If we save an instance of LlavaOnevisionProcessor such as:

import tempfile

from transformers import (
    AutoImageProcessor,
    LlavaOnevisionImageProcessor,
    LlavaOnevisionProcessor,
    LlavaOnevisionVideoProcessor,
    Qwen2TokenizerFast,
)

tmpdirname = tempfile.mkdtemp()

image_processor = LlavaOnevisionImageProcessor()
video_processor = LlavaOnevisionVideoProcessor()
tokenizer = Qwen2TokenizerFast.from_pretrained("Qwen/Qwen2-0.5B-Instruct")

processor = LlavaOnevisionProcessor(
    video_processor=video_processor, image_processor=image_processor, tokenizer=tokenizer
)
processor.save_pretrained(tmpdirname)

and then try to load it with AutoImageProcessor:

>>> image_processor = AutoImageProcessor.from_pretrained(tmpdirname)
>>> print(image_processor.__class__.__name__)
LlavaOnevisionVideoProcessor

Variant of the problem:

tmpdirname = tempfile.mkdtemp()
image_processor = LlavaOnevisionImageProcessor(rescale_factor=10)
video_processor = LlavaOnevisionVideoProcessor(rescale_factor=5)
tokenizer = Qwen2TokenizerFast.from_pretrained("Qwen/Qwen2-0.5B-Instruct")

processor = LlavaOnevisionProcessor(
    video_processor=video_processor, image_processor=image_processor, tokenizer=tokenizer
)
processor.save_pretrained(tmpdirname)
>>> image_processor = LlavaOnevisionImageProcessor.from_pretrained(tmpdirname)
>>> print(image_processor.rescale_factor)
5

Expected behavior

Expected: no ambiguity on whether we are loading the image_processor or the video_processor config/class.

It looks like both image_processor and video_processor configs are saved to the same preprocessor_config.json file as they both inherit from BaseImageProcessor, and thus overwrite each other when saved.
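The clobbering can be sketched with plain JSON files. The file name `preprocessor_config.json` is the real one; the two config dicts below are illustrative stand-ins for what the two processors serialize:

```python
import json
import tempfile
from pathlib import Path

# Illustrative stand-ins for the two sub-processor configs; both classes
# write to the same preprocessor_config.json, so the second save wins.
image_config = {"processor_class": "LlavaOnevisionImageProcessor", "rescale_factor": 10}
video_config = {"processor_class": "LlavaOnevisionVideoProcessor", "rescale_factor": 5}

tmpdir = Path(tempfile.mkdtemp())
config_file = tmpdir / "preprocessor_config.json"

# save_pretrained() effectively does this for each sub-processor in turn:
config_file.write_text(json.dumps(image_config))
config_file.write_text(json.dumps(video_config))  # overwrites the image config

reloaded = json.loads(config_file.read_text())
print(reloaded["processor_class"])  # LlavaOnevisionVideoProcessor
print(reloaded["rescale_factor"])   # 5
```

Whichever sub-processor is saved last owns the file, which explains both the wrong class from `AutoImageProcessor` and the wrong `rescale_factor` above.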

zucchini-nlp commented 3 days ago

@yonigozlan yes, indeed, totally forgot about this quirk when we have both image and video processors and they don't share the same set of config attributes. We can either swap the order, as is currently done in llava-next-video, or add unused config attributes to the video processor so that it gets saved and loaded. Since we want to be consistent across the repo and the current order is the result of a PR comment, I'd say adding unused attrs in the config is a better solution, @amyeroberts wdyt?

Not sure if we want to support different attrs for video and image currently though, and I don't see a workaround for that. But I also opened a feature request at #33504, which I've been thinking about for a while and which it seems we can benefit from now.

amyeroberts commented 3 days ago

Thanks for flagging @yonigozlan!

This is tricky and quite a broad processors issue which I expect to become a more common problem as multi-modality becomes increasingly popular.

OK, so the way this has been solved in other processors which have more than one processing class of the same type is to do something pretty hacky, in which one of the classes' configs is saved out to a separate subfolder. For example, in InstructBLIP we have a tokenizer and qformer_tokenizer, which we then have to handle in custom save_pretrained and from_pretrained. This obviously doesn't scale as a solution; however, something like this is what I'd suggest as a quick fix for LlavaOneVision, as both the image processor and video processor should have their own configs. Config files should be independent, and the resulting loaded class should be fully loadable from a single config file, separate from any other classes it may be bundled with in the processor and regardless of its order in the processor.
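The subfolder workaround can be sketched without transformers at all. `DummyConfig`, `save_processor`, and `load_processor` below are illustrative stand-ins, not the real `ProcessorMixin` API; the point is only that the second same-type component writes into its own subfolder instead of clobbering the shared top-level file:

```python
import json
import tempfile
from pathlib import Path

class DummyConfig:
    """Illustrative stand-in for an image/video processor that serializes itself."""

    def __init__(self, **attrs):
        self.attrs = attrs

    def save(self, directory: Path) -> None:
        (directory / "preprocessor_config.json").write_text(json.dumps(self.attrs))

    @classmethod
    def load(cls, directory: Path) -> "DummyConfig":
        return cls(**json.loads((directory / "preprocessor_config.json").read_text()))

def save_processor(save_dir, image_processor, video_processor):
    save_dir = Path(save_dir)
    # The second same-type component gets its own subfolder, like qformer_tokenizer.
    video_dir = save_dir / "video_processor"
    video_dir.mkdir(parents=True, exist_ok=True)
    video_processor.save(video_dir)
    image_processor.save(save_dir)  # top level, as before

def load_processor(save_dir):
    save_dir = Path(save_dir)
    return DummyConfig.load(save_dir), DummyConfig.load(save_dir / "video_processor")

tmpdir = tempfile.mkdtemp()
save_processor(tmpdir, DummyConfig(rescale_factor=10), DummyConfig(rescale_factor=5))
image, video = load_processor(tmpdir)
print(image.attrs["rescale_factor"], video.attrs["rescale_factor"])  # 10 5
```

Because each component now reads and writes its own file, the `rescale_factor` round-trips correctly for both.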

I'd say adding unused attr in config is a better solution, @amyeroberts wdyt?

Containing unused arguments in a config file is both confusing and can lead to future issues if any arguments are added or modified in the future. It creates a dependency between the two processing classes which we should try to avoid. I also don't think it solves the issue of shared attributes between the classes e.g. would this work for Yoni's example where we want to set rescale_factor to different values for the two classes?

Regarding #33504 -- having a separate VideoProcessor I think is a good idea, however I don't think it will fully address the two issues we currently have as:

A few possible ideas:

  1. Always save processing classes' files out to subfolders related to their name attribute e.g. llava one vision checkpoints would have three subfolders called "image_processor", "tokenizer", "video_processor" respectively
  2. If more than one class of the same type is contained in the processor, then save those processors to subfolders. If done, I'd prefer that this was done for all of the classes. For example, in instructblip, there's no good reason one tokenizer's files should be at the top level and the other's in a subfolder.
  3. Deprecate preprocessor_config.json for config files which use the processing classes' name i.e. use "feature_extractor_config.json" instead. This will solve this issue but not the instructblip issue. I'd like this to happen anyway, as it's clearer what the file will contain when exploring the hub.
  4. Enable conditional naming of config files, depending on the processor's attributes, e.g. for instructblip all of the files for the qformer tokenizer would have a `qformer` prefix. This is probably quite easy to bundle with 3. It would involve extracting the prefix and conditionally applying the naming to the files. Bonus: there are no subfolders.
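Ideas 3 and 4 boil down to deriving the config file name from the processor attribute name. `config_filename` below is a hypothetical helper, not an existing transformers function, sketched to show how little conditional logic this needs:

```python
import tempfile
from pathlib import Path

def config_filename(attribute_name: str) -> str:
    """Map a processor attribute name to a class-specific config file name,
    e.g. "image_processor" -> "image_processor_config.json" and
    "qformer_tokenizer" -> "qformer_tokenizer_config.json"."""
    return f"{attribute_name}_config.json"

# Each component of a hypothetical multi-modal processor saves to its own file,
# so no two components can overwrite each other.
tmpdir = Path(tempfile.mkdtemp())
for attr in ("image_processor", "video_processor", "qformer_tokenizer"):
    (tmpdir / config_filename(attr)).write_text("{}")

print(sorted(p.name for p in tmpdir.glob("*_config.json")))
```

The attribute-prefixed naming gives every class a distinct file at the top level, without subfolders.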

My preference would be for 1. + 3. as it involves the least amount of conditional logic and handling --> easier to test and more forwards compatible, but v. open to rebuttals and alternative suggestions.

A few questions to be resolved:

zucchini-nlp commented 2 days ago

@amyeroberts

For the general discussion on separating out modality-specific configs and classes

the image processor and videoprocessor should have their own configs [and then] Feature extractors, image processors (and possibly the video processors) all save to preprocessor_config.json

Agreed here, that is what I want to achieve with the linked issue in the long term: each in its own config file. I see that for now we have only one preprocessing config per model, but I believe soon we'll have some image+video+audio models and will need to save them separately. We can start by sorting out the video modality, because we already have several such models added to transformers.

I like the idea of enabling conditional naming of config files with prefixes, but I'm not a fan of separate folders, as it seems each folder would end up holding a single config file in the most common cases. So a video_processor folder would hold video_preprocessing_config.json and so on. Cmiiw.

Backwards compatibility -- how to make sure old loading still works? Falling back to loading preprocessor_config.json is probably fine. If we have subfolders, do these take precedence?

In the case of prefixing each config file, we don't have to support much BC: we give priority to the prefixed configs if they exist, otherwise fall back to the default one without a prefix, maybe emitting a deprecation/warning/info message for users.
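That fallback order can be sketched in a few lines. `resolve_config` is an illustrative helper, not a real transformers function; the file names follow the prefixing scheme discussed above:

```python
import json
import logging
import tempfile
from pathlib import Path

logger = logging.getLogger(__name__)

def resolve_config(directory, prefixed_name: str,
                   legacy_name: str = "preprocessor_config.json") -> dict:
    """Prefer the new prefixed config file; fall back to the legacy one with an info log."""
    directory = Path(directory)
    prefixed = directory / prefixed_name
    if prefixed.is_file():
        return json.loads(prefixed.read_text())
    legacy = directory / legacy_name
    if legacy.is_file():
        logger.info("No %s found; falling back to legacy %s.", prefixed_name, legacy_name)
        return json.loads(legacy.read_text())
    raise FileNotFoundError(f"Neither {prefixed_name} nor {legacy_name} found in {directory}")

tmpdir = Path(tempfile.mkdtemp())
(tmpdir / "preprocessor_config.json").write_text(json.dumps({"source": "legacy"}))
print(resolve_config(tmpdir, "video_processor_config.json")["source"])  # legacy

(tmpdir / "video_processor_config.json").write_text(json.dumps({"source": "prefixed"}))
print(resolve_config(tmpdir, "video_processor_config.json")["source"])  # prefixed
```

Old checkpoints keep loading through the legacy path, while re-saved checkpoints pick up the prefixed file automatically.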

If a processing class has its config saved out to a different file name, e.g. qformer_tokenizer_config.json, how do we know how to load it? What happens if a processor is adapted to add a new processing class, e.g. it originally had one tokenizer and another one is added in a later release?

I'm not very sure having two tokenizers is a common thing, and the qformer seems more of an exception than a rule. Maybe we can leave instructblip as it is and separate out modality-specific configs only, one file per modality, the same way there's one python file per modality with its own processor class.

If we have subfolders for e.g. an image processor and tokenizer, how do we know from the auto classes where to look to load the right files? How should e.g. AutoImageProcessor behave when loading from a checkpoint with two image processors? Maybe we can just raise an error and say a subfolder has to be specified (simple + no magic)?

I can't say much about audio models; from what I see they are in the FeatureExtractor mapping, and I believe the main idea was to use feature extractors for audio and deprecate them for image models? And my idea is to add a new mapping, AutoVideoProcessor, since video processors will inherit from their own class, overriding some methods and keeping others from image processing. And when loading with auto, users would call AutoFeatureExtractor.from_pretrained(), AutoVideoProcessor.from_pretrained() or AutoImageProcessor.from_pretrained(). We'll check for the new configs for each, and if not found, load from the general config without a prefix, as above.
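The proposed per-modality auto classes amount to one mapping per modality. The mappings below are illustrative stand-ins for the real transformers auto machinery (which resolves classes from a config's model type), not its actual data structures:

```python
# Hypothetical per-modality mappings, keyed by model type.
IMAGE_PROCESSOR_MAPPING = {"llava_onevision": "LlavaOnevisionImageProcessor"}
VIDEO_PROCESSOR_MAPPING = {"llava_onevision": "LlavaOnevisionVideoProcessor"}

def auto_resolve(model_type: str, modality: str) -> str:
    """Pick the processor class name for a model type from the modality's own mapping."""
    mapping = {
        "image": IMAGE_PROCESSOR_MAPPING,
        "video": VIDEO_PROCESSOR_MAPPING,
    }[modality]
    return mapping[model_type]

print(auto_resolve("llava_onevision", "video"))  # LlavaOnevisionVideoProcessor
print(auto_resolve("llava_onevision", "image"))  # LlavaOnevisionImageProcessor
```

With separate mappings, `AutoImageProcessor` and a future `AutoVideoProcessor` can never return each other's class, regardless of save order.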

Deprecate preprocessor_config.json for config files which use the processing classes' name i.e. use "feature_extractor_config.json" instead.

Didn't really get this part. You mean we'd end up with only feature_extractor_config for all cases? Isn't the feature extractor under deprecation until v5.0 in some models like CLIP?

For the current LLaVA-OneVision case, as it is already buggy and more important to fix

Containing unused arguments in a config file is both confusing and can lead to future issues if any arguments are added or modified in the future.

Unfortunately yes, but imo the other solution of adding a new config file for the video processor is going to be a long process, so this can be a quick workaround to make the processor loadable after saving.

amyeroberts commented 1 day ago

I like the idea to enable conditional naming of config files with prefixes but not a fan of separate folders as it seems like each folder will end up with one config file, in most common cases. So a video_processor folder can hold video_preprocessing_config.json and so on. Cmiiw.

Nope - you're right. Happy to go with that.

I can't say much about audio models, from what I see they are in FeatureProcessor mapping and I believe the main idea was to use feature processor for audios and deprecate it for image models?

Yep.

Didn't really get this part. You mean end up with only feature_extractor_config for all cases? Isn't feature extractor under deprecation until v5.0 in some models like CLIP?

Sorry, I wasn't clear here. I meant that a feature extractor for an audio model would just use feature_extractor_config.json. Any vision models which previously had a feature extractor class shouldn't have this config file added.

We might end up in the situation where there are two config files for a class. For example, I load an image processor using an existing preprocessor_config.json file, and then save it out again and push to the hub. There would now be preprocessor_config.json and image_processor_config.json. In which case, we'll have to think about how to handle this (maybe just emit a logging.info that preprocessor_config.json was detected and load from the more recent file; I don't think we want a warning).

For the current LLaVA-OneVision case, as it is already buggy and more important to fix

Can't we do something like in instructblip where we override from_pretrained and save_pretrained to load a different file or from a subfolder? I agree we want to fix this as soon as possible, but I'm not comfortable with the dual-config solution which relies on some arguments being used.

zucchini-nlp commented 1 day ago

There would now be preprocessor_config.json and image_processor_config.json. In which case, we'll have to think about how to handle this (maybe just emit a logging.info that preprocessor_config.json was detected and load from the more recent file; I don't think we want a warning).

Ah, I see now. Yes, agreed that we should give higher priority to preprocessor_config.json and then, when saving back, save it as image_preprocessor_config.json. I think for saving we don't need to log anything, but for the loading part, sure.

Can't we do something like in instructblip where we override from_pretrained and save_pretrained to load a different file or from a subfolder? I agree we want to fix this as soon as possible, but I'm not comfortable with the dual-config solution which relies on some arguments being used.

Oh yeah, that works, didn't think about that hehe. I will make a PR for that and will need to update files on the hub later

amyeroberts commented 1 day ago

Yes, agreed that we should give higher priority to the preprocessor_config.json and then when saving back save it with the image_preprocessor_config.json

I think we'd want to put higher priority on image_preprocessor_config.json if both exist -- as that will be the more recent config file