Closed yonigozlan closed 8 hours ago
@yonigozlan yes, indeed, totally forgot about this quirk when we have both image and video processors and they don't share same set of config attributes. We can either swap the order, as it is working now in llava-next-video or add unused config attributes to video processor so that it gets saved and loaded. Since we want to be consistent across repo and the current order is a result of PR comment, I'd say adding unused attr in config is a better solution, @amyeroberts wdyt?
Not sure if we want to support different attr for video and image currently though and I don't see a warkaround for that. But I also opened a Feature Request at #33504, which I've been thinking for a while and seems we can benefit from it now.
Thanks for flagging @yonigozlan!
This is tricky and quiet a broad processors issue I expect to be a more common problem as multi-modality becomes increasingly popular.
OK, so the way this has been solved in other processors which have more than one processing class of the same type is to do something pretty hacky in which one of the class' configs is saved out to a separate subfolder. For example, in InstructBLIP we have a tokenizer
and qformer_tokenizer
which we then have to handle in custom save_pretrained and from_pretrained. This obviously doesn't scale as a solution, however something like this is what I'd suggest for a quick fix for LlavaOneVision, as both the image processor and videoprocessor should have their own configs. Config files are independent and the resulting loaded classes should be fully loadable from a single config file separate from any other classes which it may be bundled in with in the processor, nor its order in the processor.
I'd say adding unused attr in config is a better solution, @amyeroberts wdyt?
Containing unused arguments in a config file is both confusing and can lead to future issues if any arguments are added or modified in the future. It creates a dependency between the two processing classes which we should try to avoid. I also don't think it solves the issue of shared attributes between the classes e.g. would this work for Yoni's example where we want to set rescale_factor
to different values for the two classes?
Regarding #33504 -- having a separate VideoProcessor I think is a good idea, however I don't think it will fully address the two issues we currently have as:
preprocessor_config.json
. So any processor which contains more than one of these classes will hit this clashA few possible ideas:
preprocessor_config.json
for config files which use the processing classes' name i.e. use "feature_extractor_config.json" instead. This will solve this issue but not the instructblip issue. I'd like this to happen anyway, as it's clearer what the file will contain when exploring the hub.My preference would be for 1. + 3. as it involves the least amount of conditional logic and handling --> easier to test and more forwards compatible, but v. open to rebuttals and alternative suggestions.
A few questions to be resolved:
preprocessor_config.json
is probably fine. If we have subfolders, do these take precedence? qformer_tokenizer_config.json
how do we know how to load it?@amyeroberts
the image processor and videoprocessor should have their own configs [and then] Feature extractors, image processors (and possibly the video processors) all save to preprocessor_config.json
Agreed here, that is what I want to achieve with the linked issue in long term, each one in its config file. I see that now we have only one preprocessing config per model but I believe soon we'll have some image+video+audio models and would need to save them separately. Currently we can start from sorting out video modality, because we already have several models added to transformers
I like the idea to enable conditional naming of config files with prefixes
but not a fan of separate folders as it seems like each folder will end up with one config file, in most common cases. So a video_processor
folder can hold video_preprocessing_config.json
and so on. Cmiiw.
Backwards compatibility -- how to make sure old loading still works? Falling back to loading preprocessor_config.json is probably fine. If we have subfolders, do these take precedence?
In case of prefixing each config file, we don't have to support much BC and will give priority to specific configs with prefixes if they exist. Otherwise fallback to the default one w/o prefixing. Maybe emitting a deprecation/warning/info message for users
If a processing has its config saved out to a different file name e.g. qformer_tokenizer_config.json how do we know how to load it? > What happens if a processor is adapted to add a new processing class e.g. it had originally one tokenizer and another one is added in a later release?
I am not very sure having two tokenizers is a common thing and qformers seems more of an exception than a rule. Maybe we can leave instructblip as it is and separate out modality-specific configs only, one file per modality. Same way one python file per modality with their own processor class
If we have subfolders for e.g. an image processor and tokenizer, how do we know from the auto classes where to look to load the right files? How should e.g. AutoImageProcessor behave when loading from a checkpoint with two image processors? Maybe we can just raise an error and say a subfolder has to be specified (simple + no magic)?
I can't say much about audio models, from what I see they are in FeatureProcessor
mapping and I believe the main idea was to use feature processor for audios and deprecate it for image models? And my idea is to add a new mapping AutoVideoProcessor
since videos will inherit from their own class, overriding some methods and keeping others from image processing. And when loading with auto, users would need to AutoFeatureExtractor.from_pretrained()
, AutoVideoProcessor.from_pretrained()
or AutoImageProcessor.from_pretrained()
. We'll check out the new configs for each and if not found load from the general config w/o prefix as above.
Deprecate preprocessor_config.json for config files which use the processing classes' name i.e. use "feature_extractor_config.json" instead.
Didn't really get this part. You mean end up with only feature_extractor_config
for all cases? Isn't feature extractor under deprecation until v5.0 in some models like CLIP?
Containing unused arguments in a config file is both confusing and can lead to future issues if any arguments are added or modified in the future.
Unfortunately yes, but imo the other solution of adding a new config file for video processor is going to be a log process, so this can be quick workaround. To make the processor loadable after saving
I like the idea to enable conditional naming of config files with prefixes but not a fan of separate folders as it seems like each folder will end up with one config file, in most common cases. So a video_processor folder can hold video_preprocessing_config.json and so on. Cmiiw.
Nope - you're right. Happy to go with that.
I can't say much about audio models, from what I see they are in FeatureProcessor mapping and I believe the main idea was to use feature processor for audios and deprecate it for image models?
Yep.
Didn't really get this part. You mean end up with only feature_extractor_config for all cases? Isn't feature extractor under deprecation until v5.0 in some models like CLIP?
Sorry, I wasn't sure clear here. I meant that a feature extractor for an audio model would just use feature_extractor_config.json
. Any vision models which previously had a feature extractor class shouldn't have this config file added.
We might end up in the situation when there's two config files for a class. For example, I load an image processor using an existing preprocessor_config.json
file, and then save it out again and push to the hub. There would now be preprocessor_config.json
and image_processor_config.json
. In which case, we'll have to think about how to handle this (maybe just emit logging.info that preprocessor_config.json
was detected by loading from the more recent file. I don't think we want a warning.
For the current LLaVA-OneVIsion case, as it is already bugged and more important to fix
Can't we do something like in instructblip where we override from_pretrained
and save_pretrained
to load a different file or from a subfolder? I agree we want to fix this as soon as possible, but I'm not comfortable with the dual-config solution which relies on some arguments being used.
There would now be preprocessor_config.json and image_processor_config.json. In which case, we'll have to think about how to handle this (maybe just emit logging.info that preprocessor_config.json was detected by loading from the more recent file. I don't think we want a warning.
Ah I see now. Yes, agreed that we should give higher priority to the preprocessor_config.json
and then when saving back save it with the image_preprocessor_config.json
. I think for saving we don't need to log anything, but for loading part sure
Can't we do something like in instructblip where we override from_pretrained and save_pretrained to load a different file or from a subfolder? I agree we want to fix this as soon as possible, but I'm not comfortable with the dual-config solution which relies on some arguments being used.
Oh yeah, that works, didn't think about that hehe. I will make a PR for that and will need to update files on the hub later
Yes, agreed that we should give higher priority to the preprocessor_config.json and then when saving back save it with the image_preprocessor_config.json
I think we'd want to put higher priority on image_preprocessor_config.json
if both exist -- as that will be the more recent config file
System Info
Who can help?
@zucchini-nlp @amyeroberts
Reproduction
If we save an instance of LlavaOnevisionProcessor such as:
tmpdirname = tempfile.mkdtemp()
and then try to load it with AutoImageProcessor:
Variant of the problem:
Expected behavior
Expected: no ambiguity on whether we are loading the
image_processor
or thevideo_processor
config/class.It looks like both
image_processor
andvideo_processor
configs are saved to the samepreprocessor_config.json
file as they both inherit fromBaseImageProcessor
, and thus overwrite each other when saved.