penghao-wu / vstar

PyTorch Implementation of "V* : Guided Visual Search as a Core Mechanism in Multimodal LLMs"
https://vstar-seal.github.io/
MIT License

Error: Missing preprocessor_config.json in craigwu/seal_vsm_7b Repository #15

Closed by camilosys3 2 months ago

camilosys3 commented 3 months ago

Hello!

I am trying to run the code from the repository on Google Colab. I managed to clone the repository and install the requirements, but when I try to instantiate the models with:

vsm_model_url = "craigwu/seal_vsm_7b"
vqa_model_url = "craigwu/seal_vqa_7b"

I get the following error:

OSError: craigwu/seal_vsm_7b does not appear to have a file named preprocessor_config.json. Check 'https://huggingface.co/craigwu/seal_vsm_7b/tree/main' for available files.

Indeed, the repository does not contain the specified file, which blocks me from proceeding. My goal is to run the code that, given an image and a prompt, displays the bounding boxes and coordinates of the object indicated in the prompt. So far, I have not succeeded. I would appreciate any guidance on how to proceed, or confirmation of whether this is possible with the current version of the repository. Thanks!

penghao-wu commented 3 months ago

Hi, could you provide a minimal code snippet to reproduce this, along with your transformers version?

camilosys3 commented 3 months ago

Sure! Here is the minimal code snippet up to the error: link

Initially, I installed version 4.31.0 of transformers as specified in the requirements.txt file. This resulted in a KeyError when trying to load the visual search model and processor using AutoProcessor.from_pretrained with the URL craigwu/seal_vsm_7b. This error suggests that the necessary configuration files (config.json and tokenizer_config.json) are missing or incorrectly structured in the specified repository, preventing proper model instantiation.

To resolve the KeyError, I updated transformers to the latest version (4.41.2). However, this led to a new error: while downloading the necessary configuration files during execution, it reports that preprocessor_config.json is missing from the craigwu/seal_vsm_7b repository.
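
For reference, a minimal sketch of the loading call described above (the exact snippet is behind the link, so the surrounding details here are assumptions):

```python
# Minimal sketch of the failing load, per the description above.
from transformers import AutoProcessor

vsm_model_url = "craigwu/seal_vsm_7b"
# With transformers 4.31.0 this raised a KeyError; with 4.41.2 it raises
# the OSError about the missing preprocessor_config.json.
processor = AutoProcessor.from_pretrained(vsm_model_url)
```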

If possible, I would greatly appreciate guidance on which code to execute so that, given an image and a prompt naming an object, I get the coordinates of the found object's bounding box. I believe it should be vstar/visual_search.py or one of the classes in vstar/VisualSearch/model/VSM.py. I apologize if my question is basic or unclear; I am new to this field, but I found your work fascinating and decided to explore further to evaluate the capabilities of this mechanism on different types of images.

penghao-wu commented 3 months ago

Thanks for providing the information. For your case, you only need visual_search.py. That script defines a VSM class whose constructor initializes the vsm_tokenizer and vsm_model, and its main function shows how to run visual search given an image and an object name.
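
A rough sketch of that flow, for orientation: only the VSM class name comes from the description above; the argument fields and the visual_search helper are hypothetical placeholders, so check visual_search.py for the actual interface.

```python
# Hypothetical usage sketch -- only the VSM class name comes from the
# maintainer's description; the argument fields and the visual_search
# helper below are illustrative placeholders. See visual_search.py for
# the real argparse options and entry point.
from types import SimpleNamespace
from PIL import Image

from visual_search import VSM, visual_search  # visual_search is assumed

# visual_search.py builds its options with argparse; SimpleNamespace
# stands in for the parsed args here (the field name is an assumption).
args = SimpleNamespace(version="craigwu/seal_vsm_7b")
vsm = VSM(args)  # per the reply above, this sets up vsm_tokenizer and vsm_model

image = Image.open("example.jpg").convert("RGB")
# Assumed signature: returns search results such as bounding-box coordinates.
results = visual_search(vsm, image, "soccer ball")
print(results)
```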

camilosys3 commented 2 months ago

Thank you very much! I was able to run vstar_bench.py with my own images, but the results fell short of expectations, likely due to my own flawed setup. I then tried the images from vstar_bench/GPT4V-hard, and the results were as expected!

Now I am iterating to work out what my own images require in order to be analyzed by the pretrained models. In this regard, I believe I do not need the .json file, just strings giving the prompt for the object of interest in the image. Is that correct? One of the tests I want to conduct is counting the number of objects in an image (for example, soccer balls). Are additional considerations needed for this type of question, or should the prompt and the current code implementation suffice?

Thanks again for your help!

penghao-wu commented 2 months ago

Our visual search model is mainly designed to handle single-instance cases for now. We focus on search efficiency, finding the target instance in as few steps as possible. If you want to exhaustively count the instances of a target category, the search would need to cover every position in the image to make sure each instance is counted. In that case the search algorithm brings no benefit, and you would instead have to run detection exhaustively on each crop of the image to find every target instance.
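
To make that exhaustive alternative concrete, here is a minimal sketch of counting by tiled detection. It assumes a user-supplied detect_fn, a hypothetical stand-in for any detector (it is not part of this repository's API) that takes a PIL crop and returns bounding boxes in crop coordinates.

```python
# Illustrative sketch of the exhaustive alternative described above:
# tile the image into overlapping crops and run a detector on each one.
from PIL import Image

def count_instances(image_path, detect_fn, crop_size=512, stride=256):
    """Count detections across overlapping crops of a large image.

    detect_fn is a hypothetical, user-supplied detector that takes a PIL
    crop and returns a list of (x1, y1, x2, y2) boxes in crop coordinates.
    """
    image = Image.open(image_path).convert("RGB")
    width, height = image.size
    boxes = []
    for top in range(0, max(height - crop_size, 0) + 1, stride):
        for left in range(0, max(width - crop_size, 0) + 1, stride):
            crop = image.crop((left, top, left + crop_size, top + crop_size))
            for x1, y1, x2, y2 in detect_fn(crop):
                # Shift each box back into full-image coordinates.
                boxes.append((x1 + left, y1 + top, x2 + left, y2 + top))
    # Naive count: overlapping crops can detect the same object twice,
    # so real code would deduplicate (e.g. with non-maximum suppression).
    return len(boxes)
```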