Closed — camilosys3 closed this issue 2 months ago
Hi, can you provide the minimal code snippet to reproduce this and also your transformers version?
Sure! Here is the minimal code snippet up to the error: link
Initially, I installed transformers version 4.31.0, as specified in the requirements.txt file. This resulted in a KeyError when trying to load the visual search model and processor with AutoProcessor.from_pretrained using the model identifier craigwu/seal_vsm_7b. The error suggests that the necessary configuration files (config.json and tokenizer_config.json) are missing or incorrectly structured in that repository, preventing proper model instantiation.
To resolve the KeyError, I updated transformers to the latest version (4.41.2). However, this led to a new error: the preprocessor_config.json file is missing from the craigwu/seal_vsm_7b repository, so it cannot be downloaded when the code runs.
If possible, I would greatly appreciate guidance on which code to run to take an image and a prompt naming an object as input, and output the coordinates of the bounding box of the found object. I believe it should be vstar/visual_search.py or one of the classes in vstar/VisualSearch/model/VSM.py. I apologize if my question is basic or unclear; I am new to this field, but I found your work fascinating and decided to explore further to evaluate this mechanism on different types of images.
Thanks for providing your information. For your case, you only need visual_search.py. That script shows how to load the visual search model: it initializes a class VSM, in which the vsm_tokenizer and vsm_model are set up. The main function in the same script demonstrates how to run visual search given an image and an object name.
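Since the end goal here is bounding-box coordinates, here is a minimal sketch of the post-processing side, assuming the model's textual output contains the box as four bracketed integers. The helper name `parse_bbox` and the output format are illustrative assumptions; only the `VSM` class and its `vsm_tokenizer` / `vsm_model` attributes come from the repo's `visual_search.py`, which should be consulted for the actual loading and inference code.

```python
# Sketch: turning a visual-search text output into box coordinates.
# NOTE: parse_bbox and the "[x1, y1, x2, y2]" output format are assumptions
# for illustration; see visual_search.py in the repo for the real VSM class
# (vsm_tokenizer / vsm_model) and its main() usage example.
import re
from typing import List, Optional


def parse_bbox(output_text: str) -> Optional[List[int]]:
    """Extract the first [x1, y1, x2, y2] box from a model output string.

    Returns None when no bracketed four-integer group is present.
    """
    match = re.search(
        r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]", output_text
    )
    if match is None:
        return None
    return [int(g) for g in match.groups()]


if __name__ == "__main__":
    # The heavy lifting (loading craigwu/seal_vsm_7b and running the search)
    # is done by the repo's visual_search.py; this only parses its output.
    print(parse_bbox("soccer ball at [120, 45, 180, 110]"))
```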
Thank you very much! I was able to run vstar_bench.py with my own images, but the results were below expectations, likely due to my own poor implementation. Then I tried with the images from vstar_bench/GPT4V-hard and the results were as expected!
Now I am iterating on the adjustments my own images need before they can be analyzed by the pretrained models. In this regard, I believe I do not need the .json file, but instead just strings giving the prompt for the object of interest in the image. Is this correct? One of the tests I want to conduct is counting the number of objects in an image (for example, soccer balls). Does this type of question need additional handling, or should the prompt and the current code implementation suffice?
Thanks again for your help!
Our visual search model is mainly designed to handle single-instance cases for now; we focus on search efficiency, finding the target instance in a minimal number of steps. If you want to exhaustively count instances of a target category, you need to search every position in the image to make sure each instance is counted. In that case the search algorithm becomes unnecessary, and you instead have to run detection exhaustively on each crop of the image to find every target instance.
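To make the exhaustive alternative concrete, here is a minimal sketch of the crop-and-merge bookkeeping it would require: enumerate overlapping crop windows covering the whole image, run a detector on each crop (not shown), then deduplicate boxes found in more than one crop. The crop size, stride, and IoU-based deduplication are assumptions for illustration and are not part of this repository.

```python
# Sketch: exhaustive counting over image crops (detector itself not included).
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in image pixels


def sliding_crops(width: int, height: int,
                  crop: int = 512, stride: int = 256) -> List[Box]:
    """Enumerate overlapping crop windows that cover the whole image."""
    xs = list(range(0, max(width - crop, 0) + 1, stride)) or [0]
    ys = list(range(0, max(height - crop, 0) + 1, stride)) or [0]
    # Cover the right/bottom edges when the stride doesn't divide evenly.
    if xs[-1] + crop < width:
        xs.append(width - crop)
    if ys[-1] + crop < height:
        ys.append(height - crop)
    return [(x, y, min(x + crop, width), min(y + crop, height))
            for y in ys for x in xs]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


def count_unique(detections: List[Box], thresh: float = 0.5) -> int:
    """Count detections after removing duplicates from overlapping crops."""
    kept: List[Box] = []
    for box in detections:
        if all(iou(box, k) < thresh for k in kept):
            kept.append(box)
    return len(kept)
```

With a 1024x768 image and the default 512-pixel crop at a 256-pixel stride, `sliding_crops` yields six windows, and each would need a separate detection pass, which is why exhaustive counting is much more expensive than single-instance search.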
Hello!
I am trying to run the code from the repository on Google Colab. I managed to clone the repository and install the requirements, but when I try to instantiate the models with:
I get the following error:
Indeed, the repository does not have the specified file, which prevents me from progressing. My goal is to run the code that, given an image and a prompt, displays bounding boxes and their coordinates for the object indicated in the prompt. So far, I have not been successful. I would appreciate any guidance on how to proceed or if it is possible to achieve this with the current version of the repository. Thanks!