Hi,

I had a question about the application: if I have a set of images, can I provide the verbs (for example, "this image is of a person surfing" or "this image is of a person holding a surfboard") and have the images clustered based on those actions? (I.e., I provide the verbs.) Also, how many images are required? Can I provide 1000 images and expect them to cluster well?

A follow-up: if I provide the verbs, can I use vectors (embeddings) as input instead of images?

Thank you.

IC|TC can accommodate any criterion expressible in text, so clustering by verbs is possible. As for the amount of data, I cannot guarantee a specific number, but I believe 1000 images should be sufficient.

Using embeddings directly would require fine-tuning the VLM and LLM to understand them. Alternatively, converting the text criterion (i.e., the verbs) into embeddings and then having those interact with the image embeddings for clustering is also possible. This could be an interesting direction for future work.
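To make the embedding-based idea concrete, here is a minimal NumPy sketch (not the IC|TC pipeline itself): it assumes you already have image embeddings and verb embeddings in a shared space from some encoder (e.g., a CLIP-style model), and simply assigns each image to the nearest verb by cosine similarity. The function name and the synthetic data below are purely illustrative.

```python
import numpy as np

def cluster_by_verbs(image_embs: np.ndarray, verb_embs: np.ndarray) -> np.ndarray:
    """Assign each image embedding to the nearest verb embedding
    by cosine similarity. Returns one cluster index per image."""
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = verb_embs / np.linalg.norm(verb_embs, axis=1, keepdims=True)
    sims = img @ txt.T           # (n_images, n_verbs) cosine similarities
    return sims.argmax(axis=1)   # index of the best-matching verb per image

# Synthetic demo: two verb "directions" and images clustered near each.
rng = np.random.default_rng(0)
verbs = np.array([[1.0, 0.0],          # e.g. "surfing"
                  [0.0, 1.0]])         # e.g. "holding a surfboard"
images = np.vstack([
    rng.normal([5.0, 0.0], 0.1, (3, 2)),  # three images aligned with verb 0
    rng.normal([0.0, 5.0], 0.1, (3, 2)),  # three images aligned with verb 1
])
labels = cluster_by_verbs(images, verbs)
print(labels.tolist())  # [0, 0, 0, 1, 1, 1]
```

In practice you would replace the synthetic arrays with real encoder outputs; the point is only that once both modalities live in one space, the "clustering" reduces to nearest-criterion assignment.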