sehyunkwon / ICTC

This is a public repository for Image Clustering Conditioned on Text Criteria (IC|TC)
Apache License 2.0

Question on application #6

Closed: anjugopinath closed this issue 2 months ago

anjugopinath commented 4 months ago

Hi,

I had a question about the application:

  1. Suppose I have a set of images and I provide the verbs myself, for example "a person surfing" or "a person holding a surfboard". Can I cluster the images by the action they show? Also, how many images are required? Can I provide 1000 images and expect them to cluster well?
  2. If I provide the verbs, can I use vectors (embeddings) as input instead of images?

Thank You.

sehyunkwon commented 4 months ago

Hello!

  1. IC|TC can accommodate any criterion expressible in text, so clustering by verbs is possible. I cannot guarantee a specific minimum number of images, but I believe 1000 should be sufficient.
  2. It would be possible if we fine-tuned the VLM and LLM to understand embeddings. Alternatively, one could convert the text criteria (i.e., the verbs) into embeddings and cluster by letting them interact with the image embeddings. This could be interesting future work.
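
The alternative in point 2 could look roughly like the sketch below. It is not part of IC|TC and is only one possible reading of "interacting" text and image embeddings: each image is assigned to the verb whose embedding has the highest cosine similarity. It assumes the image and verb embeddings already live in a shared space (for example, from a CLIP-style model), and the function names `l2_normalize` and `assign_to_verbs` as well as the toy 3-dimensional vectors are made up for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def assign_to_verbs(image_emb, verb_emb):
    """Assign every image to its most similar verb (hypothetical helper).

    image_emb: array of shape (n_images, d)
    verb_emb:  array of shape (n_verbs, d), assumed to share the image space
    Returns (labels, sim): labels[i] is the verb index chosen for image i,
    sim is the (n_images, n_verbs) cosine-similarity matrix.
    """
    img = l2_normalize(np.asarray(image_emb, dtype=float))
    txt = l2_normalize(np.asarray(verb_emb, dtype=float))
    sim = img @ txt.T
    return sim.argmax(axis=1), sim


# Toy data standing in for real embeddings (an assumption, not real model output).
rng = np.random.default_rng(0)
verbs = ["surfing", "holding a surfboard"]
verb_emb = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
image_emb = np.vstack([
    verb_emb[0] + 0.05 * rng.normal(size=(3, 3)),  # three images near "surfing"
    verb_emb[1] + 0.05 * rng.normal(size=(2, 3)),  # two images near "holding a surfboard"
])

labels, sim = assign_to_verbs(image_emb, verb_emb)
for i, label in enumerate(labels):
    print(f"image {i}: {verbs[label]}")
```

Nearest-verb assignment is the simplest choice here. With real embeddings, actions as visually close as surfing and holding a surfboard may not separate this cleanly, so the similarity matrix `sim` is returned as well for inspecting borderline images.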

Thanks for your questions!