Open HaisongDing opened 1 year ago
I notice that you directly use OpenAI's GPT-4 to match caption and grounded entity. Why not train a custom model by leveraging existing datasets like the ones used in KOSMOS-2 or Shikra?
Follow-up question, why not directly use GLIP or G-DINO?
I notice that you directly use OpenAI's GPT-4 to match caption and grounded entity. Why not train a custom model by leveraging existing datasets like the ones used in KOSMOS-2 or Shikra?