octimot / StoryToolkitAI

An editing tool that uses AI to transcribe, understand content and search for anything in your footage, integrated with ChatGPT and other AI models
GNU General Public License v3.0

Additional Zero Shot Models #171

Open mkammes opened 3 months ago

mkammes commented 3 months ago

Is your feature request related to a problem? Please describe: No.

Describe the solution you'd like: Additional zero-shot models, such as Grounding DINO, and maybe Detectron2 or Segment Anything. Grounding DINO in particular, which is promptable, would be great.

Describe alternatives you've considered: n/a

Additional context: The Grounding DINO model is promptable and apparently scores higher than CLIP.

octimot commented 3 months ago

Hey there!

I think Segment Anything / Grounding DINO produce more restrictive embeddings due to their promptable nature (more focused training data). In other words, CLIP on its own lets you search using more "obscure" language, while the others might be restricted to more common words (car, sky, bird, face, etc.).

We're preparing an update that also supports GPT-Vision and LLaVA-like models, which would let you ingest and prompt directly too.

But, I'll take a look at these too ASAP!

Cheers
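
For readers following the thread, the embedding-based search mentioned in the reply above can be sketched roughly as follows. This is a minimal illustration of ranking frames against a free-text query by cosine similarity, not StoryToolkitAI's actual implementation. The vectors are toy values standing in for the output of an image/text encoder such as CLIP, and the names `cosine_similarity`, `rank_frames`, `frame_embeddings` and `query_embedding` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction, so treat it as unrelated.
        return 0.0
    return dot / (norm_a * norm_b)


def rank_frames(query_embedding, frame_embeddings):
    """Sort frames from most to least similar to the query embedding."""
    scored = [
        (name, cosine_similarity(query_embedding, embedding))
        for name, embedding in frame_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors (assumption): a real encoder would produce
# hundreds of dimensions, and text and images would share one space.
frame_embeddings = {
    "frame_001": [0.9, 0.1, 0.0],
    "frame_002": [0.1, 0.8, 0.3],
    "frame_003": [0.0, 0.2, 0.9],
}

# Stand-in for the embedding of a free-text search query.
query_embedding = [0.2, 0.9, 0.2]

ranking = rank_frames(query_embedding, frame_embeddings)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because any text can be embedded into the same space, this style of search is not limited to a fixed vocabulary of labels, which is the point made about "obscure" language above.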