Adaptive search modalities by index.

twelvelabs-io / tl-jockey

Jockey is a conversational video agent.


Adaptive search modalities by index. #42

Closed TravisCouture closed 1 month ago

TravisCouture commented 4 months ago

Currently, the search worker may call the search API with search modalities that are not valid for the target index. The default error response may also list modalities that conflict with the enum defined for the field (visual and conversation are defined, but text_in_video and logo are not). We need an adaptive way to handle this: either return a better error response so Jockey can adapt accordingly, or automatically cleanse incoming requests that contain invalid modalities.
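A minimal sketch of the "cleanse incoming requests" option, in Python. The function name `cleanse_search_options`, its arguments, and the fallback to every supported modality when nothing valid remains are assumptions made for illustration. They are not taken from the Jockey code base, and the sketch does not call the Twelve Labs API; the caller is assumed to already know which modalities the index supports.

```python
from typing import Iterable


def cleanse_search_options(
    requested: Iterable[str], supported: Iterable[str]
) -> tuple[list[str], list[str]]:
    """Split requested modalities into (valid, invalid) for one index.

    Hypothetical helper: `supported` is assumed to be the list of
    modalities the index was created with.
    """
    # Materialize once so generators are not consumed twice.
    supported_list = list(dict.fromkeys(supported))
    supported_set = set(supported_list)

    valid: list[str] = []
    invalid: list[str] = []
    for modality in requested:
        name = modality.strip().lower()
        bucket = valid if name in supported_set else invalid
        if name not in bucket:  # drop duplicates, keep request order
            bucket.append(name)

    # Assumption: if nothing valid is left, search with everything the
    # index supports instead of sending an empty list.
    if not valid:
        valid = supported_list
    return valid, invalid


if __name__ == "__main__":
    valid, invalid = cleanse_search_options(
        ["visual", "text_in_video"], ["visual", "conversation"]
    )
    print(valid, invalid)
```

Returning the rejected modalities alongside the cleaned list lets the worker report what was dropped rather than silently changing the request.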

DmitriiTsy commented 2 months ago

I started using this prompt for testing. The idea is to push the search worker into using a modality that the index doesn't support:

'use index [TwelveLabsIndex] to find the top 2 clips of touchdowns using text_in_video.'

DmitriiTsy commented 2 months ago

With that prompt, the final response ends with the following message, which confirms the problem: "It appears there has been a repeated error in the search process. The search was again conducted using the "conversation" modality instead of the "text_in_video" modality as requested. Please ensure that the search is conducted using the correct modality, "text_in_video", to fulfill the user's request"
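The message quoted above suggests the agent keeps retrying because it is never told which modalities the index actually supports. One possible shape for the "better error response" option is sketched below in Python. The function name `describe_invalid_modalities` and the wording of the message are hypothetical, not part of Jockey or the Twelve Labs API.

```python
from typing import Optional, Sequence


def describe_invalid_modalities(
    requested: Sequence[str], supported: Sequence[str]
) -> Optional[str]:
    """Return an explanatory error string, or None if the request is valid.

    Hypothetical helper: the text is meant to be handed back to the agent
    so it can stop retrying and tell the user what is available.
    """
    invalid = [m for m in requested if m not in supported]
    if not invalid:
        return None
    return (
        f"Unsupported search modalities for this index: {', '.join(invalid)}. "
        f"Supported modalities: {', '.join(supported)}. "
        "Do not retry with the unsupported modalities; "
        "tell the user they are unavailable for this index."
    )


if __name__ == "__main__":
    print(describe_invalid_modalities(["text_in_video"], ["visual", "conversation"]))
```

Naming both the rejected and the supported modalities in one message gives the agent enough context to adapt in a single step.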

DmitriiTsy commented 2 months ago

Here are a couple more example prompts for testing:

Try to perform a search using an invalid modality like "text_in_video" on the same index. Report how the system handles this invalid modality
Perform a search using both visual and conversation modalities on the index [TwelveLabsIndexID]. Report the results and any modalities used.

DmitriiTsy commented 2 months ago

This prompt is also very useful for testing:

Using index [ourIndexID], search for clips that contain both spoken dialogue about football and on-screen text showing game scores. Use visual, conversation, and text_in_video modalities for this search. Find the top 3 clips where all these elements appear together. For each clip, provide the timestamp, a brief description of what's being said, and the text that appears on screen.