VidSage focuses on processing video data, storing it in Azure AI services, and enabling advanced local and global querying through a combination of techniques: Azure AI Search (native RAG), graph-based retrieval (Graph RAG), the OpenAI CLIP model (image embeddings), and Azure OpenAI GPT-4o.
Introduction
VidSage provides detailed business insights from videos using Azure AI Search and an advanced Graph RAG capability that analyzes the entire video repository.
The platform's intelligent multi-modal chunking strategy lets it point to the exact section of a video where a particular topic is discussed.
Architecture
The architecture consists of several stages:
Video Upload: Videos are uploaded to the repository.
Processing: Text is extracted using the Azure Speech-to-Text (STT) service with speaker diarization, and image keyframes are extracted from the videos.
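The keyframe-selection part of this step can be sketched with a simple frame-difference heuristic. This is a minimal illustration only — the flat-list frame representation, the threshold value, and the function name are assumptions, not VidSage's actual implementation:

```python
def select_keyframes(frames, threshold=0.3):
    """Pick frames that differ enough from the last kept frame.

    `frames` is a list of flat pixel-intensity lists (stand-ins for
    decoded video frames, values in 0..1); `threshold` is the mean
    absolute pixel difference required to emit a new keyframe.
    Returns the indices of the selected keyframes.
    """
    if not frames:
        return []
    keyframes = [0]                      # always keep the first frame
    last = frames[0]
    for i, frame in enumerate(frames[1:], start=1):
        diff = sum(abs(a - b) for a, b in zip(frame, last)) / len(frame)
        if diff >= threshold:            # scene changed enough
            keyframes.append(i)
            last = frame
    return keyframes
```

A real pipeline would decode frames with a video library and could swap the pixel difference for a histogram or embedding distance; the shape of the logic stays the same.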
Transcript Enhancement:
Text transcripts are enhanced with keyframe descriptions using Azure OpenAI GPT-4o.
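The enhancement step — weaving GPT-4o keyframe descriptions into the STT transcript — amounts to a timestamp merge. A minimal sketch (the tuple layouts and `[visual: ...]` marker are illustrative assumptions):

```python
def enhance_transcript(segments, keyframe_descriptions):
    """Interleave keyframe descriptions into the STT transcript.

    segments: list of (start_seconds, speaker, text) from diarized STT.
    keyframe_descriptions: list of (timestamp_seconds, description)
    produced by GPT-4o for each extracted keyframe.
    Returns one chronologically ordered list of transcript lines.
    """
    events = [(t, f"{spk}: {txt}") for t, spk, txt in segments]
    events += [(t, f"[visual: {desc}]") for t, desc in keyframe_descriptions]
    events.sort(key=lambda e: e[0])      # merge by timestamp
    return [line for _, line in events]
```

Because each line keeps its timestamp during the merge, the enhanced transcript can later point back to the exact section of the video.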
Embedding Creation:
Text embeddings are generated using the Azure OpenAI Ada embedding model.
Image embeddings are generated using the OpenAI CLIP model.
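At query time, both kinds of embeddings (Ada vectors for text, CLIP vectors for images) are compared by vector similarity. A minimal cosine-similarity sketch of that comparison:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

In production this comparison is performed inside Azure AI Search's vector index rather than in application code, but the ranking principle is the same.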
Azure AI Search:
Store text embeddings in a text index.
Store image embeddings in an image index.
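The documents pushed into the two indexes can be sketched as below. The field names are illustrative assumptions — the real schemas are defined in Azure AI Search with a vector field for the embedding:

```python
def make_text_document(video_id, chunk_id, text, embedding):
    """Illustrative shape of a document for the text index."""
    return {
        "id": f"{video_id}-{chunk_id}",
        "videoId": video_id,
        "content": text,
        "embedding": embedding,          # Ada text-embedding vector
    }

def make_image_document(video_id, frame_id, timestamp, embedding):
    """Illustrative shape of a document for the image index."""
    return {
        "id": f"{video_id}-frame-{frame_id}",
        "videoId": video_id,
        "timestamp": timestamp,          # seconds into the video
        "embedding": embedding,          # CLIP image-embedding vector
    }
```

Keeping `videoId` (and, for frames, `timestamp`) on every document is what lets a retrieved hit be traced back to a specific section of a specific video.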
Graph RAG:
A graph database stores a knowledge graph built from the enhanced transcripts.
For Graph RAG, we use advanced agentic chunking: GPT-4o mini first rewrites every sentence in a transcript as a standalone sentence, then groups the transcript into relevant, meaningful chunks. These chunks are connected to the Video node.
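The chunking output can be sketched as below. VidSage lets GPT-4o mini decide the topical boundaries; this deterministic stand-in simply caps chunk size so the shape of the result (a list of chunk strings to be linked to the Video node) is clear:

```python
def chunk_sentences(sentences, max_per_chunk=3):
    """Group standalone sentences into chunks.

    A stand-in for the agentic chunker: the real system asks
    GPT-4o mini to pick boundaries by topic; here we just cap
    the number of sentences per chunk.
    """
    chunks = []
    for i in range(0, len(sentences), max_per_chunk):
        chunks.append(" ".join(sentences[i:i + max_per_chunk]))
    return chunks
```

Each returned chunk would then become a node with an edge to its Video node in the graph.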
For each video, we extract all entities and relationships, and we create a Video node and a Summary node containing the video's text transcript, a summary of the transcript, and the video's topics, features, issues, speakers, and sentiment.
Whenever a new video is uploaded, we apply entity disambiguation so that entities with similar names and meanings are not duplicated.
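A minimal first pass at that disambiguation is exact matching on normalized names, sketched below. The real system would additionally compare aliases or embeddings via the LLM; the function names and dict-based graph store here are assumptions for illustration:

```python
def canonical_key(name):
    """Normalize an entity name for duplicate detection."""
    return " ".join(name.lower().split())

def merge_entity(graph_entities, name):
    """Return the existing canonical entity if one matches, else add new.

    graph_entities: dict mapping normalized key -> canonical spelling,
    standing in for an index over entity nodes in the graph database.
    """
    key = canonical_key(name)
    if key not in graph_entities:
        graph_entities[key] = name       # first spelling becomes canonical
    return graph_entities[key]
```

With this in place, a later mention of "azure  AI search" resolves to the already-created "Azure AI Search" node instead of spawning a duplicate.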
The graph is structured so that, at any point in time, it represents the overall discussion across all videos processed by the platform. This lets Graph RAG answer queries better than native RAG, which can answer only from the retrieved chunks and may miss the overall knowledge representation.
Storage: Enhanced text transcripts and image keyframes are stored in Azure AI Search vector indexes for efficient retrieval.
Querying
Local Querying
Local querying is performed for questions based on a specific video.
Native Retrieval-Augmented Generation (RAG): Uses Azure AI Search to retrieve relevant text chunks and image keyframes related to the query.
Response Generation: The retrieved information is passed through Azure GPT-4o to generate answers.
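The two steps above form a retrieve-then-generate flow, sketched below. `embed`, `search_text_index`, and `generate` are hypothetical stand-ins for the Azure OpenAI embedding call, the Azure AI Search vector query, and the GPT-4o completion — injected here so the sketch stays self-contained:

```python
def answer_local_query(query, embed, search_text_index, generate):
    """Native RAG flow for a question about a specific video.

    embed(text) -> embedding vector
    search_text_index(vector, top_k) -> list of transcript chunks
    generate(prompt) -> model answer
    """
    query_vector = embed(query)
    chunks = search_text_index(query_vector, top_k=3)
    context = "\n".join(chunks)
    prompt = (
        "Answer using only this context:\n"
        f"{context}\n\nQuestion: {query}"
    )
    return generate(prompt)
```

Grounding the prompt in only the retrieved chunks is what keeps the answer tied to the specific video being queried.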
Global Querying
Global querying is performed across the entire video repository, including summary-based questions.
Graph RAG: Extracts relevant nodes from the graph using vector search and graph traversal.
Response Generation: Passes the structured data to Azure GPT-4o to generate a detailed response.
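The node-extraction step combines vector search (to find seed nodes) with graph traversal (to pull in their neighborhood). A minimal traversal sketch, using a plain adjacency dict as a stand-in for the graph database:

```python
def collect_graph_context(graph, seed_nodes, hops=1):
    """Expand vector-search seed nodes by breadth-first graph traversal.

    graph: dict mapping node id -> list of neighbor ids (stand-in for
    the graph database). seed_nodes: ids returned by vector search
    over node embeddings. Returns every node within `hops` hops; the
    caller serializes these nodes and passes them to GPT-4o.
    """
    frontier = set(seed_nodes)
    visited = set(seed_nodes)
    for _ in range(hops):
        nxt = set()
        for node in frontier:
            for neighbor in graph.get(node, []):
                if neighbor not in visited:
                    visited.add(neighbor)
                    nxt.add(neighbor)
        frontier = nxt
    return visited
```

Pulling in neighbors of the seed nodes is what surfaces cross-video connections (shared topics, speakers, entities) that chunk-only retrieval would miss.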
Features
Speaker Diarization: Distinguish between multiple speakers in the video transcripts.
Keyframe Extraction: Extract image keyframes to associate with text data.
Advanced Embeddings: Use OpenAI models for generating text and image embeddings.
Graph Database Integration: Store and retrieve data in a structured graph format using Graph RAG.
Entity Disambiguation: Avoid repetition of entities with similar names and meanings.
Local and Global Querying: Retrieve information specific to a video or across the entire video repository.
Technology Stack
Azure AI Search for text and image indexes
Azure Speech-to-Text (STT) with speaker diarization
Azure OpenAI (Ada embedding model, GPT-4o)
OpenAI CLIP for image embeddings
Graph RAG for graph-based retrieval
Entity and Relationship Extraction for knowledge graph construction
Project Name
VidSage
Description
VidSage: Video Insights using Graph RAG
https://www.youtube.com/watch?v=IUSCWtB9jWk
Technology & Languages
Project Repository URL
https://github.com/sujithrkumar/ms_raghack
Deployed Endpoint URL
No response
Project Video
https://www.youtube.com/watch?v=IUSCWtB9jWk
Team Members
MayankKeshariC5, sujith-rkumar, maheshpandeycourse5, saurabhkanekar