Shruti-Drishti is an innovative project aimed at addressing the communication gap between the deaf and non-deaf communities in South Asia, particularly in India. By leveraging deep learning models and state-of-the-art techniques, we strive to facilitate seamless communication and promote inclusivity for individuals with hearing impairments. 🌟
Sign Language to Text Conversion 🖐️➡️📝: Our custom Transformer-based multi-headed-attention encoder, built on keypoints extracted with Google's MediaPipe, accurately converts sign language videos into text, overcoming challenges posed by visually similar dynamic signs.
Text to Sign Language Generation 📝➡️🖐️: Utilizing an agentic LLM framework, Shruti-Drishti converts textual input into masked-keypoint-based sign language videos, tailored specifically for Indian Sign Language.
Multilingual Support 🌐: Our app uses IndicTrans2 to support all 22 scheduled Indian languages. Accessibility is our top priority, and we make sure that everyone is included.
Content Accessibility 📰🎥: Shruti-Drishti enables news channels and content creators to expand their user base by making their content accessible and inclusive through embedded sign language video layouts.
Link to the Dataset: INCLUDE Dataset
The INCLUDE dataset, sourced from AI4Bharat, forms the foundation of our project. It consists of 4,292 videos, with 3,475 videos used for training and 817 videos for testing. Each video captures a single Indian Sign Language (ISL) sign performed by deaf students from St. Louis School for the Deaf, Adyar, Chennai.
Shruti-Drishti employs two distinct models for real-time Sign Language Detection:
LSTM-based Model 📈: Leveraging pose keypoints extracted with MediaPipe, this model uses a recurrent neural network (RNN) built from Long Short-Term Memory (LSTM) cells to classify signs.
Transformer-based Model 🔄: Trained through extensive experimentation and hyperparameter tuning, this model offers enhanced performance and adaptability.
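The recurrence behind the LSTM-based model can be illustrated with a minimal NumPy sketch of a single Long Short-Term Memory cell stepping over a sequence of per-frame keypoint vectors. The shapes and random weights below are illustrative placeholders, not our trained model:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step: gates computed from the current keypoint
    vector x_t and the previous hidden state h_prev."""
    z = W @ x_t + U @ h_prev + b          # stacked gate pre-activations
    H = h_prev.shape[0]
    i = sigmoid(z[0:H])                   # input gate
    f = sigmoid(z[H:2*H])                 # forget gate
    o = sigmoid(z[2*H:3*H])               # output gate
    g = np.tanh(z[3*H:4*H])               # candidate cell state
    c = f * c_prev + i * g                # updated cell state
    h = o * np.tanh(c)                    # updated hidden state
    return h, c

rng = np.random.default_rng(0)
n_frames, n_keypoints, hidden = 30, 132, 64   # e.g. 33 pose landmarks x 4 values
W = rng.normal(scale=0.1, size=(4 * hidden, n_keypoints))
U = rng.normal(scale=0.1, size=(4 * hidden, hidden))
b = np.zeros(4 * hidden)

h = np.zeros(hidden)
c = np.zeros(hidden)
for x_t in rng.normal(size=(n_frames, n_keypoints)):  # stand-in for MediaPipe keypoints
    h, c = lstm_step(x_t, h, c, W, U, b)

print(h.shape)  # final hidden state summarising the sign video
```

The final hidden state `h` is what a classification layer would consume to predict the sign label.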
We have also implemented the VideoMAE model, proposed in the paper "VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training". We explored fine-tuning techniques such as QLoRA, PEFT adapters, joint head-and-backbone fine-tuning, and head-only fine-tuning, with head-only fine-tuning proving to be the most successful approach.
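Head-only fine-tuning, the variant that worked best for us, keeps the pre-trained backbone frozen and updates only the classification head. A minimal NumPy sketch of the idea, where the fixed random projection, class count, and gradient step are illustrative stand-ins for the actual VideoMAE weights and training loop:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the frozen VideoMAE backbone: a fixed projection
# from a flattened video clip to a feature embedding.
backbone_W = rng.normal(scale=0.01, size=(768, 1024))

def backbone(x):
    return np.tanh(backbone_W @ x)        # frozen: never updated

n_classes = 50                            # illustrative label count
head_W = np.zeros((n_classes, 768))       # trainable classification head

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def train_step(x, y, lr=0.1):
    """One SGD step on the head only; the backbone stays frozen."""
    global head_W
    feats = backbone(x)
    probs = softmax(head_W @ feats)
    grad = np.outer(probs - np.eye(n_classes)[y], feats)  # cross-entropy gradient
    head_W -= lr * grad

x = rng.normal(size=1024)                 # stand-in for one video clip
before = backbone_W.copy()
train_step(x, y=7)
assert np.array_equal(backbone_W, before)  # backbone untouched by training
```

Because gradients only flow into the small head, this approach trains far fewer parameters than backbone fine-tuning, which helps on a dataset of this size.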
Shruti-Drishti tackles the communication gap through a two-fold approach:
Sign Language to Text: Implementing a custom Transformer-based multi-headed-attention encoder over keypoints extracted with Google's MediaPipe, we convert sign language videos into text while addressing challenges posed by visually similar dynamic signs.
Text to Sign Language: Utilizing an agentic LLM framework, Shruti-Drishti converts textual input into masked-keypoint-based sign language videos, tailored specifically for Indian Sign Language.
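The encoder side of this approach boils down to multi-headed self-attention over per-frame keypoint embeddings. A minimal NumPy illustration of one such attention layer, where the dimensions and random weights are placeholders rather than our trained encoder:

```python
import numpy as np

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """X: (frames, d_model) sequence of keypoint embeddings."""
    T, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Split projections into heads: (n_heads, frames, d_head)
    split = lambda M: M.reshape(T, n_heads, d_head).transpose(1, 0, 2)
    Q, K, V = split(Q), split(K), split(V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # scaled dot-product
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over frames
    heads = weights @ V                                   # (n_heads, frames, d_head)
    concat = heads.transpose(1, 0, 2).reshape(T, d_model)
    return concat @ Wo                                    # final output projection

rng = np.random.default_rng(0)
frames, d_model, n_heads = 30, 64, 4
X = rng.normal(size=(frames, d_model))    # embedded MediaPipe keypoints per frame
Wq, Wk, Wv, Wo = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(4))
out = multi_head_self_attention(X, Wq, Wk, Wv, Wo, n_heads)
print(out.shape)  # one contextualised vector per frame
```

Each head attends over all frames at once, which is what lets the encoder distinguish dynamic signs whose individual frames look similar.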
Pose-to-Text Implementation: Develop and implement a Pose-to-Text model based on the referenced paper for the Indian Sign Language dataset, using an agentic LangChain-based state flow as the decoder stage for text-to-gloss conversion and for merging masked-keypoint videos.
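The decoder-stage state flow above can be sketched as a plain-Python pipeline: translate text into a gloss sequence, look up a masked-keypoint clip per gloss, and concatenate the clips along the time axis. Everything here is a hypothetical placeholder, including the tiny gloss lexicon and the rule-based `text_to_gloss` standing in for the agentic LLM stage:

```python
import numpy as np

# Hypothetical lexicon: gloss -> masked-keypoint clip of shape (frames, keypoints).
rng = np.random.default_rng(0)
LEXICON = {gloss: rng.normal(size=(int(rng.integers(20, 40)), 132))
           for gloss in ["HELLO", "HOW", "YOU"]}

def text_to_gloss(text):
    """Stand-in for the agentic LLM stage: keep only glosses we can sign."""
    return [w.upper() for w in text.split() if w.upper() in LEXICON]

def merge_clips(glosses):
    """Concatenate per-gloss keypoint clips along the time axis."""
    return np.concatenate([LEXICON[g] for g in glosses], axis=0)

glosses = text_to_gloss("hello how are you")
video = merge_clips(glosses)
print(glosses, video.shape)
```

Note that "are" is dropped: sign language gloss sequences omit many function words, which is exactly the kind of restructuring the text-to-gloss stage handles.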
Custom Transformer Model Evaluation: Assess the effectiveness of our custom Transformer/LSTM model on the Sign Language Dataset, focusing on accuracy and adaptability to dynamic signs.
Multilingual App Development: Create a user-friendly multilingual app serving as an interface for our Sign Language Translation services, ensuring easy interaction and adoption by both deaf and non-deaf users.
For detailed results and insights, please refer to our presentation slides.
(TODO)