tattle-made / feluda

A configurable engine for analysing multi-lingual and multi-modal content.
https://tattle.co.in/products/feluda/
GNU General Public License v3.0

[DMP 2024]: Clustering large amount of audio #82

Open dennyabrain opened 4 months ago

dennyabrain commented 4 months ago

Ticket Contents

Description

Feluda allows researchers, fact-checkers and journalists to explore and analyze large quantities of multimedia content. One important modality on Indian social media is audio. The scope of this task is to explore various automated techniques suited to grouping similar audio together and visualizing the groups. After consultation with the team, implement an end-to-end workflow that can be used to surface visual or temporal trends in a large collection of audio.

Goals

Expected Outcome

Feluda's goal is to provide a simple CLI or scriptable interface for analysing multimodal social media data. In that vein, all the work that you do should be executable and configurable via scripts and config files. The solution should look at Feluda's architecture and its various components to identify the best ways to enable this. The solution should have a way to configure the data source (a database with file IDs or an S3 bucket with files), specify and implement the data processing pipeline, and configure where the results will be stored. Our current implementation uses S3 and a SQL database as data sources and Elasticsearch for storing results, but additional sources or stores can be added if apt for this project.
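To make the "configurable via config files" requirement concrete, here is a minimal sketch of what a pipeline config and its loader could look like. Every name in it (`SourceConfig`, `StoreConfig`, `PipelineConfig`, `load_config`, the step names, the bucket and index names) is hypothetical and chosen for illustration; none of it is Feluda's actual config format.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SourceConfig:
    kind: str       # hypothetical values: "s3" or "sql", the sources named in the ticket
    location: str   # bucket prefix or table name (illustrative only)


@dataclass
class StoreConfig:
    kind: str       # e.g. "elasticsearch"
    index: str      # where results would be written (illustrative only)


@dataclass
class PipelineConfig:
    source: SourceConfig
    steps: List[str] = field(default_factory=list)  # ordered processing step names
    store: Optional[StoreConfig] = None


def load_config(text: str) -> PipelineConfig:
    """Parse a JSON string into a PipelineConfig, rejecting unknown source kinds."""
    raw = json.loads(text)
    source = SourceConfig(**raw["source"])
    if source.kind not in {"s3", "sql"}:
        raise ValueError(f"unsupported source kind: {source.kind}")
    store = StoreConfig(**raw["store"])
    return PipelineConfig(source=source, steps=list(raw["steps"]), store=store)


# Hypothetical config: read audio from S3, embed, reduce, cluster, store the result.
EXAMPLE = """
{
  "source": {"kind": "s3", "location": "example-bucket/audio/"},
  "steps": ["embed_audio", "reduce_dimensions", "cluster"],
  "store": {"kind": "elasticsearch", "index": "audio-clusters"}
}
"""

config = load_config(EXAMPLE)
print(config.source.kind, config.steps, config.store.index)
```

Keeping the source, the ordered steps and the store as separate sections means a new source or store type only needs a new `kind` value and a matching handler, without touching the pipeline steps.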

Acceptance Criteria

Implementation Details

One way we have approached this is by using vector embeddings. We have done this to great success to surface visual trends in images. We used a ResNet model to generate vector embeddings and stored them in Elasticsearch. We also used t-SNE to reduce the dimensions of the vector embeddings and display them in a 2D visualization. It can be viewed here. A detailed report on Feluda's usage in a project to analyze images can be read here. The relevant Feluda operator can be studied here. The code for t-SNE is here. A prior study of various ways to get insights out of images has been documented here.
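As a rough illustration of the "reduce the embeddings to 2D for plotting" step, the sketch below projects a few toy vectors onto their top two principal components. Note that this is PCA via power iteration, swapped in for t-SNE because it fits in a few dependency-free lines; it is not the t-SNE code referenced above, and the four-dimensional vectors are made-up stand-ins for real embeddings.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def center(rows):
    """Subtract the per-dimension mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def top_component(rows, iterations=200, seed=0):
    """Power iteration on X^T X: returns the unit direction of largest variance."""
    rng = random.Random(seed)
    d = len(rows[0])
    v = [rng.uniform(-1, 1) for _ in range(d)]
    for _ in range(iterations):
        scores = [dot(r, v) for r in rows]                      # X v
        w = [sum(s * r[j] for s, r in zip(scores, rows)) for j in range(d)]  # X^T (X v)
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v


def project_2d(vectors):
    """Project vectors onto their first two principal components."""
    rows = center(vectors)
    first = top_component(rows, seed=0)
    # Deflate: remove the first component so the next iteration finds the second.
    deflated = [[x - dot(r, first) * f for x, f in zip(r, first)] for r in rows]
    second = top_component(deflated, seed=1)
    return [(dot(r, first), dot(r, second)) for r in rows]


# Toy "embeddings": two items near the origin, two far away from it.
toy_embeddings = [
    [0.0, 0.0, 0.0, 0.0],
    [0.1, 0.0, 0.1, 0.0],
    [10.0, 10.0, 0.0, 0.0],
    [10.1, 10.0, 0.0, 0.1],
]
coords = project_2d(toy_embeddings)
for point in coords:
    print(point)
```

The two similar pairs end up on opposite sides of the first axis. Unlike t-SNE, PCA is linear, so it preserves global spread rather than local neighbourhoods.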

Mockups/Wireframes

This is an interactive visualization of image clustering done using Feluda.

Doing UI development or integrating with any UI software is not part of this project but it might help to see what sort of downstream applications we use Feluda for.

Product Name

Feluda

Organisation Name

Tattle

Domain

Open Source Library

Tech Skills Needed

Machine Learning, Python

Mentor(s)

@dennyabrain @duggalsu

Category

Data Science, Machine Learning, Research

MadhukeshSingh commented 2 months ago

Hi there, @dennyabrain , I'm passionate about machine learning and keen on joining this project.

Here's a bit about myself: I am Madhukesh Singh, currently studying at the National Institute of Technology, Hamirpur, in my third year.

My experience includes working on image processing, computer vision, and object detection in satellite imagery during my internship as an AI developer at DRDO DYSL.AI.

Is there a preferred method for communicating with the mentors? I'm eager to contact you and explore how I can contribute.

dennyabrain commented 2 months ago

Hi @MadhukeshSingh we can use this issue to communicate approaches. If you start concretely implementing something, you can make a new issue specific to your approach and we can take the conversation there.

Tahseen23 commented 2 months ago

"Hi there, @dennyabrain! I want to contribute to this project, but I am new to open-source contribution. So, can you tell me what I have to do in this project and how to contribute?"

manisha1301 commented 2 months ago

Hi there, @dennyabrain, I'm passionate about machine learning and keen on joining this project. I bring a robust skill set encompassing advanced machine learning and natural language processing. My adaptability, efficiency in information retrieval, and quick learning make me a valuable asset for tasks requiring machine learning, AI-driven insights, data analysis, and language-related applications. I am equipped to contribute to the team's goal by leveraging cutting-edge AI technology and staying abreast of industry trends.

Here's a bit about myself: I am Manisha Sharma, currently studying at the Gd Goenka University, Gurugram, Haryana, 4th last year.

My experience includes working on deep learning, machine learning, artificial neural networks, and artificial crypt analysis during my internship as an AI developer at Sag - DRDO, and I am currently working at Interglobe Aviation as a data analyst intern.

Is there a preferred method for communicating with the mentors? I'm eager to contact you and explore how I can contribute.

poozasingh commented 2 months ago

Hi @dennyabrain, I'm Pooja Singh, a software developer intern at Verana Networks, passionate about machine learning and eager to join this project. My robust skill set in advanced machine learning and natural language processing, coupled with my adaptability and efficiency in information retrieval, make me a valuable asset.

I have hands-on experience in machine learning, and artificial intelligence from my internship as well as current work. I'm keen to connect with mentors to explore how I can contribute to the team's goals.

What's the preferred method for communication? Looking forward to hearing from you.

sreyash-layek commented 2 months ago

Hello @dennyabrain , I'm thrilled to delve into the Feluda project and its objectives. After reviewing the documentation, I noticed that my background aligns well with the project's needs.

A little about myself: My name is Sreyash Layek, and I'm currently in my fifth year at the Indian Institute of Technology, Kharagpur, pursuing a Dual Degree (Integrated B.Tech & M.Tech) with a specialization in Signal Processing and Machine Learning.

Over the past three years, I've dedicated myself to exploring Machine Learning, with a particular focus on Computer Vision and Natural Language Processing tasks. I've spent a year working on Speech Processing and Accent Conversion, achieving results close to the state-of-the-art. Additionally, I've developed models for various applications, including Attention Monitoring, Accident Classification, Audio Classification, Emotion Classification, Recommendation Systems, and more.

I bring to the table over five years of experience in Python and three years in Machine Learning and Deep Learning. I'm eager to learn more about the project and discuss how I can contribute. I'd be interested in understanding your expectations and the specific requirements for this project.

Could we explore this further?

Sbswag commented 2 months ago

Hello @dennyabrain, my name is Surjeet Bijarniya and I am a student at IIT BHU. I am passionate about machine learning and eager to join this project, but I am new to machine learning. Please tell me how I can contribute.

KAMERAVAMSHI commented 2 months ago

Hello @dennyabrain! I'm enthusiastic about machine learning and eager to be part of this project.

Allow me to introduce myself: I'm Kamera Vamshi, currently in the final year of my B.Tech at the National Institute of Technology, Rourkela (NIT Rourkela).

My background involves significant experience in Machine Learning, Python, and Data Analysis. I honed these skills during my internship and Projects.

Could you please advise on the preferred method for reaching out to mentors? I'm keen to connect and discuss how I can contribute to the project.

AkanshuAich commented 2 months ago

Hii @dennyabrain ,

I am Akanshu Aich, a third-year B.Tech student from the International Institute of Information Technology, Bhubaneswar. I am writing to express my interest in contributing to this project as a part of DMP 2024. Having thoroughly reviewed the project, I am impressed by its objectives and see its potential for great impact in industry.

With my background in backend development using Django and MERN, along with hands-on practice in machine learning and DevOps tools such as Docker, I believe I can make valuable contributions to the machine learning part. My experience includes several projects such as a Society-Expenditure Manager using Django, a Real Estate app using MERN, and an Info-Finding Tool using machine learning (LLM), which I believe align well with the goals of your project.

I am particularly interested in fulfilling the requirements of the project and have some ideas on how to approach it effectively. I am committed to adhering to best practices, contributing high-quality code, and actively collaborating with the project maintainers and community.

I am excited about the opportunity to contribute to "Feluda" and help further its mission. I look forward to discussing potential contributions and how I can best support the project.

Please guide me with procedure and with all your knowledge and experience.

manavsolkar commented 2 months ago

Hello @dennyabrain! I'm enthusiastic about machine learning and eager to be part of this project.

Allow me to introduce myself: I'm Manav Solkar, currently in the second year of my B.Tech at Thakur College of Engineering and Technology (TCET).

I really want to be a part of this and hope that your guidance will help me improve my skill set.

Could you please advise on the preferred method for reaching out to mentors? I'm keen to connect and discuss how I can contribute to the project

Tatwansh commented 2 months ago

Hey @dennyabrain and @duggalsu, I am interested in working on this project. I have prior experience working on a project with similar objectives on the QAnon dataset. You can check out my work at the link below.

notebook link: https://www.kaggle.com/code/tatwanshjaiswal/dark-web-language-analysis

I would be happy to receive feedback on how to improve it.

AbhimanyuSamagra commented 2 months ago

Do not ask process related questions about how to apply and who to contact in the above ticket. The only questions allowed are about technical aspects of the project itself. If you want help with the process, you can refer instructions listed on Unstop and any further queries can be taken up on our Discord channel titled DMP queries.

ashuashutosh2211 commented 2 months ago

Hey @dennyabrain and @duggalsu, I am Ashutosh pursuing B.Tech. in Artificial Intelligence and Data Science from IIT Jodhpur. I am proficient in languages like Python and C++. I have worked on projects related to machine learning and deep learning such as Stock Price Prediction and Voice Controlled Music Recommendation System using Deep Learning. I am interested to work on this project and apply my skills in the project.

dennyabrain commented 2 months ago

Hi everyone,

Thank you for expressing interest in this issue. Depending on your interests and skills, you can take ANY ONE of the following approaches :

  1. Look at the problem statement and propose your approach. Remember the main problem statement: given a large number of audio files, find a way to group identical and similar audio files. This approach would be ideal for anyone who is interested in or studies ML and/or DSP. By thinking about the problem statement, reviewing existing literature on it and proposing your approach here, we would all learn something from it and the mentors should be able to nudge you in the right direction.

  2. Try getting Feluda working on your machine. Feluda is a moderately complex piece of software and has many moving parts. Getting it working on your machine can itself be a challenge. We have a guide on it here. If you are a software developer/tinkerer, this might be a good place to start, because once you have Feluda working locally and can see the various existing functionalities, that might give you an idea of how to proceed.

  3. Recreate our code in a Jupyter notebook or Google Colab notebook. We already have some code that takes audio files and converts them into vectors. We also have code that takes these vectors and clusters them. I would take this approach if you are a software engineer with some ML engineering skills and you know your way around using ML models. Once you get this working in your notebook, we can try out different pretrained models to evaluate performance.

You'll have me or members of our team to guide you if you get stuck on any of these approaches. Taking some concrete steps on any of these 3 approaches would help us know what your interests and skills are and let us give you concrete feedback when you get stuck.
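To make the problem statement a little more tangible, here is a minimal sketch of grouping identical and similar files once each one has an embedding vector. It assumes the vectors already exist; the file names, the toy three-dimensional vectors, the `group_similar` helper and the 0.9 threshold are all illustrative and not taken from Feluda's code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_similar(embeddings, threshold=0.9):
    """Greedy grouping: a file joins the first group whose representative
    (the group's first member) is at least `threshold` similar to it."""
    groups = []  # list of (representative_vector, [file_ids])
    for file_id, vector in embeddings.items():
        for representative, members in groups:
            if cosine_similarity(representative, vector) >= threshold:
                members.append(file_id)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append((vector, [file_id]))
    return [members for _, members in groups]


# Toy embeddings: a_copy.wav is a near-duplicate of a.wav, b.wav is unrelated.
toy = {
    "a.wav": [1.0, 0.0, 0.0],
    "a_copy.wav": [0.99, 0.05, 0.0],
    "b.wav": [0.0, 1.0, 0.0],
}
groups = group_similar(toy)
print(groups)
```

This greedy pass is order-dependent and only compares against one representative per group, so it is best read as a baseline for near-duplicate detection rather than a full clustering method.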

All the best!

vishakha72 commented 2 months ago

Hello @dennyabrain, I really want to contribute to this project. I have good hands-on experience with Python, machine learning, databases and deep learning. I am a Data Science student and am really enthusiastic about working on your project. Over the past 3 years, I have done a lot of real-time projects and many internships to gain hands-on experience. I want to learn and gain deeper experience by working on this project. Please allow me to work on your project.

Satyam0775 commented 2 months ago

Hello @dennyabrain,

I'm eager to contribute to your project. With substantial experience in Python, machine learning, databases, and deep learning, I believe I can make valuable contributions. As a data science student, I've spent the past three years working on various real-world projects and completing internships to hone my skills. I'm enthusiastic about delving deeper into the field and gaining practical experience through involvement in your project. I'm eager to learn and collaborate effectively. Please consider allowing me to be part of your team.

AbhimanyuSamagra commented 2 months ago

Do not ask process related questions about how to apply and who to contact in the above ticket. The only questions allowed are about technical aspects of the project itself. If you want help with the process, you can refer instructions listed on Unstop and any further queries can be taken up on our Discord channel titled DMP queries. Here's a Video Tutorial on how to submit a proposal for a project.

Chaithanya512 commented 2 months ago

hi @dennyabrain,

I am Chaithanya Kalyan. I am interested in contributing to this project.

I have experience working with time series signals. As part of the PhysioNet 2023 challenge, time domain and frequency domain features were extracted to classify the EEG signals (more details here).

I have a doubt regarding the details of this project and would greatly appreciate the clarification:

  1. Does this clustering algorithm have to be scalable to different datasets (like a general framework that can be extended ) or is it only for a specific dataset?

I think the following approach will be worth trying: without extracting the traditional audio features, we can train an autoencoder network on a large audio collection to automatically learn a low-level representation of the audio signals and cluster based on these latent representations.

I have tried a similar approach on EEG signals before, you can find that notebook here.

I would be happy to hear your feedback.

contact: chay5522kalyan@gmail.com

dennyabrain commented 2 months ago

Hi @Chaithanya512,

Given that the project's focus is on addressing use cases around online misinformation, the dataset we deal with is usually audio/video found on social media. So it can contain a variety of audio: memes, news clippings, amateur recordings from phones, etc.

Is there a quick way to validate if the autoencoder network approach would be suitable for this use case? What is your rationale for preferring that over extracting traditional audio features?

Chaithanya512 commented 2 months ago

Thank you for the feedback, I am currently working on the code to validate the use of autoencoders.

Compared to traditional, hand-crafted features, autoencoders have the potential to capture a wider range of features. While traditional audio features are valuable, they might miss some subtle patterns in the data that autoencoders can discover.
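For reference, this is the kind of hand-crafted feature being contrasted with learned representations. The sketch computes two classic ones, zero-crossing rate and RMS energy, on synthetic sine waves; the sample rate, durations and frequencies are arbitrary choices for illustration, and real audio would first need decoding into samples.

```python
import math


def sine_wave(freq_hz, sample_rate=8000, seconds=0.5):
    """Generate a pure tone as a list of float samples in [-1, 1]."""
    n = int(sample_rate * seconds)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(samples) - 1)


def rms_energy(samples):
    """Root-mean-square amplitude, a rough loudness measure."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


low_tone = sine_wave(100)    # slowly oscillating signal
high_tone = sine_wave(1000)  # rapidly oscillating signal

low_features = (zero_crossing_rate(low_tone), rms_energy(low_tone))
high_features = (zero_crossing_rate(high_tone), rms_energy(high_tone))
print("100 Hz :", low_features)
print("1000 Hz:", high_features)
```

The two tones have nearly the same RMS energy but very different zero-crossing rates, which shows how each hand-crafted feature captures only one narrow property of the signal.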

I have a follow-up question (might be stupid) for your response, please correct me if I am wrong.

I'm curious, do you think traditional audio features are effective in clustering misinformation and not-misinformation? Do those features vary between misinformation and not-misinformation?

dennyabrain commented 2 months ago

So we won't be using the clusters to classify something as "misinformation" and "not misinformation". We're hoping to use clustering as a way to find a first level of grouping within a large dataset. So most likely the clusters could be something high level like "memes", "amateur-smartphone" etc. If we are lucky we could aspire for thematic labels like "politics", "health" etc.

An example of clustering we did on images is here - https://tattle.co.in/articles/covid-whatsapp-public-groups/t-sne/ The clusters we got then were - Screenshots(Social Media), Screenshots(Other), Medical Supplies, Paper Documents, Religious Imagery etc
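As a small illustration of that first-level grouping, here is a dependency-free k-means sketch on toy 2D points. It is an assumption-laden example rather than the method used for the image study: the points stand in for reduced embeddings, and `k`, the iteration count and the seed are arbitrary.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=20, seed=0):
    """Plain k-means: returns one cluster label (0..k-1) per input point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]  # k distinct starting points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Two well-separated toy groups standing in for reduced audio embeddings.
toy_points = [
    [0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
    [5.0, 5.0], [5.1, 4.9], [4.8, 5.2],
]
labels = k_means(toy_points, k=2)
print(labels)
```

The label numbers themselves carry no meaning; a human still has to inspect each cluster and name it, as was done for the image clusters listed above.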

Chaithanya512 commented 2 months ago

thank you for the clarification. That makes sense now. So, we are using clustering only to find the high-level labels/pseudo labels. I have found this paper that uses labeled data (only text) to categorize misinformation posters or active citizens on social media. It got me thinking - if we could obtain the transcriptions of the audio content (if that is possible), that information could significantly enhance our clustering efforts.

dennyabrain commented 2 months ago

@Chaithanya512 yes, that would certainly help. In fact, when we do clustering for images, we often try to extract any text out of them as a way to get a richer dataset. You can certainly try transcriptions for audio content. One challenge might be that we are dealing with non-English languages and also low-quality audio.

preeti13456 commented 2 months ago

Hey, can I work on this issue? I have worked on speech attenuation in the past, so I am somewhat familiar with the problem statement. Kindly let me know.

Ahmedfurkhan commented 2 months ago

Hey !! I Want to work on this

Ankita-Mohan commented 2 months ago

Hi there, @dennyabrain, I am Ankita Mohan, I am a third-year student at Kalinga Institute of Industrial Technology, Odisha. I'm passionate about machine learning and keen on joining this project. Moreover, I have a deep understanding of clustering algorithms as I have done projects in clustering. I am eager to contribute and to gain your guidance for the same.

Pushkar0730 commented 2 months ago

I would definitely like to work on it ☺️

dennyabrain commented 2 months ago

Hi all thanks for your enthusiasm. Please let me know if you have any specific ideas on how you would go about the project.

Please refer to this comment for some suggested ways to move forward https://github.com/tattle-made/feluda/issues/82#issuecomment-2058794148

PriyalPB commented 2 months ago

Hi @dennyabrain ! I'm a third year student from Cummins Pune.

I'm thrilled to join your "Clustering large amount of audio" project and offer my skill set, which includes a strong background in machine learning, deep learning (CNN), NLP, DSP and Python, and seems to fit perfectly with what you're looking for. I'm excited to explore how my expertise can elevate the project. Furthermore, integrating computer vision with the ML advancements could lead to a seamlessly automated system. I'm eager to discuss further avenues where I can make meaningful contributions. Could we schedule a meeting to delve into this in more detail?

CodeSage4 commented 1 month ago

My skills in machine learning (computer vision, NLP) and experience with speech processing align well with the Feluda project. I'm a motivated student with 3+ years of Python experience and 2 years in ML/DL. Eager to discuss how I can contribute!

VDinesh03 commented 1 month ago

Hi @dennyabrain, I am V Dinesh, a third-year Mechanical Engineering student from the Army Institute of Technology, Pune. I'm passionate about machine learning and keen on joining this project. I have in-depth experience with clustering algorithms, gained through multiple hands-on projects focused on implementing and fine-tuning various clustering techniques. These projects have given me a solid understanding of the underlying principles and practical applications of clustering across diverse domains, allowing me to work through complex datasets, identify patterns, and extract meaningful insights. I am enthusiastic about contributing and eager to receive your guidance to further enhance my capabilities.

pandharkardeep commented 1 month ago

Hi @dennyabrain. I am Deep Pandharkar, a second-year Data Science Engineering student from DJ Sanghvi College of Engineering, Mumbai. I have some experience in CV as well as NLP. My passion for ML makes me keen to join this project. In addition, I have worked a lot with vector embeddings as part of my NLP projects. I also have coding experience in data structures and algorithms. Eager to discuss how I can contribute.

Sufia-ahmad commented 1 month ago

I am Sufia, and I graduated with a B.Tech in CSE. I am a data scientist and also a full-stack developer, but I am a fresher: I have only completed 6 months of training in the field and one month of internship, so I would like to do this internship.

aatmanvaidya commented 1 week ago

Weekly Goals

Week 1

Week 2

poozasingh commented 1 week ago

Hello can you please make a call in 6299 143 824.

On Tue, 18 Jun, 2024, 17:57 Aatman Vaidya, @.***> wrote:

Weekly Goals Week 1

  • Setup Feluda and run tests for AudioVecEmbedding Operator
  • Collect a dataset of 150-200 Audio Files
  • Run the Feluda AudioVec Operator on the dataset, reduce dimensions using t-SNE and do a visual plot. This will act as a baseline for us
  • Try out different Embedding Models


Chaithanya512 commented 2 days ago

Weekly Learnings and Updates:

Week 1:

Colab file: https://colab.research.google.com/drive/1lBrWCyUsuCSTOEUUqDwfc6FzpQWO0ETt?usp=sharing