parth126 / IT550

Project Proposals for the IT-550 Course (Autumn 2024)

Enhancing Information Retrieval Using Topic Modeling #26

Open KavishaMadani opened 1 month ago

KavishaMadani commented 1 month ago

Title

Enhancing Information Retrieval Using Topic Modeling

Team Name

InfoSphere

Email

202318007@daiict.ac.in

Team Member 1 Name

Kavisha Madani

Team Member 1 Id

202318007

Team Member 2 Name

Vishaka Nair

Team Member 2 Id

202318041

Team Member 3 Name

Srushti Bhagchandani

Team Member 3 Id

202318047

Team Member 4 Name

Shubham Gupta

Team Member 4 Id

202318052

Category

Optimizing an existing system

Problem Statement

This project aims to enhance information retrieval systems by integrating topic modeling, specifically Latent Dirichlet Allocation (LDA) and possibly a Fast Deterministic CUR-based approach, with Retrieval-Augmented Generation (RAG). We will train an LDA model on a diverse document corpus, group documents into distinct topics, and associate each document with a topic distribution. For each incoming query, we will infer its topic distribution and use it to augment the query with relevant keywords. The retrieval mechanism will then prioritize documents that align thematically with the query by combining traditional similarity metrics with topic similarity. The goal is for generated responses to be both contextually relevant and aligned with the query's underlying topics.
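The retrieval step described above can be sketched roughly as follows. This is a minimal illustration, not the team's implementation: it assumes the topic distributions have already been inferred by a trained LDA model, and the helper names (`augment_query`, `rank`), the weight `alpha`, the Hellinger-based topic similarity, and the toy data are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (dict-like)."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def hellinger_similarity(p, q):
    """1 minus the Hellinger distance between two topic distributions."""
    total = sum((math.sqrt(x) - math.sqrt(y)) ** 2 for x, y in zip(p, q))
    return 1.0 - math.sqrt(total / 2.0)


def augment_query(tokens, query_topics, topic_keywords, n_keywords=2):
    """Append top keywords of the query's dominant topic (hypothetical helper)."""
    dominant = max(range(len(query_topics)), key=lambda i: query_topics[i])
    extra = [w for w in topic_keywords[dominant] if w not in tokens][:n_keywords]
    return tokens + extra


def rank(query_tokens, query_topics, docs, alpha=0.5):
    """Rank documents by alpha * lexical similarity + (1 - alpha) * topic similarity."""
    q_vec = Counter(query_tokens)
    scored = []
    for doc_id, (doc_tokens, doc_topics) in docs.items():
        lexical = cosine(q_vec, Counter(doc_tokens))
        topical = hellinger_similarity(query_topics, doc_topics)
        scored.append((alpha * lexical + (1 - alpha) * topical, doc_id))
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]


# Toy data: in practice the topic vectors and keywords would come from the LDA model.
topic_keywords = [["oil", "crude", "barrel"], ["bank", "rate", "loan"]]
docs = {
    "d1": (["oil", "prices", "rose"], [0.9, 0.1]),
    "d2": (["bank", "raises", "rate"], [0.1, 0.9]),
}
query_topics = [0.8, 0.2]
augmented = augment_query(["crude", "prices"], query_topics, topic_keywords)
ranking = rank(augmented, query_topics, docs)
print(augmented, ranking)
```

The weight `alpha` controls the balance between the lexical and topical scores; a suitable value would have to be tuned on held-out queries.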

Evaluation Strategy

Precision, Recall, and F1-score: These will measure the relevance of retrieved documents.

Query Response Time: This will assess the efficiency of the proposed model, with an aim to make the retrieval process at least 10% faster than the baseline.

Topic Relevance Score: This will evaluate how well the documents retrieved match the query’s thematic topics using topic coherence scores.

Human Feedback Evaluation: Using a small sample, human evaluators will assess the accuracy and relevance of responses generated by the RAG system.
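As a rough illustration of the first two metrics, the sketch below computes set-based precision, recall, and F1 for a single query, and the relative speedup against a baseline timing. The function names, document IDs, and timings are made up for the example; the proposal does not specify how relevance judgments or timings will be collected.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and F1 for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def speedup_percent(baseline_seconds, proposed_seconds):
    """Relative reduction in response time compared with the baseline, in percent."""
    return 100.0 * (baseline_seconds - proposed_seconds) / baseline_seconds


# Hypothetical judgments for one query: 2 of the 4 retrieved documents are relevant.
p, r, f1 = precision_recall_f1(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")

# Hypothetical timings: the 10% target is met if the result is at least 10.
print(speedup_percent(2.0, 1.7) >= 10.0)
```

In practice these per-query values would be averaged over the whole query set before comparing the proposed system with the baseline.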

Dataset

https://www.kaggle.com/datasets/thedevastator/uncovering-financial-insights-with-the-reuters-2?select=ModApte_train.csv
https://www.kaggle.com/code/pranjalsoni17/topic-modelling-using-lda

Resources

Paper Title - Latent Dirichlet Allocation, by David M. Blei, Andrew Y. Ng, and Michael I. Jordan
Paper Link - https://www.researchgate.net/publication/221620547_Latent_Dirichlet_Allocation

Paper Title - Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, by Lewis et al.
Paper Link - https://arxiv.org/abs/2005.11401

Paper Title - Fast Deterministic CUR Matrix Decomposition with Accuracy Assurance, by Yasutoshi Ida, Sekitoshi Kanai, Yasuhiro Fujiwara, Tomoharu Iwata, Koh Takeuchi, and Hisashi Kashima
Paper Link - https://proceedings.mlr.press/v119/ida20a.html

parth126 commented 1 month ago

@KavishaMadani There are two proposals from your team. Please close the one that is not relevant.

KavishaMadani commented 1 month ago

Sir, I have closed the proposal titled 'Fashion Product Retrieval Using Semantic Search and Natural Language Generation'.

On Wed, Sep 25, 2024 at 11:07 AM Parth Mehta @.***> wrote:

@KavishaMadani https://github.com/KavishaMadani There are two proposals from your team. Please close the one that is not relevant


parth126 commented 1 month ago

Ok. Marking as approved.