Open Sobhan-202318040 opened 2 days ago
Issues in the proposal:
Try to answer these questions by tomorrow. In the current state, this problem definition is not acceptable.
1. Dataset: No project is valid without a valid dataset: We understand the need for a valid dataset, and we plan to work with a collection of digital PDF documents. This will allow us to evaluate our system’s ability to efficiently extract and retrieve information from PDFs with various structures and contents. We may also consider publicly available datasets from sources like Kaggle or research repositories to further enhance the dataset.
(ii) How is it different from retrieving text documents?: This project is different from standard text retrieval because PDFs often present challenges such as:
(iii) Are there any PDF-specific problems?: Yes, PDFs have unique issues, such as:
(iv) Will the PDFs be scanned or digital?: Our primary focus will be on digital PDFs for this project, ensuring efficient extraction and retrieval. However, if time permits, we plan to incorporate OCR technology to handle scanned PDFs as well, allowing us to broaden the scope and include both types of documents.
3. What approach will be used for matching? Will it be done within Terrier, Elasticsearch, etc.? Or implemented from scratch?:
We will implement the matching process using vector embeddings, converting text into vectors for semantic search (most probably with FAISS). This enables context-aware retrieval from PDFs. We may integrate with Elasticsearch or Pinecone for scalable indexing and faster querying, but the core functionality will rely on LangChain and Google Gemini for semantic understanding.
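To make the matching step concrete, here is a minimal sketch of nearest-neighbour retrieval over text vectors using cosine similarity. It is illustrative only: the `ToyVectorIndex` class, its bag-of-words `embed` method, and the sample chunks are assumptions made for this example. In the actual system the embedding would come from a learned model and the search from a vector index such as FAISS, neither of which is used here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class ToyVectorIndex:
    """Hypothetical in-memory index: a stand-in for an embedding model plus FAISS."""

    def __init__(self, chunks):
        self.chunks = list(chunks)
        # The vocabulary is built from the indexed chunks only.
        vocab = sorted({tok for chunk in self.chunks for tok in tokenize(chunk)})
        self.vocab = {tok: i for i, tok in enumerate(vocab)}
        self.vectors = [self.embed(chunk) for chunk in self.chunks]

    def embed(self, text):
        """Toy embedding (assumption): L2-normalised bag-of-words counts."""
        vec = [0.0] * len(self.vocab)
        for tok, count in Counter(tokenize(text)).items():
            if tok in self.vocab:
                vec[self.vocab[tok]] = float(count)
        norm = math.sqrt(sum(v * v for v in vec))
        return [v / norm for v in vec] if norm else vec

    def search(self, query, k=2):
        """Return the k chunks with the highest cosine similarity to the query."""
        q = self.embed(query)
        # Vectors are unit length, so the dot product equals cosine similarity.
        scored = [
            (sum(a * b for a, b in zip(q, vec)), i)
            for i, vec in enumerate(self.vectors)
        ]
        scored.sort(key=lambda item: (-item[0], item[1]))
        return [(self.chunks[i], score) for score, i in scored[:k]]


# Sample chunks are made up for illustration; real chunks would come from PDF text.
chunks = [
    "Invoices must be paid within 30 days of receipt.",
    "The warranty covers hardware defects for two years.",
    "Refunds are issued to the original payment method.",
]
index = ToyVectorIndex(chunks)
top = index.search("how long is the warranty", k=1)
```

A bag-of-words vector only matches on shared words, so this sketch does not capture meaning the way a learned embedding would. It shows the shape of the pipeline (chunk, embed, index, rank by similarity), not its expected quality.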
@Sobhan-202318040 Please upload a block diagram of the entire end-to-end system for better explanation. You can attach it here. There are several issues with your explanation, and putting the entire pipeline on paper will help you identify those.
Title
Documents Q&A Assistant: Integrating Chatbot Technology for Efficient Information Retrieval using RAG
Team Name
Team Believers
Email
202318040@daiict.ac.in
Team Member 1 Name
Kushal Barot
Team Member 1 Id
202318006
Team Member 2 Name
Aayush Vithalani
Team Member 2 Id
202318023
Team Member 3 Name
Harshil Shah
Team Member 3 Id
202318033
Team Member 4 Name
Sobhan Behuria
Team Member 4 Id
202318040
Category
Optimizing an existing system
Problem Statement
This project aims to improve the retrieval of relevant information from large PDF documents, addressing the limitations of traditional keyword searches by using vector embeddings and advanced AI for precise, context-aware answers.
Evaluation Strategy
The evaluation strategy includes measuring precision and recall for information retrieval accuracy, assessing response relevance and user satisfaction, testing retrieval speed and scalability, evaluating query handling for complexity, and analyzing usability and error rates for system performance.
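As a rough illustration of the precision and recall measurements, the sketch below computes precision@k and recall@k for a single query. The function name, the chunk identifiers, and the relevance judgments are hypothetical. It assumes that a set of relevant chunks has been labelled for each test query, which in turn requires a fixed evaluation set.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Compute precision@k and recall@k for one query.

    retrieved: ranked list of chunk ids returned by the system.
    relevant:  set of chunk ids judged relevant for the query.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = retrieved[:k]
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant)
    precision = hits / k
    # Recall is defined as 0.0 here when no relevant chunks are labelled.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ranking and relevance labels for a single query.
retrieved = ["c3", "c7", "c1", "c9"]
relevant = {"c1", "c3", "c5"}
precision, recall = precision_recall_at_k(retrieved, relevant, k=3)
```

Averaging these two numbers over all test queries gives one summary figure per value of k, which makes different retrieval configurations directly comparable.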
Dataset
The project involves handling diverse and user-specific PDFs, which means evaluation is based on actual system performance with real documents rather than a static dataset. The system’s effectiveness is assessed through its ability to process and respond to various user queries across different types of PDFs, making a fixed dataset insufficient for comprehensive evaluation.
Resources
[1] Petr Baudiš and Jan Šedivý. Modeling of the question answering task in the YodaQA system. In International Conference of the Cross-Language Evaluation Forum for European Languages, pages 222–228. Springer, 2015. URL https://link.springer.com/chapter/10.1007%2F978-3-319-24027-5_20.
[2] Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic Parsing on Freebase from Question-Answer Pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533–1544, Seattle, Washington, USA, October 2013. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D13-1160.
[3] X. Cheng, D. Luo, X. Chen, L. Liu, D. Zhao, and R. Yan. Lift yourself up: Retrieval-augmented text generation with self memory. arXiv preprint arXiv:2305.02437, 2023.
[4] S. Wang, Y. Xu, Y. Fang, Y. Liu, S. Sun, R. Xu, C. Zhu, and M. Zeng. Training data is more valuable than you think: A simple and effective method by retrieving from training data. arXiv preprint arXiv:2203.08773, 2022.